
Twin Ladder Casebook

48% Better, 17% Worse — The Cognitive Paradox in European Education

February 28, 2026 | Peer reviewed



The Hook

A student sits down to practice mathematics. She has access to a generative AI tutor -- a GPT-4-powered system that can answer any question, solve any problem, and explain any concept on demand. Over the course of the semester, she solves forty-eight percent more problems correctly than her classmates who do not have access. Her practice scores are excellent. Her confidence is high. She has never felt more capable.

Then the exam arrives. The AI is not available. She reads the first question and recognizes the format. She begins to solve it and stops. She can recall that the AI produced an answer to a problem like this one. She cannot recall why the answer was correct. She cannot reconstruct the reasoning. She cannot explain the concept behind the solution she submitted three weeks ago with complete confidence.

She scores seventeen percent lower than the students who never had AI assistance at all.

This is not a hypothetical. It is the central finding of a field experiment conducted with nearly one thousand students, published in the Proceedings of the National Academy of Sciences in 2025. Forty-eight percent better at doing. Seventeen percent worse at understanding. The gap between those two numbers is the cognitive paradox of AI in education -- and it is reshaping how European institutions think about what it means to learn.


The Story

The Experiment That Changed the Conversation

In the 2023-2024 academic year, researchers Hamsa Bastani, Osbert Bastani, and Alp Sungu -- affiliated with the Wharton School at the University of Pennsylvania -- deployed two AI tutoring systems among high school mathematics students in Turkey. The study, titled "Generative AI Without Guardrails Can Harm Learning," was designed to test a question that educators across Europe had been asking with increasing urgency: does AI help students learn, or does it help students perform?

The researchers created two conditions. The first, called "GPT Base," gave students access to a standard ChatGPT-style interface during their practice sessions -- they could ask any question and receive a direct answer. The second, called "GPT Tutor," provided a more structured experience, where the AI delivered teacher-designed hints rather than complete solutions. A control group received no AI assistance.

The results during the practice sessions were dramatic. Students using GPT Base solved forty-eight percent more problems correctly than the control group. Students using GPT Tutor performed even better, with a one hundred twenty-seven percent improvement. The AI was, by every immediate metric, making students more productive.

Then the researchers removed the AI and administered an examination. The GPT Base group -- the students who had used AI as a direct answer machine -- scored seventeen percent lower than the control group that had never had AI at all. The students who had appeared most competent during practice proved least competent when the assistance disappeared.

The GPT Tutor group, by contrast, showed no significant decline. The difference was the design of the interaction. When the AI gave answers, it destroyed learning. When the AI gave hints that required students to work through the problem themselves, it preserved it.

A Pattern Across Disciplines

The Bastani study did not emerge in isolation. It arrived at a moment when evidence of AI-induced cognitive decline was accumulating across multiple research programs.

In 2025, a review article published in Frontiers in Psychology by Jose B., Cherian J., Verghis A.M., and colleagues -- titled "The Cognitive Paradox of AI in Education: Between Enhancement and Erosion" -- synthesized the emerging evidence through the lens of cognitive load theory and self-determination theory. The authors identified a structural tension: AI tools reduce extraneous cognitive load (the mental effort spent on irrelevant tasks), which should free students for deeper learning. But in practice, AI also reduces germane cognitive load -- the effortful processing that is itself the mechanism through which deep learning occurs. The tool that removes the obstacles also removes the exercise.

A randomized controlled trial with one hundred twenty undergraduates, published in 2025, tested ChatGPT's impact on long-term knowledge retention. Students were randomly assigned either to use ChatGPT as a study aid or to rely on traditional study methods. A surprise retention test administered forty-five days later found that ChatGPT users scored 57.5 percent correct, compared to 68.5 percent for the traditional group -- a meaningful and persistent deficit attributable solely to the mode of study.

Research examining five hundred eighty Chinese university students found that greater AI dependence was associated with lower levels of critical thinking, with cognitive fatigue partially mediating the relationship. The more students leaned on AI, the less they thought for themselves -- not because they chose not to, but because the cognitive muscles required for critical thinking were not being exercised.

Europe Responds

European institutions have not ignored these findings. The EU AI Act, which entered into force in August 2024, classified AI systems used in education as high-risk under its regulatory framework. Article 4, applicable from February 2025, requires that organizations deploying AI systems ensure "a sufficient level of AI literacy" among staff and users. The Act bans certain AI practices in educational settings outright, including emotion recognition systems and manipulative systems that exploit students' vulnerabilities.

But the regulatory response addresses governance, not pedagogy. The deeper question -- how to design educational AI that builds understanding rather than bypassing it -- remains largely unanswered at the institutional level. The European University Association has acknowledged the challenge: AI is simultaneously the most powerful learning tool and the most potent learning threat that education has encountered in a generation. Faculty across European universities now face a spectrum of responses, from complete prohibition of AI in coursework to mandatory integration with attribution requirements. Neither extreme addresses the paradox. Banning AI prepares students for a world that no longer exists. Unrestricted AI access prepares students for dependency on tools they do not understand.

Professor Lorena Barba of George Washington University captured the dynamic precisely in her analysis of AI use in engineering courses: students consistently prioritize convenience over learning. When given the option to struggle with a problem or receive an instant answer, the overwhelming majority choose the answer. They are not being lazy. They are being rational -- optimizing for the metric they can see (the grade on this assignment) while sacrificing the capacity they cannot see (the understanding that would survive the removal of the tool).

Robert Bjork, the UCLA cognitive psychologist whose research on learning has shaped the field for four decades, would recognize this immediately. The students are experiencing what Bjork calls the "illusion of competence" -- the subjective feeling of mastery that arises from high retrieval strength, even when storage strength (durable, transferable knowledge) has not been built. The answer came easily. The ease felt like knowing. The knowing was an illusion.


Through the Twin Ladder Lens

The Twin Ladder framework, as described in Twin Ladder's "Competence Paradox" white paper, defines four progressive levels of AI competence: Level 0 (AI Literacy), Level 1 (Professional Twin), Level 2 (Operational Twin), and Level 3 (Ecosystem Twin). The ladder is climbed, not skipped. Each level builds the human capacity that makes the next level sustainable.

The cognitive paradox in education is, at its core, a Level 0 failure. AI Literacy -- the foundational ability to critically evaluate what AI produces and, crucially, what it does not -- means recognizing that a correct answer is not the same as understanding. It means grasping that when an AI solves a problem for you, the AI has learned nothing and you have learned nothing. The computation occurred. The cognition did not.

The "illusion of competence" is the precise opposite of AI Literacy. It is the belief that you understand something because AI produced the right answer in your presence. The student who scores forty-eight percent better during AI-assisted practice and seventeen percent worse on the unassisted exam has not achieved AI Literacy. She has achieved AI dependency -- the condition in which the tool's competence is mistaken for one's own.

The Twin Ladder's response to this problem is embedded in its Learning Exercise architecture, which draws directly on Bjork's research program on desirable difficulties. The architecture rests on four principles that are specifically designed to prevent the illusion of competence from forming.

First, the prediction-first interface. Before any AI recommendation is displayed, the learner must commit to their own assessment. What do they think the answer is? How confident are they? The prediction is locked. Only then does the AI output appear. The learning occurs in the gap between what the learner predicted and what the AI produced. This is the generation effect -- one of the most replicated findings in cognitive science -- operationalized as a design principle. Even incorrect predictions enhance subsequent learning, because they create a cognitive structure to which the correct answer can attach.
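
The prediction-lock sequence can be sketched in a few lines of Python. This is an illustrative sketch of the principle, not Twin Ladder's implementation; the class and method names (PredictionFirstSession, commit, reveal) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PredictionFirstSession:
    """Sketch of a prediction-first interface: the AI output stays
    hidden until the learner commits an answer and a confidence level,
    and a committed prediction cannot be revised afterward."""
    ai_answer: str
    _prediction: Optional[str] = field(default=None, init=False)
    _confidence: Optional[float] = field(default=None, init=False)

    def commit(self, prediction: str, confidence: float) -> None:
        """Lock in the learner's own assessment before any reveal."""
        if self._prediction is not None:
            raise RuntimeError("prediction is locked and cannot be revised")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self._prediction = prediction
        self._confidence = confidence

    def reveal(self) -> dict:
        """Show the AI output only after a prediction exists; the
        learning signal is the gap between prediction and output."""
        if self._prediction is None:
            raise RuntimeError("commit a prediction before viewing the AI output")
        return {
            "prediction": self._prediction,
            "confidence": self._confidence,
            "ai_answer": self.ai_answer,
            "matched": self._prediction == self.ai_answer,
        }
```

The essential design choice is that reveal refuses to run before commit, and commit refuses to run twice: the order of operations, not the content, is what makes the generation effect possible.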

Second, interleaved scenarios. The Twin Ladder does not organize learning into neat, sequential modules. It mixes problem types unpredictably within each session, forcing the learner to identify which type of problem they are facing before selecting a strategy. Bjork's research demonstrates that interleaved practice produces sixty-three percent retention on delayed assessments, compared to twenty percent for blocked practice. The blocked approach feels more effective during the session. It is dramatically less effective afterward.
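
The contrast between blocked and interleaved ordering can be made concrete with a small scheduling sketch. The round-robin mixing strategy here is one simple way to interleave and is an assumption on our part, not the Twin Ladder's actual scheduler.

```python
from itertools import zip_longest

def blocked_schedule(problems_by_type: dict) -> list:
    """Blocked practice: every problem of one type, then the next type.
    Feels easier in the session; retains far worse afterward."""
    return [p for problems in problems_by_type.values() for p in problems]

def interleaved_schedule(problems_by_type: dict) -> list:
    """Interleaved practice: round-robin across types, so the learner
    must identify each problem's type before choosing a strategy."""
    rounds = zip_longest(*problems_by_type.values())
    return [p for rnd in rounds for p in rnd if p is not None]
```

With two problem types, blocked ordering yields all of type A then all of type B, while the interleaved schedule alternates A, B, A, B -- the unpredictability is the point.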

Third, spaced challenge cycles. Rather than intensive workshops, the architecture distributes learning across weekly sessions with deliberate gaps. The forgetting that occurs between sessions is not a flaw -- it is the mechanism. Each session opens with retrieval of material from previous sessions, forcing the brain to reconstruct partially faded knowledge. That reconstruction effort is where durable learning happens.
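
A minimal sketch of such a cycle: weekly sessions, each opening with retrieval of everything introduced so far. The function name and the one-topic-per-session structure are illustrative assumptions, not the framework's actual scheduler.

```python
from datetime import date, timedelta

def spaced_plan(start: date, topics: list, gap_days: int = 7) -> list:
    """One new topic per session, sessions separated by deliberate gaps.
    Each session opens by retrieving all previously introduced topics,
    forcing reconstruction of partially faded knowledge."""
    plan = []
    for i, topic in enumerate(topics):
        plan.append({
            "date": start + timedelta(days=i * gap_days),
            # Retrieval practice comes first: the forgetting between
            # sessions is the mechanism, not a flaw.
            "retrieve_first": topics[:i],
            "new_material": topic,
        })
    return plan
```

Note that the retrieval list grows with every session: by week three the learner reopens with two reconstructions before touching anything new, which is exactly the effort a two-day immersion workshop never demands.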

Fourth, performance without the net. At regular intervals, the AI layer is disabled entirely. The learner must operate with the same data but without the AI's analysis. This is the true competence measure. It is also the direct answer to the Bastani study's findings: if students had been required to perform without AI periodically throughout the semester, the seventeen percent decline on the exam would have been detected and addressed long before the final assessment.
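
The early-warning logic implied here is simple enough to state as code: compare assisted practice scores against periodic no-AI checkpoint scores and flag a widening gap. The function names and the review threshold are hypothetical choices for illustration.

```python
def competence_gap(assisted_scores: list, unassisted_scores: list) -> float:
    """Relative drop from AI-assisted practice performance to no-AI
    checkpoint performance. A large positive gap is the signal the
    Bastani exam delivered only at the end of the semester."""
    assisted = sum(assisted_scores) / len(assisted_scores)
    unassisted = sum(unassisted_scores) / len(unassisted_scores)
    return (assisted - unassisted) / assisted

def flags_dependency(assisted_scores: list, unassisted_scores: list,
                     threshold: float = 0.15) -> bool:
    """True when the gap exceeds a review threshold (value illustrative)."""
    return competence_gap(assisted_scores, unassisted_scores) >= threshold
```

A learner scoring 0.9 with the AI and 0.45 without it shows a 50 percent relative gap -- detectable at the first checkpoint rather than at the final exam.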


The Pattern

The education findings do not exist in isolation. They are one expression of a pattern that appears wherever AI removes the effortful practice through which human competence forms.

In software development, empirical evaluation of development teams found that less-experienced programmers demonstrated twenty-eight percent lower performance in algorithmic problem-solving when tested without AI support following six months of continuous GitHub Copilot use. The developers had not stopped writing code. They had stopped thinking about code. The Copilot handled the algorithmic reasoning. The developers handled the prompting. When the Copilot was removed, the reasoning capacity that had atrophied was precisely the capacity required.

At the societal level, the OECD's Survey of Adult Skills -- measuring approximately one hundred sixty thousand adults aged sixteen to sixty-five across thirty-one countries -- found that literacy and numeracy skills have declined or stagnated in most OECD countries between 2012 and 2023. Only Finland and Denmark showed significant improvements in adult literacy. In most countries, the lowest-performing ten percent of the population experienced the steepest decline. The causes are complex and predate generative AI, but the trajectory is clear: foundational cognitive skills are eroding in advanced economies at exactly the moment when AI tools are assuming more of the cognitive work those skills represent.

Lisanne Bainbridge identified this dynamic in 1983 in her landmark paper "Ironies of Automation," which has accumulated over 4,700 citations. Bainbridge observed that automating a process creates a paradox: the human operator, relieved of routine practice, loses the skills required to intervene when the automation fails. The formerly experienced operator becomes, through disuse, an inexperienced one. Efficient retrieval of knowledge from long-term memory depends on frequency of use, Bainbridge wrote. The less you practice, the less you can perform -- and the less you realize how much you have lost.

The pattern, across all of these domains, follows a consistent sequence: ease of use leads to disengagement, disengagement leads to skill decay, skill decay leads to dependency, and dependency makes the human unable to function when the tool is unavailable. The forty-eight percent improvement and the seventeen percent decline are not contradictory findings. They are the same finding, measured at two different moments in the cycle.


The Lesson

The lesson from the cognitive paradox is not that AI should be excluded from education or professional training. The Bastani study itself demonstrates why: the GPT Tutor condition -- where AI provided structured hints rather than direct answers -- produced significant performance gains with no measurable decline in independent competence. The problem is not the technology. The problem is the design.

Education and corporate training must incorporate what Robert Bjork calls "desirable difficulties" -- learning conditions that feel harder in the moment but produce dramatically better long-term retention. This means requiring students and professionals to generate their own answers before seeing the AI's output. It means interleaving problem types rather than organizing them into comfortable, sequential modules. It means spacing practice across time rather than concentrating it into intensive workshops. It means periodically removing the AI entirely and measuring whether the human can perform without it.

These principles are counterintuitive. They produce worse short-term metrics. Learners report lower confidence during interleaved practice even though their retention is three times higher. Students dislike being asked to predict before seeing the answer even though the prediction is what makes the answer meaningful. Organizations resist spaced training because it is logistically harder to schedule than a two-day immersion. Every design instinct in modern education and corporate training pushes toward ease, speed, and satisfaction scores. The cognitive science pushes in the opposite direction.

The Twin Ladder's Learning Exercise framework is built on this science. It treats difficulty not as an obstacle to learning but as the design itself. AI should challenge human cognition, not bypass it. The tool that makes a learner feel most competent in the moment may be the tool that leaves them least competent when it matters.

The organizations and institutions that understand this distinction -- that build AI systems designed to strengthen the human, not merely to perform for the human -- will produce graduates and professionals who are genuinely competent. The rest will produce people who are forty-eight percent faster and seventeen percent more fragile.


Monday Morning Question: If you removed AI access from your team or your students for one week, would their performance reveal competence -- or would it reveal how much competence the AI has been quietly replacing?


Sources

  1. Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakci, O., & Mariman, R. (2025). "Generative AI Without Guardrails Can Harm Learning: Evidence from High School Mathematics." Proceedings of the National Academy of Sciences, 122(26), e2422633122. https://www.pnas.org/doi/10.1073/pnas.2422633122

  2. Jose, B., Cherian, J., Verghis, A.M., et al. (2025). "The Cognitive Paradox of AI in Education: Between Enhancement and Erosion." Frontiers in Psychology, 16, 1550621. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2025.1550621/full

  3. OECD (2024). "Adult Skills in Literacy and Numeracy Declining or Stagnating in Most OECD Countries." Survey of Adult Skills (PIAAC), December 2024. https://www.oecd.org/en/about/news/press-releases/2024/12/adult-skills-in-literacy-and-numeracy-declining-or-stagnating-in-most-oecd-countries.html

  4. Bainbridge, L. (1983). "Ironies of Automation." Automatica, 19(6), 775-779. https://www.sciencedirect.com/science/article/abs/pii/0005109883900468

  5. Bjork, R.A. & Bjork, E.L. (2020). "Desirable Difficulties in Theory and Practice." Journal of Applied Research in Memory and Cognition, 9(4), 475-479. https://bjorklab.psych.ucla.edu/wp-content/uploads/sites/13/2016/04/EBjork_RBjork_2011.pdf

  6. European Union (2024). EU AI Act -- Regulatory Framework for Artificial Intelligence. Article 4: AI Literacy. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

  7. Fernandes, D., et al. (2025). "AI Makes You Smarter But None the Wiser: The Disconnect Between Performance and Metacognition." Cited in RealKM, November 2025. https://realkm.com/2025/11/19/ai-is-changing-the-dunning-kruger-effect-with-higher-ai-literacy-correlating-with-overestimation-of-competence/