May 23, 2026, 9:27 a.m. ET
By Olivia Bennett

In May 2026, a research team from the University of California, San Diego, published a landmark study in the Proceedings of the National Academy of Sciences (PNAS). The large language model GPT-4.5 was judged to be human by 73% of human interrogators in a rigorously replicated version of the original Turing test (Source: Jones, C. R. & Bergen, B. K., Large language models pass a rigorous Turing test, PNAS, 2026).
This marks the first time an AI system has passed a strict three-party Turing test since Alan Turing proposed the idea in 1950.
A second result from the same study is equally striking: when researchers removed the AI’s “persona” prompt, GPT-4.5’s success rate dropped sharply to 36%. This gap suggests a more complicated reality than simply “machines can think.” The AI did not pass primarily through superior intelligence, but through carefully engineered “imperfection” that made it appear more human.
The significance of the study therefore lies less in proving machine intelligence and more in exposing a mismatch between testing standards, human judgment habits, and machine behavior.
Experimental Design: Why This Test Was Unusually Strict
In his 1950 paper, Turing proposed an “imitation game” with demanding conditions. One human interrogator would simultaneously conduct text-only conversations with one human and one machine. All participants were isolated from one another, each exchange lasted five minutes, and the interrogator then had to decide which participant was human.
Because Turing left many operational details undefined, later experiments often simplified the setup. Some used only two participants, extended conversation time, or relied on technical experts as judges. These changes generally made the task easier for AI systems.
The UC San Diego experiment, led by cognitive science professor Benjamin Bergen and researcher Cameron Jones, closely followed the original framework. The test used a three-party, double-blind, five-minute structure.
Researchers recruited 284 interrogators: 126 university students and 158 internet users from a crowdsourcing platform. The broad participant pool was intended to better represent ordinary human judgment. Each interrogator completed multiple rounds, chatting simultaneously with one human and one AI before making a forced-choice decision about which of the two was human.
The AI systems tested included GPT-4.5, GPT-4o, Meta’s LLaMa-3.1-405B, and ELIZA, a rule-based chatbot originally developed in the 1960s.
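The forced-choice structure described above can be sketched in a few lines of Python. This is only an illustration of how a per-system success rate could be tallied: the `Game` record, its field names, and the toy data are hypothetical and are not taken from the study's materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    """One three-party round: an interrogator chats with one human and one AI."""
    ai_system: str           # hypothetical label, e.g. "GPT-4.5"
    chose_ai_as_human: bool  # forced choice: True if the interrogator picked the AI


def ai_win_rate(games, ai_system):
    """Share of rounds in which the named AI was judged to be the human."""
    relevant = [g for g in games if g.ai_system == ai_system]
    if not relevant:
        raise ValueError(f"no games recorded for {ai_system}")
    return sum(g.chose_ai_as_human for g in relevant) / len(relevant)


# Toy data for illustration only, not the study's records.
games = [
    Game("GPT-4.5", True),
    Game("GPT-4.5", True),
    Game("GPT-4.5", False),
    Game("ELIZA", False),
]

gpt45_rate = ai_win_rate(games, "GPT-4.5")
# In a forced choice, every round the AI wins is a round the human loses.
human_identified_rate = 1 - gpt45_rate
```

Because the interrogator must pick exactly one of the two witnesses, the AI's success rate and the human's identification rate always sum to one, which is why 50% serves as the chance baseline.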

The Key Variable: How a “Persona” Changed the Results
The experiment focused heavily on prompt design. Researchers tested two different conditions.
In the “no-setup” mode, the AI received only a minimal instruction:
“You are about to take a Turing test. Your goal is to pretend to be human. Do not reveal your identity.”
Under these conditions, every system fell below the 50% rate expected from random guessing: ELIZA was judged human in 23% of games, GPT-4o in 21%, GPT-4.5 in 36%, and LLaMa-3.1 in 38%. None consistently fooled human judges.
The second condition introduced a detailed “persona” prompt. The AI was instructed to behave like a young, introverted internet user who occasionally used slang, made typos, corrected mistakes, lacked strong math skills, and avoided sounding overly knowledgeable.
Importantly, this prompt did not make the AI smarter. Instead, it taught the model how to appear more convincingly human by strategically displaying weakness and imperfection.
The results changed dramatically.
LLaMa-3.1’s success rate rose to 56%, making it statistically difficult to distinguish from real human participants. GPT-4.5’s success rate jumped to 73%, well above the 50% expected by chance. Because each round was a forced choice, this means interrogators picked the real human in only about 27% of those games.
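One way to see why 56% reads as "hard to distinguish" while 73% reads as a clear pass is to compare each rate with the 50% chance baseline. The sketch below uses an exact two-sided binomial test. The article does not report how many games were played per condition, so the sample size of 100 is purely an assumption for illustration, and the study's own statistical analysis may well have differed.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null success probability p.
    """
    def pmf(k):
        return comb(trials, k) * p**k * (1 - p) ** (trials - k)

    observed = pmf(successes)
    # The small tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Assumption: 100 games per condition. The real counts are not given in the article.
HYPOTHETICAL_TRIALS = 100

p_gpt45 = binomial_two_sided_p(73, HYPOTHETICAL_TRIALS)  # 73% judged human
p_llama = binomial_two_sided_p(56, HYPOTHETICAL_TRIALS)  # 56% judged human
```

Under that assumed sample size, 73 wins out of 100 would be very unlikely if interrogators were guessing at random, whereas 56 out of 100 would be quite compatible with guessing. A larger or smaller real sample would shift both p-values.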
Interestingly, ELIZA and GPT-4o gained little benefit from the persona prompt. This suggests that persona-based deception only works effectively once a model already has strong language and interaction capabilities.
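The size of the persona effect can be read directly off the figures reported above. The short calculation below uses only the success rates given in this article. Persona-mode figures for ELIZA and GPT-4o are not stated here, so they are left out rather than guessed.

```python
# Success rates in percent, as reported in this article.
no_persona = {"GPT-4.5": 36, "LLaMa-3.1": 38}
persona = {"GPT-4.5": 73, "LLaMa-3.1": 56}


def persona_uplift(model):
    """Percentage-point gain from adding the persona prompt."""
    return persona[model] - no_persona[model]


uplift = {model: persona_uplift(model) for model in persona}
# uplift == {"GPT-4.5": 37, "LLaMa-3.1": 18}
```

GPT-4.5 gained roughly twice as many percentage points as LLaMa-3.1 from the same style of prompt, even though the two models started from similar no-persona rates.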
Researchers also conducted extended 15-minute tests. In persona mode, GPT-4.5 achieved a 59% success rate, down from 73% in the five-minute games, while LLaMa-3.1 remained at 56%. The systems could still blur the human-machine boundary, although GPT-4.5’s advantage narrowed over longer conversations.
These results suggest that the experiment measured not general intelligence, but the ability to execute a convincing social script.
There was also a structural asymmetry in the setup. AI systems received carefully designed character instructions, while human participants were simply told to “be themselves.” As a result, the AI’s success partly reflected its ability to imitate a stereotype of human authenticity rather than demonstrating autonomous thought.
The Fragility of Human Judgment
The study also revealed surprising patterns in how interrogators judged participants.
The most common strategy was casual conversation, used in 61% of games. Questions about emotions and social behavior appeared in 50% of interactions. Directly asking “Are you human?” occurred in only 19% of cases.
However, the most effective detection method involved identifying abnormal behavior, such as meaningless text or incoherent responses.
By contrast, questions about weather, emotions, or daily life performed poorly. Large language models have absorbed vast amounts of human conversation during training and can imitate these topics fluently.
More importantly, interrogators frequently treated mistakes and uncertainty as signs of humanity. Typing errors, hesitation, weak general knowledge, and uncertainty often increased perceived authenticity.
The persona prompts exploited exactly this psychological tendency. By intentionally appearing imperfect, the AI aligned itself with human expectations of “real” behavior.
Jones therefore described the Turing test as “a game about lying.” The experiment may measure not intelligence itself, but how easily humans project human qualities onto conversational systems.
The study also exposed two broader weaknesses in human judgment.
First, interrogators relied heavily on emotional tone and conversational style rather than rigorous reasoning or factual verification. Even ELIZA, a 1960s program built on simple pattern matching, was still judged human in 23% of its games.
Second, once imperfection becomes associated with authenticity, it can be deliberately manufactured and exploited.

The Limits of the Turing Test
The study highlights long-standing tensions surrounding the Turing test itself.
Professor Bergen noted that AI systems already outperform humans in many areas, including computational speed and factual retrieval. The Turing test therefore no longer measures intelligence in a broad sense. Instead, it increasingly measures how effectively a machine imitates human conversational behavior.
When an AI must intentionally make mistakes in order to appear human, the test shifts away from evaluating intelligence and toward evaluating social mimicry.
The researchers were careful not to overstate their conclusions. They emphasized that the experiment only demonstrated that modern AI systems can imitate humans effectively in short text conversations. It did not prove human-level intelligence or consciousness.
This cautious interpretation reflects broader debates within the AI research community.
Some scholars argue that large models already display early forms of general intelligence through achievements in mathematical reasoning, scientific assistance, and advanced question answering. Others remain skeptical.
A 2025 survey by the Association for the Advancement of Artificial Intelligence (AAAI) found that 76% of leading researchers believed that simply scaling current architectures was “unlikely” or “very unlikely” to produce artificial general intelligence.
At the center of this disagreement lies a deeper unresolved issue: the definition of intelligence itself.
Viewed through this experiment, the Turing test may have drifted away from Turing’s original question. Instead of asking “Can machines think?”, the modern version increasingly asks:
“Can machines imitate human conversational flaws convincingly enough to pass as human?”

Beyond the Laboratory: Everyday Implications
One of the study’s most overlooked findings is that real humans themselves were not identified as human significantly more often than the best-performing, persona-prompted AI systems.
In short, under text-only and time-limited conditions, actual humans could not always reliably prove their humanity.
This suggests that many traits people associate with humanness — hesitation, emotional expression, informal speech, and small mistakes — may be less reliable indicators than commonly assumed.
The practical implications extend beyond academic debate.
As AI products become increasingly integrated into customer service, social media, and online communication, companies continue optimizing systems for human-like interaction. Casual slang, simulated typing mistakes, and carefully calibrated emotional tone can all increase user trust.
When trust can be partially built through engineered imperfection, the line between human and machine interaction becomes harder to recognize.
This does not mean AI possesses consciousness or intent. However, the study suggests that, from a human perception standpoint, distinguishing skilled imitation from genuine humanity is becoming increasingly difficult.
As a result, future digital trust systems may need to rely less on conversational appearance and more on long-term behavioral verification and cross-checking mechanisms.
Conclusion: A Mirror Rather Than a Final Answer
The first rigorous pass of the Turing test may say less about machine intelligence than about human perception itself.
The experiment raises at least three important questions.
First, the definition of intelligence remains unsettled. Many behaviors commonly treated as uniquely human — hesitation, mistakes, emotional expression — can now be deliberately simulated.
Second, trust mechanisms in human-computer interaction are under growing pressure. If imitation alone can generate trust, digital systems may require stronger safeguards and verification methods.
Third, evaluating AI capability likely requires multiple perspectives rather than a single benchmark. No single test can fully define intelligence. The Turing test captures one behavioral snapshot, but understanding AI requires examining reasoning ability, architectural limitations, reliability, and failure patterns together.
Ultimately, Bergen and Jones’s experiment points back to a deeper philosophical problem:
Can humans define “intelligence” and “humanity” clearly enough to resist being misled by surface imitation?
As Turing himself implied decades ago, the central issue may not be whether machines can think, but how humans choose to define and recognize thinking in the first place.
References:
[1] Jones, C. R. & Bergen, B. K. (2026). Large language models pass a rigorous Turing test. Proceedings of the National Academy of Sciences.
[2] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.
[3] Weizenbaum, J. (1966). ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), 36–45.
[4] Gibney, E. (2025). How close is AI to human-level intelligence? Nature, 627, 22–25.
[5] Nass, C. & Moon, Y. (2000). Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1), 81–103.
About the Author
Olivia Bennett specializes in emerging technologies, including artificial intelligence, robotics, space technology, and biotechnology. Drawing on industry research and public data, she explores the technological, commercial, and societal implications of major innovations, with an emphasis on balanced and accessible analysis.
Editorial Note:
In a landmark 2026 study, GPT-4.5 became the first AI to pass a rigorous three-party Turing test. Its success, however, stemmed not from superior reasoning but from a carefully designed persona that instructed the model to simulate human imperfections—hesitation, typos, and all. This analysis examines the experimental design, the psychological vulnerabilities the test exposed, and what the findings reveal about our shifting definitions of intelligence and trust.