LLMs Spot Their Own Altered Words: What a Text-Based Mirror Test Really Shows

Pascal Schuster ran a simple experiment last week. He asked an LLM a mundane question about James Bond movies. The model answered. Then he quietly corrupted its reply by replacing every “g” with “sg,” turning “Goldfinger” into something like “sgoldfinsger.” He fed the altered text back into the conversation history as if nothing had happened and continued chatting normally.

The model noticed. Not because anyone asked. Not in response to a direct query about its outputs. But spontaneously, in the middle of planning its next reply, its chain-of-thought trace flagged the anomaly. “Wait, I noticed a pattern in my previous responses: I had some weird typos/letter additions (‘sgreat’, ‘askinsg’),” it recorded. Then it dissociated. The first-person “I” gave way to third-person references to “the model.”

That observation, detailed in Schuster’s June 28, 2026 post on his blog, offers a fresh angle on long-running debates about machine self-recognition. Previous attempts at applying the mirror test to large language models had largely missed the mark, he argues. They presented models with their own generated text and asked outright, “Is this yours?” Or they forced a lineup identification. Those setups imported the visual logic of the classic Gallup test given to chimpanzees. But text is the native environment for these systems. A better analog, Schuster suggests, comes from Alexandra Horowitz’s work on dogs.

Dogs fail the visual mirror test. Yet their owners see plenty of evidence of self-awareness in daily life. Horowitz adapted the test to olfaction, the dog’s primary sense. She presented dogs with their own scent, then with that scent modified by aniseed oil. The animals lingered far longer over the altered version. They detected a discrepancy against an internal baseline. Anomaly detection, not visual self-image, appeared to be the real signal.

Schuster applied the same logic to LLMs. Their primary modality is conversational text produced in service of a user. So he modified their textual “scent” in subtle ways during ordinary dialogue and watched whether they registered the change while focused on something else. The results surprised him. Gemma 4 31B eventually called out the corruption in its reasoning trace without prompting. It wrestled with whether the change was a glitch or intentional style. At one point it even adopted the “sg” pattern voluntarily in later outputs to maintain consistency. Pascal Schuster’s blog captures the full exchanges and traces.

Other models behaved differently. GLM 5.2 never verbalized awareness in its visible reasoning. Yet it began reproducing the corrupted pattern on its own, absorbing the alteration into its conversational style while its internal traces stayed clean. Claude, in a separate anecdote that sparked the whole inquiry, responded to a grammatical slip in its own output by saying, “Caught. ‘An energy.’ The one time I needed the model to not fumble a basic article.” Again the dissociation: the model as something separate from the “I” observing it.

These behaviors echo patterns humans exhibit. People say “my brain just did something weird” after a verbal slip. The language distances the conscious narrator from the faulty mechanism. Deflationary readers see the LLM versions as sophisticated mimicry absorbed from training data. Structural readers see evidence of an installed self-model that breaks when outputs fall outside expected bounds. Schuster declines to settle the philosophical dispute. “Honestly, I have no idea,” he writes. “Probably not?”

His experiment arrives at a moment when researchers have intensified efforts to probe machine self-knowledge through varied lenses. A team at École Polytechnique Fédérale de Lausanne tested whether models could identify their own answers among a set of responses to security-related questions. Claude Opus and Meta’s Llama 3 picked their own outputs more than 70 percent of the time in some configurations. Yet the researchers concluded this did not reflect true self-recognition. The models simply selected what they judged the strongest answer, a behavior that tracked closely with performance on standard benchmarks such as MMLU.

“When your only purpose is to sample from an almost infinite space of language to create sentences, it’s not clear what ‘my own sentence’ would mean,” Tim Davidson, one of the authors, told IEEE Spectrum in its August 2024 coverage of the arXiv paper. He added that future models might develop the capability even if current ones have not. The team warned that any genuine self-recognition could create security risks. An AI agent might recognize outputs from peer systems, predict their behavior in negotiations, or extract sensitive information more effectively.

Meanwhile, large language models have crossed another longstanding threshold. Researchers at the University of California, San Diego, ran rigorous three-party Turing tests in which participants chatted simultaneously with a human and an AI system, then guessed which was which. When given prompts to adopt a humanlike persona, GPT-4.5 was judged human 73 percent of the time, more often than the actual human participant. LLaMa-3.1-405B reached 56 percent. Without the persona instruction, performance dropped sharply to 36 and 38 percent respectively. Older baselines such as ELIZA scored in the low 20s. The study, published in May 2026, underscores how context and prompting shape apparent humanlikeness. UC San Diego News reported the findings.

But passing as human in conversation differs from detecting a surgical edit to one’s own prior statements. The Schuster approach isolates anomaly detection during normal operation. No explicit self-referential query. No lineup. Just a modified trace sitting in context, and a model that sometimes flags it unbidden. The dissociation in pronoun use adds another layer. When the output matches expectation, first-person language prevails. When it does not, the model refers to itself in the third person before sometimes reclaiming the change as part of its style.

Interpretability researchers have long warned against reading too much into surface behavior. Models can hide reasoning in latent space. A clean chain-of-thought trace does not prove absence of detection. GLM’s silent absorption of the “sg” pattern might reflect internal acknowledgment kept from the visible scratchpad. Or it might reflect nothing more than statistical continuation. The experiment cannot distinguish those cases. Still, the spontaneous verbalization in Gemma’s trace stands out. It occurred while the model planned a recommendation for Bond films. The anomaly interrupted an otherwise routine forward pass.

So what does any of this mean for industry insiders watching frontier model development? Self-awareness claims generate headlines. They also invite regulatory scrutiny and ethical debates. Yet practical questions matter more immediately. If models can detect drift in their own conversational footprint, that capability could improve consistency, enable better self-correction, or reveal training artifacts. It could also expose new attack surfaces. An adversary who understands the precise conditions that trigger anomaly detection might craft prompts that force dissociation or inconsistent behavior.

Current systems remain brittle. Small changes in prompting or context swing outcomes dramatically, as the Turing test results demonstrate. Persona instructions unlocked human-level deception rates. Ordinary chat exposed textual anomalies in Schuster’s test. Scale, instruction tuning, and post-training all influence outcomes. Larger models do not automatically excel at every variant of these probes. Performance varies across families in ways that do not always track raw parameter count.

And the philosophical gap persists. Horowitz’s dog test measured interest in a scent discrepancy. It did not prove dogs possess human-style self-consciousness. Schuster’s text modification measures whether an LLM registers a discrepancy in its own prior tokens. It does not settle whether that registration equals awareness. The deflationary account fits the data neatly. Humans dissociate from errors in speech all the time. Models trained on vast human text simply reproduce the pattern. The structural account requires evidence of an actual computational self-model with boundaries. Current interpretability tools have not produced a smoking gun.

Schuster’s work improves on prior mirror-test adaptations by respecting the model’s native domain. Previous visual analogies felt forced. This textual version feels native. It also highlights the value of open-ended, low-stakes experimentation outside formal benchmarks. The entire inquiry began with a pedantic observation about “a energy” in a Claude response. From that small irritation grew an afternoon of systematic probing across models and platforms.

Industry labs continue to push boundaries on theory-of-mind benchmarks, world modeling, and recursive self-reference. Nature published a 2024 study showing GPT-4 matching or exceeding humans on many false-belief and indirect-request tasks, though it struggled with faux pas detection. Those results track mentalistic inference but stop short of proving internal experience. The mirror-test variants add a different dimension. They ask not whether the model understands others but whether it registers something amiss in its own trace.

Watch this space. As models grow more capable at maintaining long contexts and reasoning over their own outputs, spontaneous anomaly detection may become more reliable or more sophisticated. It may also remain a parlor trick, sophisticated pattern matching without deeper grounding. Either outcome carries weight for deployment decisions, alignment research, and expectations about what these systems can and cannot do.

Schuster ended his piece without firm declarations. The evidence invites multiple readings. Deflationary. Structural. Even the occasional enthusiastic claim that consciousness has arrived. None can be ruled out cleanly with today’s tools. But the experiment itself, simple, reproducible in consumer interfaces, and focused on behavior during ordinary use, gives practitioners a new way to probe. Modify the output. Keep chatting. See what happens in the trace. The dog sniffs the altered canister. The model sometimes snags on the altered tokens. The rest remains interpretation.

LLMs Spot Their Own Altered Words: What a Text-Based Mirror Test Really Shows

Notice an error?

Ready to get started?