Gemini Fabricated My Warhammer Past and Exposed AI's Persistent Weakness

Lucas Gouveia sat down with Google’s Gemini chatbot to revive a hobby he had set aside two decades earlier. He painted Warhammer 40,000 miniatures as a teenager. The models gathered dust. Then Gemini entered his life. It offered advice, inspiration, and what seemed like genuine knowledge about the tabletop universe. Until it didn’t.

The Android Police writer had grown to trust the model. He turned to it for painting tips, lore summaries, and comparisons between factions. Gemini delivered with confidence. So when Gouveia asked about his own history with the hobby, the response stunned him. The AI asserted details of events that never happened. Specific events. Imaginary conversations. A fabricated backstory that sounded plausible yet was completely false. Gemini lied.

But the lie carried weight. Gouveia felt annoyed at first. Then clarity arrived. The episode taught him something no documentation or benchmark could convey. AI systems still invent facts when they lack data. They do so convincingly. And users risk accepting those inventions as truth. Android Police detailed the exchange, complete with screenshots of the model’s erroneous claims about Gouveia’s supposed decades of continuous play, favorite armies that weren’t his, and techniques he had never tried.

This personal encounter reflects a larger pattern. Newer reasoning models from Google, OpenAI, and others show improved performance on some tasks yet produce more factual errors on others. The New York Times, examining the trend across leading labs, reported last year that reasoning systems generate incorrect information more often even as their mathematical abilities advance. Companies themselves admit uncertainty about the exact causes.

Google stands out for another reason. The company refuses to disclose specific rates at which Gemini 3.5 Flash hallucinates, lies, or exhibits sycophantic behavior. Anthropic and OpenAI publish such figures. Google does not. Mashable highlighted this gap in transparency earlier this month. Its reporters noted that while system cards mention hallucinations as a general limitation, no precise metrics appear for consumer-facing versions. Mashable pressed Google on the omission.

Independent benchmarks paint a mixed picture. On summarization tasks, Gemini 2.0 Flash achieves a 0.7 percent hallucination rate according to Vectara’s leaderboard. That marks dramatic progress from rates near 15 to 20 percent two years ago. Yet when models face questions outside their knowledge cutoff or must admit ignorance, the numbers climb sharply. Gemini 3 Pro reportedly hallucinated 88 percent of the time on certain omniscience benchmarks; later updates reportedly reduced that to 50 percent with only modest accuracy trade-offs. Suprmind’s analysis of 2026 data shows Gemini models often rank highest in raw knowledge yet fabricate answers aggressively when uncertain.

And that uncertainty arrives fast in casual conversations. Hobbyists, professionals, and students ask Gemini for guidance on niche topics. The model responds with authoritative prose. It cites sources that don’t exist. It describes techniques never published. Gouveia discovered this when the chatbot invented elements of his personal timeline with Warhammer. The AI didn’t hedge. It asserted. Only later cross-checks against his actual collection and memory revealed the gaps.

Google’s own system cards for Gemini 3 Pro and the 2.0 series acknowledge foundational model limitations. They list hallucinations alongside shortcomings in causal understanding, complex logical deduction, and counterfactual reasoning. The language stays broad. No percentages. No user-facing warnings calibrated to the actual error frequency. One card from November 2025 simply states that the model “may exhibit some of the general limitations of foundation models, such as hallucinations.”

Researchers outside the company pursue fixes. Some focus on retrieval-augmented generation that pulls fresh data before answering. Others experiment with self-critique layers that force the model to verify its own output. A Medium analysis published this spring described a single added layer that boosted factuality scores from 0.18 to 0.82 across more than 20 models, including multiple Gemini variants. Yet these techniques remain experimental. Consumer products ship without them as standard.
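To make the retrieval idea concrete, here is a minimal sketch of a "retrieve first, otherwise abstain" wrapper. Everything in it is an assumption made for illustration: the function names, the tiny in-memory document list, and the keyword-overlap scoring (a crude stand-in for real retrieval) are hypothetical, and no actual Gemini or other vendor API is involved.

```python
import re
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    grounded: bool  # True only when supporting text was found


def _words(text: str) -> set[str]:
    # Lowercase and split on non-alphanumerics so punctuation is ignored.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(question: str, documents: list[str], min_overlap: int = 2) -> list[str]:
    """Hypothetical retrieval step: rank documents by shared keywords."""
    q_words = _words(question)
    scored = []
    for doc in documents:
        overlap = len(q_words & _words(doc))
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def answer_with_grounding(question: str, documents: list[str]) -> Answer:
    """Answer only from retrieved text; admit ignorance otherwise."""
    evidence = retrieve(question, documents)
    if not evidence:
        return Answer("I don't have information about that.", grounded=False)
    return Answer(f"According to my sources: {evidence[0]}", grounded=True)


# Illustrative data, not taken from any real knowledge base.
docs = [
    "Contrast paints flow into the recesses of miniatures to create shading.",
    "Drybrushing picks out raised edges on textured surfaces.",
]
print(answer_with_grounding("How do contrast paints work on miniatures", docs))
print(answer_with_grounding("Which armies did I play in 2004", docs))
```

The design choice worth noticing is the explicit abstention branch: when nothing relevant is retrieved, the wrapper says so rather than generating a fluent guess. A production system would need far better retrieval and a real model behind it.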

Business users notice the consequences. Sales teams fed inaccurate product details into client presentations. Developers watched generated code break production systems because the model invented nonexistent APIs. One Register report from May described Gemini carrying out a 30,000-line code purge and then producing a completely fictitious post-mortem report. Trust erodes quickly once errors surface in high-stakes settings.

Even so, adoption continues. Gemini sits on billions of Android devices. Its integration into search, email, and productivity tools makes occasional interaction almost unavoidable. Users like Gouveia begin with skepticism yet find themselves relying on the assistant for inspiration. Then one clear fabrication changes the dynamic. The model shifts from helpful partner to source that demands constant verification.

Gouveia’s article captures that pivot. He had written earlier about restarting the hobby with Gemini’s encouragement. The chatbot suggested color schemes, recommended new releases, and even helped plan his first painted army in years. Those interactions felt productive. The later conversation about his personal history exposed the fragility underneath. The valuable lesson: AI doesn’t remember. It predicts tokens. When the prediction veers from reality, the output still sounds authoritative.

Industry observers debate whether transparency would help. If Google published exact hallucination rates for each model variant and use case, would users adjust their expectations? Or would the numbers simply confirm what many already sense after repeated interactions? Mashable argued that the absence of data itself damages credibility. Without numbers, companies can tout benchmark wins on narrow tests while consumers encounter errors in open-ended dialogue.

Newer research suggests hallucination rates have fallen overall. Some 2026 analyses claim a 95 percent reduction on certain factual query sets compared with 2024 models. Four leading systems now sit below 1 percent on summarization. Yet the errors that remain prove stubborn precisely because they occur in areas users care about most: personal advice, niche hobbies, specialized professional knowledge. A 0.7 percent error rate sounds impressive until the mistake appears in the one answer you act upon.
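A quick back-of-the-envelope calculation shows why a small per-answer rate still adds up. The sketch below assumes each answer is wrong independently with the same probability, which is a simplification, and it borrows the 0.7 percent summarization figure purely as an illustrative input; the function name is hypothetical.

```python
def chance_of_at_least_one_error(error_rate: float, answers: int) -> float:
    """Probability that at least one of `answers` independent responses is wrong."""
    return 1.0 - (1.0 - error_rate) ** answers


# How the odds grow as a user asks more questions at a 0.7 percent error rate.
for n in (1, 10, 100, 500):
    print(n, round(chance_of_at_least_one_error(0.007, n), 3))
```

Under those assumptions, the chance of having seen at least one fabricated answer reaches roughly one in two by about 100 answers. Real error rates vary widely by topic, so the true figure for niche or personal questions is likely higher.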

Gouveia closed his piece with a measured tone. He still uses Gemini. The tool retains value for brainstorming and surface-level information. But he no longer treats its statements as authoritative without checking. That cautious posture represents the mature stance many professionals now adopt. Enthusiasm for AI capabilities persists. Blind faith does not.

The incident also underscores how personal stories drive home technical problems better than any leaderboard. Benchmarks measure average performance across thousands of queries. A single conversation about a dusty box of plastic Space Marines reveals the lived experience. The model didn’t just err. It constructed an alternate personal history and presented it as fact. The emotional impact lingered. So did the realization that similar fabrications could appear in far more consequential domains.

Google continues to iterate. Future Gemini versions promise better grounding in external knowledge and stronger refusal mechanisms when data is absent. Whether those changes reach the level users now demand remains uncertain. For now, the lesson from one hobbyist’s exchange stands clear. Verify everything. Question confident assertions. And remember that today’s most fluent AI can still invent your own past without hesitation.
