Meta Caught Cheating On AI Benchmarks

Meta released two new Llama 4 models, touting their “unparalleled, industry-leading performance,” but the company has been caught cheating on its benchmarks.

Meta is unusual in the AI space: it develops some of the leading open source AI models, whereas OpenAI, Anthropic, and others develop closed source models. Meta has been eager to prove that open source models can compete with the best the industry has to offer.

In its press release, Meta detailed the two new models.

These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We designed two efficient models in the Llama 4 series, Llama 4 Scout, a 17 billion active parameter model with 16 experts, and Llama 4 Maverick, a 17 billion active parameter model with 128 experts. The former fits on a single H100 GPU (with Int4 quantization) while the latter fits on a single H100 host. We also trained a teacher model, Llama 4 Behemoth, that outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks such as MATH-500 and GPQA Diamond. While we’re not yet releasing Llama 4 Behemoth as it is still training, we’re excited to share more technical details about our approach.

Meta goes on to tout the models' performance, claiming they outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro. The results quickly put Maverick in second place on the LM Arena site. The only problem is that it now appears Meta was cheating on its benchmarks.

The issue was spotted by a number of researchers who posted on X. As the researchers pointed out, Meta appears to have used an “experimental” version of Maverick for its benchmarks, not the version widely available to the public.

@TheXeophon confirmed chat model score was kind of fake news… "experimental chat version" pic.twitter.com/XxeDXwSBHw
— Nathan Lambert (@natolambert) April 6, 2025

https://twitter.com/suchenzang/status/1908938638869909724

this would explain it: "optimized for conversationality" pic.twitter.com/5iGPpFOIEF
— Zain (@ZainHasan6) April 6, 2025

Meta has yet to comment on the discrepancy.
