On the Cutting Edge of AI: Google’s Pursuit of “Long Context” in Large Language Models

Written by Ryan Gibson

In the evolving field of artificial intelligence, one of the most technically challenging and transformative frontiers is the development of AI models that can understand and process vast amounts of information at once—what industry insiders refer to as “long context.” In a recent episode of the Google AI: Release Notes podcast, Logan Kilpatrick sits down with Nikolay Savinov, Staff Research Scientist at Google DeepMind, to delve into the journey and implications of scaling context windows in Google’s Gemini series of large language models (LLMs).

Decoding Tokens and Context Windows

At the heart of LLMs is the way they process information through “tokens.” As Savinov explains, a token of text is generally a little less than one word, sometimes just part of a word or even a punctuation mark. Tokenization is a vital abstraction for computational efficiency. “While there are some benefits” to using character-level input, Savinov notes, “the most important drawback is… the generation is going to be slower,” since models generate output one token at a time.

Kilpatrick and Savinov point out that this fundamental difference from character-by-character processing means models see language very differently than humans do. For example, counting the occurrences of a particular letter in a word like “strawberry” becomes surprisingly difficult when the model’s view of the word is shaped by how the tokenizer breaks it down—a well-documented quirk that has spawned a host of research papers and humorous online examples.
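To make the distinction concrete, here is a minimal Python sketch; the subword split and token IDs below are hypothetical illustrations, not output from Gemini’s actual tokenizer.

```python
# Hypothetical subword split of "strawberry"; a real tokenizer may differ.
hypothetical_tokens = ["str", "aw", "berry"]
hypothetical_ids = [496, 1023, 8871]   # made-up IDs for illustration

# A character-level view makes letter counting trivial:
word = "".join(hypothetical_tokens)
print(word.count("r"))                 # 3

# A token-based model instead receives only the ID sequence, so the
# characters inside each token are never directly visible to it:
print(hypothetical_ids)                # [496, 1023, 8871]
```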

The conversation moves from tokens to the more complex notion of the “context window”—the running memory of the AI model during any given task. Context windows allow LLMs to reason over several sources of knowledge:

  • In-weight (pre-training) memory: The factual world knowledge encoded during model training.
  • In-context memory: Information supplied at inference time, such as recent conversation history or uploaded documents.

Savinov underscores the flexibility of in-context memory: “It’s much, much easier to modify and update than in-weight memory.” This distinction is crucial for personalization, up-to-date information, and handling rare facts that may not have been present in the model’s original training data.

Retrieval Augmented Generation (RAG) vs. Long Context

For enterprise users, whose knowledge bases can span billions of tokens, systems like Retrieval Augmented Generation (RAG) will remain important. RAG pairs vector search across large external datasets with LLMs that process only a small, relevant subset of data retrieved for each query.

The debate of “RAG versus long context” is not an either/or, Savinov argues. He anticipates a future where the technologies are synergistic: “not like RAG is going to be eliminated… rather, long context and RAG are going to work together.” As context windows grow, RAG can afford to be more generous in what it retrieves, improving recall of relevant information instead of filtering aggressively at retrieval time.
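As a rough illustration of how the two combine, here is a minimal Python sketch; embed() is a toy hashing embedder and generate() a placeholder for the actual LLM call, both assumptions rather than any production API.

```python
import numpy as np

# Minimal RAG sketch: rank chunks by similarity to the query, then hand a
# generous top_k to a long-context model.
def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashing embedder, standing in for a real embedding model.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def generate(prompt: str) -> str:
    return f"<LLM call over {len(prompt)} characters of prompt>"  # placeholder

def answer(query: str, chunks: list[str], top_k: int = 20) -> str:
    # Vector search: score every chunk against the query embedding.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: float(q @ embed(c)), reverse=True)
    # With a longer context window, top_k can be generous (better recall)
    # instead of aggressively filtering down to a handful of chunks.
    context = "\n\n".join(ranked[:top_k])
    return generate(f"{context}\n\nQuestion: {query}")
```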

Breaking the Million-Token Barrier

The Gemini family has made headlines for its dramatic improvements in context window size, shipping first a 1 million-token and then a 2 million-token window—an order of magnitude beyond previous state-of-the-art models. Savinov reveals that the leap was guided partially by ambition: “I originally thought… let’s set an ambitious bar. So I thought, well, 1 million is an ambitious enough step forward.” Soon after, 2 million was within reach.

Technically, testing pushed Gemini’s context capabilities as far as 10 million tokens in some cases, with “single-needle retrieval… almost perfect for the whole 10 million context.” Yet, inference at these sizes comes with prohibitive computational costs. “We could have shipped this model,” Savinov notes, reflecting on the 10 million-token tests. “But it’s pretty expensive to run this inference. So I guess we weren’t sure if people are ready to pay a lot of money for this.”

The constraint is as much about server costs and engineering as it is about model research. Going significantly further will require “more innovations. It’s not just a matter of brute-force scaling. To actually have close to perfect 10 million context, we need to learn more innovations,” says Savinov.

Quality Gains and Benchmarking

Following the major launches, the focus has shifted to improving retrieval quality, especially at high context sizes. Benchmarking the Gemini 2.5 Pro model showed notable gains over competitors, with evaluation designed to ensure apples-to-apples comparison at 128k tokens.

A specific challenge observed and now largely overcome by Gemini was the “lost in the middle” effect—a known issue where models struggle to retrieve facts from the center of a long prompt. While Gemini models largely avoid this, retrieval quality for more complex tasks with “hard distractors”—irrelevant information that closely resembles the target—can still decrease slightly as context size grows.

Savinov emphasizes that developers should not simply flood the context window with irrelevant data, since distractors, especially those that resemble the target, can degrade performance. Pre-filtering, while cumbersome for humans, would ideally be handled automatically as models and their agentic capabilities mature.

Needles, Haystacks, and Future Metrics

The initial “needle in a haystack” evaluation—retrieving a single fact from millions of tokens—has become a solved problem. The research frontier is now about handling multiple retrievals and hard distractors. Savinov notes a tradeoff between realism and measurement clarity in evals: overly realistic benchmarks may say more about the task itself (such as code understanding) than about pure context-handling ability.
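The shape of such an eval is easy to sketch. In the toy construction below, the filler text, the needle, and the hard distractor are all invented for illustration, and model() stands in for a real LLM call.

```python
# Toy needle-in-a-haystack construction with one hard distractor.
filler = "The committee reviewed routine agenda items without incident. "
needle = "The launch code word for Project Iris is 'peregrine'. "
hard_distractor = "The launch code word for Project Lily is 'kestrel'. "

def build_haystack(n_filler: int, needle_pos: float) -> str:
    sentences = [filler] * n_filler
    sentences.insert(int(0.1 * n_filler), hard_distractor)  # similar but wrong
    sentences.insert(int(needle_pos * n_filler), needle)     # the actual target
    return "".join(sentences)

prompt = (build_haystack(50_000, needle_pos=0.5)             # needle in the middle
          + "\nWhat is the launch code word for Project Iris?")
# score = 1.0 if "peregrine" in model(prompt) else 0.0       # model() is hypothetical
```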

Tasks requiring integration across the entire context, like summarization, define the next phase—but evaluating these remains tricky due to the subjectivity in defining “correct” outputs and the noisiness of metrics like ROUGE.

Integrating Long-Context Research

Long context research, once an independent workstream, is increasingly becoming fused with broader research directions like reasoning and factuality. Savinov expresses the need for specialized ownership of the capability, but also for collaboration tools that allow every team to contribute.

Reasoning, Long Outputs, and the Coding Frontier

Advanced reasoning tasks inherently benefit from long context capabilities. Not only can long context improve the accuracy of short-answer questions due to more available information, but allowing outputs to feed back into subsequent inputs unlocks a form of iterative reasoning not bounded by the network’s fixed depth.
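A minimal sketch of that feedback loop follows; model() is a canned stand-in for a real LLM call, and the prompt wording is illustrative only.

```python
# Iterative reasoning sketch: each output is appended to the context and fed
# back in, so effective reasoning depth is not capped by the network's depth.
def model(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned answer here.
    return "FINAL: 42"

def solve(question: str, max_steps: int = 8) -> str:
    context = question
    for _ in range(max_steps):
        step = model(context + "\nThink one more step, or reply 'FINAL: <answer>'.")
        context += "\n" + step              # the output becomes new input context
        if step.startswith("FINAL:"):
            return step
    return model(context + "\nGive your best final answer now.")

print(solve("What is 6 * 7?"))              # FINAL: 42
```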

Developers have also expressed strong demand for longer output windows—from the current 8,000 tokens to potential future capacities above 65,000 tokens. Savinov reveals that while there are no strict model limitations coming out of pre-training, achieving reliable and safe long outputs is an alignment and post-training challenge, particularly around where to place end-of-sequence tokens.

Complete codebase refactoring, large-scale translations, and end-to-end reasoning over massive textual inputs are the holy grails that seem attainable as context windows and output lengths increase in tandem.

Developer Best Practices: Caching, RAG, and Prompting

Savinov stresses the importance of context caching—reusing parts of the context across multiple queries to reduce both latency and cost. Especially in applications like “chat with your documents,” developers should keep the document context at the start of the prompt and append each new question after it, so the shared prefix can be cached and only the new input processed. This alone, Savinov says, can reduce input token costs by up to a factor of four.
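The principle is easy to see in a sketch; DOCUMENTS and ask() below are placeholders rather than a specific SDK’s caching API.

```python
# Cache-friendly prompt layout: keep the large, unchanging document prefix
# identical across requests and append only the new question at the end.
DOCUMENTS = "<contents of the uploaded documents>"   # large, stable context

def ask(prompt: str) -> str:
    return "<LLM response>"                          # placeholder for the real API call

def chat_with_documents(question: str) -> str:
    # Because the prefix is byte-identical on every call, a serving stack that
    # supports context caching can reuse the already-processed document tokens
    # and only pay for the newly appended question.
    return ask(f"{DOCUMENTS}\n\nQuestion: {question}")

# Anti-pattern: putting the question before the documents changes the prefix
# on every call and defeats the cache.
```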

Other practical recommendations for developers include:

  • Combine long context with RAG when datasets exceed a few million tokens.
  • Avoid irrelevant or overly repetitive distractors to maximize retrieval quality.
  • When updating or qualifying knowledge in-context, explicitly tell the model to rely on the supplied information (“based on the information above…”) rather than defaulting to in-weight memory (see the prompt sketch after this list).
  • Weigh the tradeoffs between fine-tuning and in-context learning; fine-tuning can make inference cheaper and faster but is less flexible and harder to update.
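For the third recommendation above, a minimal prompt construction might look like the following; the facts and phrasing are made up for illustration, not an official template.

```python
# Illustrative prompt that steers the model toward in-context information
# rather than in-weight memory; the facts and wording are hypothetical.
updated_facts = (
    "As of this week, the project codename changed from 'Alpha' to 'Borealis'."
)
question = "What is the current project codename?"

prompt = (
    f"{updated_facts}\n\n"
    f"{question}\n"
    "Answer based on the information above."   # explicit grounding instruction
)
```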

The Next Three Years: Towards Automatic, Superhuman Recall

Looking ahead, Savinov forecasts “the quality of the current 1 or 2 million contexts is going to increase dramatically,” achieving near-perfect retrieval and opening “totally incredible applications… Like, the ability to process information and connect the dots, it will increase dramatically.” When the cost per token drops—with engineering and hardware advances catching up—windows of 10 million tokens will unlock effortless navigation and manipulation of truly large codebases.

The implications for software development are profound. As Savinov puts it: “The way humans are coding, well, you need to hold in memory as much as possible to be effective as a coder… LLMs are going to circumvent this problem completely.” Cutting-edge AIs could hold an entire codebase—spanning tens of millions of tokens—in memory at once, discovering subtle connections and performing tasks at a speed and scale unimaginable for human engineers.

Infrastructure Remains a Bottleneck

Improvements in underlying hardware and inference engineering remain critical. “Just having the chips is not enough. You also need very talented inference engineers,” Savinov asserts, crediting the DeepMind inference team with making the leap to million-token contexts possible for customers.

Agents as Context Managers

Ultimately, the tedium of supplying context—manually uploading files or paraphrasing information—will vanish. Savinov sees intelligent agent systems as both heavy consumers and suppliers of long context. They will not only require persistent memory to interact meaningfully over time, but will, via tool use, fetch, filter, and assemble relevant context dynamically, freeing users from the need to “bring all the context in by hand.”

As Google pushes its Gemini models to new frontiers in long context, the boundaries between memory, retrieval, reasoning, and interaction continue to blur. While the roadblocks are significant—from compute constraints to evaluation metrics to safe deployment—each milestone in long context AI models opens up new technical possibilities, with profound implications for knowledge work, software development, and the eventual vision of autonomous AI agents.
