In the rapidly evolving field of artificial intelligence, a provocative idea is gaining traction: What if large language models treated plain text not as sequences of tokens, but as visual images? This concept, explored in a recent blog post by software engineer Sean Goedecke on his site seangoedecke.com, stems from advancements in optical character recognition (OCR) technology. Goedecke delves into DeepSeek’s latest OCR paper, which proposes converting text into image format before processing, potentially revolutionizing how models handle long-form content.
The core argument hinges on efficiency. Traditional LLMs break text into tokens, sub-word units whose count can balloon for dense documents. DeepSeek’s approach, as Goedecke explains, renders text as images and then applies vision-based models to extract meaning. This drastically cuts token counts: a lengthy PDF page might consume thousands of text tokens but only a fraction of that when encoded visually, according to Tom’s Hardware, which reported up to 20 times fewer tokens in similar setups.
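To make the scale of those savings concrete, here is a rough back-of-envelope sketch in Python. The figures (a 4,000-character page, roughly four characters per text token, a 100-token visual budget) are illustrative assumptions, not numbers from DeepSeek’s paper; the real ratio depends on the tokenizer, the rendering resolution, and how aggressively the vision encoder compresses its patch grid.

```python
# Illustrative comparison of text tokens vs. vision tokens for one dense page.
# All constants below are assumptions made for the sake of the sketch.
page_chars = 4000                  # characters on a dense PDF page (assumed)
chars_per_text_token = 4           # rough rule of thumb for English BPE tokenizers
text_tokens = page_chars // chars_per_text_token   # ~1,000 text tokens

vision_token_budget = 100          # hypothetical compressed visual-token budget
print(f"text tokens   ~ {text_tokens}")
print(f"vision tokens ~ {vision_token_budget}")
print(f"savings       ~ {text_tokens / vision_token_budget:.0f}x fewer tokens")
```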
Challenging Conventional Tokenization: A Shift Toward Visual Efficiency
This visual paradigm isn’t just about compression—it’s a rethinking of multimodal AI. Goedecke points out that by treating text as pixels, models can leverage image-processing strengths, like those in diffusion models, to handle context more holistically. For example, spatial relationships in formatted text, such as tables or diagrams, are preserved better as images than as linear token streams.
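As a concrete picture of the rendering step, the short sketch below uses Pillow to draw a small table onto an image, so its column alignment survives as two-dimensional layout rather than being flattened into a linear token stream. The table contents, image size, and file name are invented for illustration; a real pipeline would render full pages at higher resolution.

```python
# Minimal sketch of "text as pixels": draw a tiny table onto an image so a
# downstream vision encoder sees its 2D layout, not a flattened token stream.
from PIL import Image, ImageDraw

rows = [
    ("quarter", "revenue", "growth"),
    ("Q1",      "$1.2M",   "+8%"),
    ("Q2",      "$1.5M",   "+25%"),
]

img = Image.new("RGB", (480, 120), "white")
draw = ImageDraw.Draw(img)
for r, row in enumerate(rows):
    for c, cell in enumerate(row):
        draw.text((20 + c * 160, 20 + r * 30), cell, fill="black")

img.save("table_page.png")  # this rendered image, not the raw text, is what a vision encoder would ingest
```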
Industry insiders might recall how transformer-based models, as detailed in Goedecke’s related post on diffusion models, already tokenize language into finite units. Extending this to images could unify text and visual data under one framework, reducing the computational overhead that plagues current systems.
Real-World Implications: Token Savings and Performance Gains
DeepSeek’s paper, highlighted in Goedecke’s analysis, demonstrates practical benefits through “vision-text compression.” By converting documents to images, models process them with fewer resources, making AI more accessible for edge devices or cost-sensitive applications. A Forbes article by Lance Eliot echoes this, suggesting that using images of text instead of pure text could enhance generative AI’s scalability.
Critics, however, question the trade-offs. Goedecke acknowledges potential accuracy losses in OCR extraction, especially for noisy or stylized text. Yet emerging research, such as TiTok, the model introduced in the arXiv paper “An Image is Worth 32 Tokens,” which tokenizes images into compact 1D sequences, shows that such methods can achieve state-of-the-art reconstruction with as few as 32 tokens per image.
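To show what a 1D image tokenizer of that kind can look like, here is a minimal PyTorch sketch of the general idea rather than TiTok’s actual architecture: a small set of learned latent tokens cross-attends over ViT-style patch embeddings, so the whole image is summarized by 32 tokens instead of a full 2D patch grid. The class name, dimensions, and layer choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OneDImageTokenizer(nn.Module):
    """Sketch of a TiTok-style 1D tokenizer: learned latent tokens attend over
    patch embeddings, compressing the image into num_latents tokens.
    Hypothetical architecture for illustration, not the paper's exact model."""

    def __init__(self, num_latents=32, dim=256, patch=16, image_size=256):
        super().__init__()
        num_patches = (image_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        patches = self.patch_embed(images)            # (B, dim, H/p, W/p)
        patches = patches.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        patches = patches + self.pos
        queries = self.latents.expand(images.size(0), -1, -1)
        tokens, _ = self.attn(queries, patches, patches)  # cross-attention
        return self.norm(tokens)                      # (B, 32, dim): compact 1D code

tokenizer = OneDImageTokenizer()
codes = tokenizer(torch.randn(2, 3, 256, 256))
print(codes.shape)  # torch.Size([2, 32, 256])
```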
Broader Horizons: Integrating Modalities for Future AI
Looking ahead, this text-as-image strategy could bridge gaps in cross-modal generation. Papers on ResearchGate, such as one on FlowTok, propose flowing seamlessly between text and image tokens, shrinking the latent space by roughly a factor of three and accelerating sampling. Goedecke’s post ties this to broader AI trends, where efficiency isn’t just a nice-to-have but a necessity amid soaring energy demands.
For enterprises, adopting these techniques means rethinking data pipelines. As Sider.ai’s blog on DeepSeek-OCR notes, token costs could drop by up to 10 times, enabling more ambitious applications like real-time document analysis. While challenges remain, the momentum suggests a future where AI blurs the lines between reading and seeing, potentially transforming how we interact with information.
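Translated into rough pipeline economics, the savings compound at scale. The sketch below assumes a hypothetical price of $1 per million input tokens and the roughly tenfold token reduction cited above; actual prices and ratios vary by provider, model, and document type.

```python
# Back-of-envelope cost for processing one million document pages.
# The price and per-page token counts are assumptions, not quoted figures.
pages = 1_000_000
text_tokens_per_page = 1_000
vision_tokens_per_page = 100          # ~10x fewer after vision-text compression
usd_per_million_tokens = 1.00         # hypothetical input-token price

text_cost = pages * text_tokens_per_page / 1e6 * usd_per_million_tokens
vision_cost = pages * vision_tokens_per_page / 1e6 * usd_per_million_tokens
print(f"text-token cost:   ${text_cost:,.0f}")     # $1,000
print(f"vision-token cost: ${vision_cost:,.0f}")   # $100
```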

