Hello and welcome to Eye on AI. In this edition: DeepSeek defies AI convention (again)…Meta’s AI layoffs…
More legal trouble for OpenAI…and what AI gets wrong about the news.

Hi, Beatrice Nolan here, filling in for AI reporter Sharon Goldman, who is out today. Chinese AI company DeepSeek has released a new open-source model that flips some conventional AI wisdom on its head.
The DeepSeek-OCR model and its accompanying white paper fundamentally reimagine how large language models process information by compressing text into visual representations. Instead of feeding text into a language model as tokens, DeepSeek converts it into images.
The approach is up to ten times more token-efficient and opens the door to much larger context windows (the amount of text a language model can actively consider at once when generating a response). It could also mean a new and cheaper way for enterprise customers to harness the power of AI.
Early tests have shown impressive results. For every 10 text tokens, the model needs only one “vision token” to represent the same information with 97% accuracy, the researchers wrote in their technical paper. Even when text is compressed up to 20 times, accuracy stays around 60%. That means the model can store and handle roughly 10 times more information in the same space, which is especially useful for long documents or for letting the AI take in bigger sets of data at once.
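To put those figures in rough, back-of-the-envelope terms (the document size and context window below are assumptions chosen for illustration, not DeepSeek's own measurements):

```python
# Back-of-the-envelope illustration of the reported compression ratios.
# The input sizes here are made up for illustration purposes.

text_tokens = 1_000_000          # e.g. a large document set, in ordinary text tokens
accuracy_at_10x = 0.97           # reported reconstruction accuracy at ~10x compression
accuracy_at_20x = 0.60           # reported reconstruction accuracy at ~20x compression

vision_tokens_10x = text_tokens // 10   # one vision token stands in for ~10 text tokens
vision_tokens_20x = text_tokens // 20

print(f"{text_tokens:,} text tokens -> {vision_tokens_10x:,} vision tokens "
      f"at ~{accuracy_at_10x:.0%} accuracy")
print(f"{text_tokens:,} text tokens -> {vision_tokens_20x:,} vision tokens "
      f"at ~{accuracy_at_20x:.0%} accuracy")

# Flipped around: a model that fits 128,000 tokens of input could, at a 10:1
# ratio, effectively "see" on the order of 1.28 million tokens' worth of text.
context_window = 128_000
print(f"A {context_window:,}-token context could cover roughly "
      f"{context_window * 10:,} text tokens of content at 10x compression")
```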
The new research has caught the eye of several prominent AI figures, including Andrej Karpathy, an OpenAI co-founder, who went so far as to suggest that all inputs to LLMs might be better as images.
“The more interesting part for me…is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you’d prefer to render it and then feed that in,”
Karpathy wrote in a post on X that highlighted several other advantages of image-based inputs.
What this means for enterprise AI

The research could have significant implications for how businesses use AI. Language models are limited by the number of tokens they can process at once, but compressing text into images in this way could allow models to process much larger knowledge bases. Users don’t need to manually convert their text, either. DeepSeek’s model automatically renders text input as 2D images internally, processes them through its vision encoder, and then works with the compressed visual representation.
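Conceptually, the flow looks something like the sketch below. Every class and function name in it is a hypothetical placeholder rather than DeepSeek's actual code or API; the point is only the order of operations.

```python
# Hypothetical sketch of the text-as-image flow described above. None of these
# names are DeepSeek's real API; they stand in for the conceptual steps:
# render text as an image, compress it with a vision encoder, then let the
# language model work from the compressed representation.

from dataclasses import dataclass
from typing import List

@dataclass
class VisionTokens:
    """Compact visual representation of rendered text (placeholder type)."""
    data: List[float]

def render_text_to_image(text: str) -> bytes:
    """Step 1: lay the raw text out as a 2D image (placeholder: bytes stand in for pixels)."""
    return text.encode("utf-8")

def vision_encoder(image: bytes) -> VisionTokens:
    """Step 2: compress the image into roughly 10x fewer tokens than the raw text (placeholder)."""
    return VisionTokens(data=[0.0] * max(1, len(image) // 10))

def decode_with_llm(tokens: VisionTokens, query: str) -> str:
    """Step 3: the model reasons over the compressed representation (placeholder)."""
    return f"(answer to {query!r}, based on {len(tokens.data)} vision tokens)"

def answer_over_document(document: str, query: str) -> str:
    image = render_text_to_image(document)   # happens inside the model, not by the user
    tokens = vision_encoder(image)           # far fewer tokens than the original text
    return decode_with_llm(tokens, query)
```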
AI systems can only actively consider a limited amount of text at a time, so users have to rely on search tools or feed models their documents bit by bit. But with a much bigger context window, it could be possible to feed an AI system all of a company’s documents or an entire codebase at once. In other words, instead of asking an AI tool to search each file individually, a company could put everything into the AI’s “memory” at once and ask it to analyze information from there.
The model is publicly available and open source, so developers are already experimenting with it.
“The potential of getting a frontier LLM with a 10 or 20 million token context window is pretty exciting,” Jeffrey Emanuel, a former quant investor, said. “You could basically cram all of a company’s key internal documents into a prompt preamble and cache this with OpenAI and then just add your specific query or prompt on top of that and not have to deal with search tools and still have it be fast and cost-effective.”
He also suggested companies may be able to feed a model an entire codebase at once and then simply update it with each new change, letting the model keep track of the latest version without having to reload everything from scratch.
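A minimal sketch of what that workflow could look like on the application side, assuming a model with a context window large enough to hold the preamble. The `build_preamble`, `call_model`, and `ask` functions are hypothetical placeholders, not any provider's real API; the idea is simply to keep one large, byte-identical document prefix and vary only the short question appended to it, so that prompt caching on the provider's side can reuse the earlier work.

```python
# Hypothetical sketch: one large, reusable preamble plus a small per-request query.
# `call_model` stands in for whatever chat-completion API is actually used.

from pathlib import Path

def build_preamble(doc_dir: str) -> str:
    """Concatenate a company's key documents into a single reusable prefix."""
    parts = []
    for path in sorted(Path(doc_dir).rglob("*.md")):
        parts.append(f"## {path.name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)

def call_model(prompt: str) -> str:
    """Placeholder for a real API call (OpenAI, DeepSeek, etc.)."""
    raise NotImplementedError

# Build the preamble once. Providers that cache identical prompt prefixes can
# reuse the earlier computation, so repeated calls are faster and cheaper.
PREAMBLE = build_preamble("internal_docs/")

def ask(question: str) -> str:
    # The preamble stays byte-for-byte identical across calls, so only the
    # short question at the end is "new" work for the model each time.
    return call_model(PREAMBLE + "\n\nQuestion: " + question)
```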
The paper also opens the door for some intriguing possibilities for how LLMs might store information, such as using visual representations in a way that echoes human “memory palaces,” where spatial and visual cues help organize and retrieve knowledge.
There are caveats, of course. For one, DeepSeek’s work focuses mainly on how efficiently data can be stored and reconstructed, not on whether LLMs can reason as effectively over these visual tokens as they do with regular text. The approach may also introduce new complexities, like handling different image resolutions or color variations.
Even so, the idea that a model could process information more efficiently by seeing text could be a major shift in how AI systems handle knowledge. After all, a picture is worth a thousand words, or, as DeepSeek seems to be finding, ten thousand.
And with that, here’s the rest of the AI news.
Beatrice Nolan
bea.nolan@fortune.com
@beafreyanolan