Key Takeaway
Chinese AI firm DeepSeek has released an open-source model, DeepSeek-OCR, that compresses text into visual tokens up to 20 times more efficiently than traditional text tokens. The approach raises fundamental questions about whether future language models should process text directly or through visual representations.
DeepSeek-OCR Release – Key Points
DeepSeek-OCR Release
In October 2025, DeepSeek launched DeepSeek-OCR as a fully open-source model on GitHub and Hugging Face, publishing weights, training code, and inference scripts. The technical paper positions the work as an exploration of “optical context compression.”
Compression Breakthrough
Instead of feeding text to the model as text tokens, the system renders it into high-resolution images and processes those with a vision encoder. This yields compression ratios of 7–20×: representing 700–800 text tokens as just 100 vision tokens (roughly 7–8×) preserves 97.3% accuracy, while at 20× compression accuracy falls to ~60%, illustrating the trade-off between efficiency and precision.
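The reported ratios follow directly from the token counts above; a quick check (the function name and numbers are illustrative, taken from the figures in this section):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens

# 700-800 text tokens rendered into 100 vision tokens
low = compression_ratio(700, 100)   # 7.0
high = compression_ratio(800, 100)  # 8.0
print(f"{low:.0f}x-{high:.0f}x compression at 97.3% accuracy")
```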
Architecture
The system combines:
• DeepEncoder, a 380M-parameter vision encoder that merges Meta’s SAM for local perception and OpenAI’s CLIP for global visual understanding, connected through a 16× compression module.
• DeepSeek3B-MoE-A570M, a 3B-parameter mixture-of-experts language decoder with 570M active parameters; each token is routed to a small subset of specialized “expert” sub-networks rather than through the full model.
Multiple resolution modes are supported, including a dynamic “Gundam” mode that blends tiled local views with a global perspective for complex documents.
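The token budget this pipeline implies can be sketched in a few lines. The 16× compressor matches the figure above; the patch size and the 1024×1024 resolution are assumptions for illustration, not DeepSeek's published configuration:

```python
def vision_token_budget(image_size: int, patch_size: int = 16,
                        compression: int = 16) -> int:
    """Hypothetical token math: split a square page image into patches,
    then shrink the patch-token sequence through the 16x compressor."""
    patch_tokens = (image_size // patch_size) ** 2
    return patch_tokens // compression

# A 1024x1024 page: 4096 patch tokens -> 256 compressed vision tokens
print(vision_token_budget(1024))  # 256
```

The point of the sketch is that the compressor, not the encoder, is what keeps the decoder's input short: the decoder sees hundreds of vision tokens per page instead of thousands of raw patches.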
Efficiency at Scale
The model dramatically reduces hardware requirements. A single Nvidia A100-40G GPU can process ~200,000 pages per day, while a cluster of 20 servers with 160 GPUs reaches ~33 million pages daily, enabling fast dataset creation for AI training.
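The cluster figure is just the single-GPU rate scaled up; assuming 8 GPUs per server (an assumption, since the article only gives the totals):

```python
pages_per_gpu_per_day = 200_000      # single Nvidia A100-40G
gpus = 20 * 8                        # 20 servers x 8 GPUs = 160
total = pages_per_gpu_per_day * gpus
print(f"{total:,} pages/day")        # 32,000,000 (~33M as reported)
```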
Performance vs Competitors
On the OmniDocBench benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 (which requires 256 tokens per page) while using only 100 vision tokens, and surpassed MinerU2.0 (which averages over 6,000 tokens per page) while using fewer than 800 vision tokens.
Context Window Expansion
By compressing text into images, the method could expand effective LLM context windows to 10–20 million tokens, roughly an order of magnitude beyond current leaders: OpenAI’s GPT-5 (~400K), Anthropic’s Claude 4.5 (200K, with a 1M beta), and Google’s Gemini 2.5 Pro (1–2M). That scale could allow entire corporate archives or large knowledge bases to be embedded in a single prompt.
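The 10–20M figure follows from multiplying a vision-token window by the compression ratio; the 1M vision-token budget below is a hypothetical, chosen to match today's largest windows:

```python
def effective_text_context(vision_token_budget: int, ratio: float) -> int:
    """Text tokens representable if each vision token stands in
    for `ratio` text tokens."""
    return int(vision_token_budget * ratio)

budget = 1_000_000  # hypothetical vision-token context window
print(effective_text_context(budget, 10))  # 10,000,000
print(effective_text_context(budget, 20))  # 20,000,000
```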
Cognitive & Tokenizer Implications
Andrej Karpathy highlighted that replacing text tokenizers with image processing removes long-standing inefficiencies: tokenizers inherit Unicode complexity, encoding quirks, and vulnerability to jailbreaks. Visual encoding also naturally preserves formatting, colors, layouts, and embedded charts. Researchers suggest this could mimic “computational forgetting,” where older contexts are downsampled into coarser resolutions while preserving essential meaning, similar to human memory.
Training Scale
DeepSeek-OCR was trained on 30 million PDF pages in about 100 languages (25 million in Chinese and English), along with 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The training also included general vision and text-only data to maintain language balance. The process used 160 Nvidia A100 GPUs, sustaining a throughput of 70B tokens per day.
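Dividing the reported cluster throughput by the GPU count gives a rough per-GPU rate (a back-of-the-envelope check on the article's numbers, not a published figure):

```python
tokens_per_day = 70e9   # reported cluster throughput
gpus = 160              # Nvidia A100s used for training
per_gpu = tokens_per_day / gpus
print(f"{per_gpu / 1e6:.1f}M tokens per GPU per day")  # 437.5M
```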
Industry Impact
The open-source release quickly attracted global attention, gaining thousands of GitHub stars within 24 hours. Media outlets emphasized its potential in domains like finance, science, and medicine, where handling tables, graphs, and charts is crucial. The release aligns with DeepSeek’s history of achieving competitive results at lower cost compared to Western rivals. DeepSeek-V3, for instance, was reported to cost just $5.6M for its final training run, although broader estimates place its infrastructure costs closer to $1.3B. The addition of vLLM support further simplifies adoption for developers.
Unresolved Questions
While the compression results are impressive, it remains unclear whether language models can reason over compressed visual tokens as effectively as over text tokens. Benchmarks show accuracy falling to ~60% at extreme (20×) compression, and some researchers caution that using the method to generate training data could propagate errors. Still, even modest compression of around 2× with near-perfect accuracy would provide meaningful efficiency gains.
Hands-on Availability
DeepSeek-OCR is freely available on GitHub and Hugging Face. Developers report smooth setup and strong performance across professional and even consumer-grade GPUs. Community guides and integrations are already emerging to accelerate experimentation.
Why This Matters
DeepSeek’s approach challenges the foundations of how AI models process information. By turning text into images, it reduces resource use, expands context limits, and avoids the pitfalls of traditional tokenization. If widely adopted, this method could reshape how multimodal models are trained and deployed, potentially making large-scale AI systems faster, cheaper, and more powerful. Its open-source release ensures researchers worldwide can test and adapt the technology, accelerating both competition and collaboration in the AI industry.
This article was drafted with the assistance of generative AI. All facts and details were reviewed and confirmed by an editor prior to publication.