
DeepSeek's new AI model generates 200,000 training pages a day per GPU
What's the story
Chinese artificial intelligence (AI) start-up DeepSeek has launched a new multimodal AI model, DeepSeek-OCR. The innovative system can process large and complex documents with far fewer tokens than conventional methods. DeepSeek-OCR employs visual perception to compress text for large language models (LLMs), making it more efficient than traditional text processing techniques.
Model capabilities
Overcoming long-context challenges in LLMs
DeepSeek found that using "vision encoders" to compress text for LLMs lets the models handle far larger volumes of text at lower computing cost. The company said in a technical paper, "Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20x) for different historical context stages." This approach could help overcome long-context challenges in LLMs.
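To put the quoted 7-20x figure in perspective, here is a back-of-the-envelope sketch in Python; the 1,500-token page size is an assumed number chosen for illustration, not a figure from DeepSeek.

```python
# Illustrative arithmetic only: what a 7-20x vision-text compression ratio
# would mean for per-page token counts. The page size below is an assumption.

def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Estimate how many vision tokens replace a page's worth of text tokens."""
    return max(1, round(text_tokens / compression_ratio))

PAGE_TEXT_TOKENS = 1_500  # assumed token count for a dense text page

for ratio in (7, 10, 20):
    print(f"{ratio:>2}x compression: ~{vision_tokens_needed(PAGE_TEXT_TOKENS, ratio)} vision tokens")
```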
Model design
Two main components of DeepSeek-OCR
DeepSeek-OCR has two main components: a 380 million-parameter DeepEncoder, which compresses each page image into a small set of vision tokens, and a text generator built on a roughly three-billion-parameter mixture-of-experts (MoE) language model that activates about 570 million parameters at a time. The model was trained on 30 million PDF pages covering around 100 languages, including Chinese and English, as well as synthetic diagrams, chemical formulae, and geometric figures.
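As a rough mental model of how the two components fit together, the sketch below wires an encoder stage into a decoder stage; the class names, method signatures, and placeholder outputs are illustrative assumptions, not DeepSeek's actual code or API.

```python
# Schematic sketch of the two-stage design described above; names and
# return values are placeholders, not DeepSeek's implementation.
from dataclasses import dataclass


@dataclass
class VisionTokens:
    """Compressed representation of a page image."""
    count: int


class DeepEncoderSketch:
    """Stands in for the ~380M-parameter DeepEncoder: image -> few vision tokens."""

    def encode(self, page_image: bytes) -> VisionTokens:
        # A real encoder would run a vision backbone and downsample aggressively;
        # here we simply pretend every page compresses to 100 tokens.
        return VisionTokens(count=100)


class MoEDecoderSketch:
    """Stands in for the ~3B-parameter MoE decoder with ~570M active parameters."""

    def generate(self, tokens: VisionTokens, prompt: str) -> str:
        # A real decoder would autoregressively emit text conditioned on the
        # vision tokens; here we return a placeholder string.
        return f"<decoded text from {tokens.count} vision tokens>"


def ocr_page(page_image: bytes) -> str:
    """End-to-end flow: compress the page image, then decode it back to text."""
    tokens = DeepEncoderSketch().encode(page_image)
    return MoEDecoderSketch().generate(tokens, prompt="Transcribe this page.")


print(ocr_page(b"fake image bytes"))
```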
Model efficiency
Model can handle various document types
DeepSeek-OCR can compress text by up to a factor of 10 while retaining about 97% of the original information. The model can handle various document types, including plain text, diagrams, chemical formulae, and geometric figures, while preserving the original formatting. It can also output plain text or provide general image descriptions. However, the number of vision tokens required varies with document size and image resolution.
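The resolution dependence can be sketched with generic vision-transformer-style arithmetic; the 16-pixel patch size and 16x token downsampling below are assumptions made for illustration, not DeepSeek's published settings.

```python
# Back-of-the-envelope sketch of why vision-token needs grow with image
# resolution. Patch size and downsampling factor are generic assumptions.

def estimate_vision_tokens(width_px: int, height_px: int,
                           patch_px: int = 16, downsample: int = 16) -> int:
    """Count image patches, then assume heavy downsampling into vision tokens."""
    patches = (width_px // patch_px) * (height_px // patch_px)
    return max(1, patches // downsample)

for side in (640, 1024, 1280):
    print(f"{side}x{side} page image -> ~{estimate_vision_tokens(side, side)} vision tokens")
```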
Model evaluation
DeepSeek-OCR was tested on two benchmarks
DeepSeek-OCR can generate training data for LLMs and vision language models (VLMs) at a rate of more than 200,000 pages per day on a single NVIDIA A100 GPU. The model was tested on two benchmarks: OmniDocBench, which evaluates document-parsing capabilities, and Fox, which assesses how well vision language models focus on dense PDF documents.
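For a sense of scale, the reported daily figure converts to roughly two pages per second; the multi-GPU numbers below assume naive linear scaling and are not benchmarked results.

```python
# Rough throughput arithmetic based on the reported 200,000+ pages/day on
# one A100; multi-GPU scaling here is a naive linear assumption.

PAGES_PER_GPU_PER_DAY = 200_000
SECONDS_PER_DAY = 24 * 60 * 60

pages_per_second = PAGES_PER_GPU_PER_DAY / SECONDS_PER_DAY
print(f"~{pages_per_second:.1f} pages/second on a single A100")

for gpus in (8, 160):
    print(f"{gpus} GPUs (assuming linear scaling): ~{gpus * PAGES_PER_GPU_PER_DAY:,} pages/day")
```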