
DeepSeek's new AI model generates 200,000 training pages a day per GPU
What's the story
Chinese artificial intelligence (AI) start-up DeepSeek has launched a new multimodal AI model, DeepSeek-OCR. The innovative system can process large and complex documents with far fewer tokens than conventional methods. DeepSeek-OCR employs visual perception to compress text for large language models (LLMs), making it more efficient than traditional text processing techniques.
Model capabilities
Overcoming long-context challenges in LLMs
DeepSeek found that using "vision encoders" to compress text for LLMs lets the models handle far larger volumes of text at lower computing cost. The company said in a technical paper, "Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20x) for different historical context stages." This approach could help overcome long-context challenges in LLMs.
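To put the quoted 7-20x figure in perspective, here is a back-of-the-envelope sketch in Python; the 1,500-token page size is an assumed number chosen for illustration, not a figure from DeepSeek.

```python
# Illustrative arithmetic only: what a 7-20x vision-text compression ratio
# would mean for per-page token counts. The page size below is an assumption.

def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Estimate how many vision tokens replace a page's worth of text tokens."""
    return max(1, round(text_tokens / compression_ratio))

PAGE_TEXT_TOKENS = 1_500  # assumed token count for a dense text page

for ratio in (7, 10, 20):
    print(f"{ratio:>2}x compression: ~{vision_tokens_needed(PAGE_TEXT_TOKENS, ratio)} vision tokens")
```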
Model design
Two main components of DeepSeek-OCR
DeepSeek-OCR has two main components: a 380 million-parameter DeepEncoder, which compresses each page image into a small set of vision tokens, and a text generator built on a roughly three-billion-parameter mixture-of-experts (MoE) language model that activates about 570 million parameters at a time. The model was trained on 30 million PDF pages covering around 100 languages, including Chinese and English, as well as synthetic diagrams, chemical formulae, and geometric figures.
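As a rough mental model of how the two components fit together, the sketch below wires an encoder stage into a decoder stage; the class names, method signatures, and placeholder outputs are illustrative assumptions, not DeepSeek's actual code or API.

```python
# Schematic sketch of the two-stage design described above; names and
# return values are placeholders, not DeepSeek's implementation.
from dataclasses import dataclass


@dataclass
class VisionTokens:
    """Compressed representation of a page image."""
    count: int


class DeepEncoderSketch:
    """Stands in for the ~380M-parameter DeepEncoder: image -> few vision tokens."""

    def encode(self, page_image: bytes) -> VisionTokens:
        # A real encoder would run a vision backbone and downsample aggressively;
        # here we simply pretend every page compresses to 100 tokens.
        return VisionTokens(count=100)


class MoEDecoderSketch:
    """Stands in for the ~3B-parameter MoE decoder with ~570M active parameters."""

    def generate(self, tokens: VisionTokens, prompt: str) -> str:
        # A real decoder would autoregressively emit text conditioned on the
        # vision tokens; here we return a placeholder string.
        return f"<decoded text from {tokens.count} vision tokens>"


def ocr_page(page_image: bytes) -> str:
    """End-to-end flow: compress the page image, then decode it back to text."""
    tokens = DeepEncoderSketch().encode(page_image)
    return MoEDecoderSketch().generate(tokens, prompt="Transcribe this page.")


print(ocr_page(b"fake image bytes"))
```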
Model efficiency
Model can handle various document types
DeepSeek-OCR can compress text by up to a factor of 10 while retaining about 97% of the original information. The model can handle various document types, including plain text, diagrams, chemical formulae, and geometric figures, while preserving the original formatting. It can also output plain text or provide general image descriptions. However, the number of vision tokens required varies with document size and image resolution.
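The resolution dependence can be sketched with generic vision-transformer-style arithmetic; the 16-pixel patch size and 16x token downsampling below are assumptions made for illustration, not DeepSeek's published settings.

```python
# Back-of-the-envelope sketch of why vision-token needs grow with image
# resolution. Patch size and downsampling factor are generic assumptions.

def estimate_vision_tokens(width_px: int, height_px: int,
                           patch_px: int = 16, downsample: int = 16) -> int:
    """Count image patches, then assume heavy downsampling into vision tokens."""
    patches = (width_px // patch_px) * (height_px // patch_px)
    return max(1, patches // downsample)

for side in (640, 1024, 1280):
    print(f"{side}x{side} page image -> ~{estimate_vision_tokens(side, side)} vision tokens")
```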
Model evaluation
DeepSeek-OCR was tested on two benchmarks
DeepSeek-OCR can generate training data for LLMs and vision language models (VLMs) at a rate of more than 200,000 pages per day on a single NVIDIA A100 GPU. The model was tested on two benchmarks: OmniDocBench, which evaluates document-parsing capabilities, and Fox, which assesses how well vision language models focus on dense PDF documents.
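For a sense of scale, the reported daily figure converts to roughly two pages per second; the multi-GPU numbers below assume naive linear scaling and are not benchmarked results.

```python
# Rough throughput arithmetic based on the reported 200,000+ pages/day on
# one A100; multi-GPU scaling here is a naive linear assumption.

PAGES_PER_GPU_PER_DAY = 200_000
SECONDS_PER_DAY = 24 * 60 * 60

pages_per_second = PAGES_PER_GPU_PER_DAY / SECONDS_PER_DAY
print(f"~{pages_per_second:.1f} pages/second on a single A100")

for gpus in (8, 160):
    print(f"{gpus} GPUs (assuming linear scaling): ~{gpus * PAGES_PER_GPU_PER_DAY:,} pages/day")
```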