Paper: Multimodal OCR: Parse Anything from Documents
Institution: Huazhong University of Science and Technology & Xiaohongshu hi lab
Link: https://arxiv.org/abs/2603.13032
Code: https://github.com/rednote-hilab/dots.mocr
One-Sentence Summary
The Xiaohongshu team proposes a new paradigm for document parsing: Multimodal OCR (MOCR). The core idea is simple but far-reaching: charts, icons, UI screenshots, chemical formulas, and other graphics in documents are no longer cropped out as opaque "images" and discarded. Instead, they are parsed directly into renderable SVG code.
They trained dots.mocr, a model with only 3B parameters, which achieves striking results in both document parsing and graphic structuring.
▲ Figure 1: MOCR Overview. Input a document image; output a unified structured representation—text becomes Markdown, charts become SVG code.
Why This Paper Matters
1. A New Paradigm, Not Just a New SOTA
Traditional OCR systems (including the recently popular document LLMs) operate as follows:
- Text → Recognized as text ✅
- Tables → Recognized as structured markup ✅
- Charts/Icons/Flowcharts/UI Screenshots → Cropped as images, end of story ❌
This means a vast amount of informative graphical elements in documents are effectively treated as black boxes and discarded. The "parsing result" you receive is inherently lossy.
MOCR says: No, graphics must be parsed too. And not by generating a text description, but by directly outputting renderable SVG code—which you can open in a browser, edit, and recombine.
▲ Figure 2: Traditional OCR only handles text, cropping graphics into pixels and discarding them; MOCR parses graphics into structured SVG code, achieving truly "lossless" document parsing.
This isn't just about "doing better"; it redefines "what document parsing should do."
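To make "renderable SVG code" concrete, here is an illustrative, hand-written SVG string for a tiny bar chart (not actual model output), validated with Python's standard XML tooling:

```python
# Illustrative only: a hand-written SVG of the kind MOCR aims to emit for a
# simple bar chart -- renderable in any browser and editable as plain text.
import xml.etree.ElementTree as ET

svg_code = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 120 80">
  <rect x="10" y="40" width="20" height="30" fill="#4c72b0"/>
  <rect x="40" y="20" width="20" height="50" fill="#55a868"/>
  <rect x="70" y="30" width="20" height="40" fill="#c44e52"/>
  <text x="10" y="15" font-size="8">Q1-Q3 revenue</text>
</svg>"""

root = ET.fromstring(svg_code)      # well-formed XML -> renderable SVG
bars = [el for el in root if el.tag.endswith("rect")]
labels = [el.text for el in root if el.tag.endswith("text")]
print(len(bars), labels)            # 3 ['Q1-Q3 revenue']
```

Saved to a file, this string opens in any browser as three colored bars with a text label; change a `fill` or `height` attribute and the chart changes accordingly. That is the editability a cropped PNG can never offer.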
2. A 3B Model Defeats a Host of Large Models
Let's look at the scorecard:
▲ Figure 3: Comprehensive performance of dots.mocr in document parsing and graphic parsing.
Document Parsing (Traditional OCR Direction):
- On the OCR Arena Elo leaderboard, it ranks second only to Gemini 3 Pro, surpassing all open-source models.
- Achieved a new SOTA of 83.9 on olmOCR-Bench.
- Secured top scores in categories like ArXiv papers, tables, and multi-column layouts.
Graphic Structuring Parsing (SVG Direction):
- On multiple benchmarks, including Chart→SVG, UI→SVG, Scientific Illustration→SVG, and Chemical Structure→SVG, it comprehensively surpasses Gemini 3 Pro.
What does a 3B model beating Gemini 3 Pro at graphic parsing demonstrate? That targeted architecture design and data engineering can outperform general-purpose large models on specific tasks.
3. From Xiaohongshu—Hardcore Industrial Research
This paper comes from Xiaohongshu's hi lab. The first author and corresponding author are from Professor Bai Xiang's team at Huazhong University of Science and Technology (a top-tier team in OCR/Document Understanding). Both code and models are fully open-sourced.
Given Xiaohongshu's massive volume of image-text content requiring understanding and indexing, MOCR is likely not just a paper, but a core upgrade to their content understanding pipeline.
Technical Deep Dive
Architecture: Large Visual Encoder + Small Language Decoder
The architecture of dots.mocr is quite interesting:
- Visual Encoder: 1.2B parameters, trained completely from scratch (not fine-tuned from existing ones), supporting native high-resolution input up to ~11 million pixels.
- Language Decoder: Qwen2.5-1.5B, using the base version rather than the chat version for initialization.
- Lightweight Connector: Bridges vision and language.
Why is the visual encoder so large? Because document parsing requires simultaneously seeing small font text and precisely locating graphic elements (markers in charts, connecting lines in flowcharts), demanding extremely high resolution.
Why use a base model instead of a chat model? Because MOCR needs to generate highly structured sequences (Markdown, LaTeX, SVG code). This is a completely different output distribution from "conversation," making training from a base model more appropriate.
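A back-of-envelope sketch of why the ~11-million-pixel cap matters for the token budget. The 16-pixel patch size and 2x2 patch merging below are illustrative assumptions, not figures from the paper:

```python
# Rough arithmetic (assumed numbers: 16-px patches and 2x2 patch merging
# are NOT from the paper) for the vision-token cost of high-res input.
def vision_tokens(width, height, patch=16, merge=2):
    """ViT-style patch count, then spatial merging before the decoder."""
    patches = (width // patch) * (height // patch)
    return patches // (merge * merge)

w, h = 3840, 2880                     # ~11.1M pixels, near the stated cap
print(w * h, vision_tokens(w, h))     # 11059200 10800
```

Even with aggressive merging, a full-resolution page costs thousands of vision tokens, which is why a large, purpose-built encoder carries so much of the load in this architecture.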
Training Strategy: Three-Stage Progressive Pre-training
- Stage One: General vision-language alignment, teaching the language model to "see images."
- Stage Two: Mixed training—general vision data + text document parsing, building robust text OCR capabilities.
- Stage Three: Increasing the proportion of MOCR-specific tasks, especially Graphic→SVG parsing.
The three stages progressively increase input resolution to match increasingly difficult task requirements.
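The curriculum above can be sketched as a configuration; the resolution caps here are illustrative placeholders, not values from the paper:

```python
# Sketch of the three-stage curriculum as config; max_pixels values are
# placeholders chosen only to show the monotonically increasing resolution.
STAGES = [
    {"name": "alignment", "max_pixels": 1_000_000,
     "data": ["general vision-language pairs"]},
    {"name": "mixed",     "max_pixels": 4_000_000,
     "data": ["general vision", "text document parsing"]},
    {"name": "mocr",      "max_pixels": 11_000_000,
     "data": ["document parsing", "graphic-to-SVG"]},
]

# Resolution grows with stage difficulty.
caps = [s["max_pixels"] for s in STAGES]
assert caps == sorted(caps)
```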
After pre-training, there is Instruction Fine-Tuning (SFT), divided into two versions:
- dots.mocr: the general version, balanced between document parsing and graphic parsing.
- dots.mocr-svg: the SVG-enhanced version, with a higher proportion of SVG data in the SFT phase.
Data Engine: Four Major Data Sources
This might be the most valuable part of the entire paper—Data Engineering Determines the Model's Ceiling.
- PDF Documents: Auto-labeled using their own dots.ocr, with stratified sampling by language/domain/layout complexity.
- Web Rendering: Crawling web pages and rendering them into images. HTML/DOM structures naturally provide alignment signals, and the abundance of native SVG elements in web pages serves directly as training data.
- SVG Graphic Resources: Collecting native SVG files from the web, cleaned and deduplicated (code-level + perceptual hashing) via svgo, with complexity-balanced sampling.
- General Vision Data: To maintain the model's general visual capabilities.
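The dedup step on SVG resources can be sketched as two levels: an exact hash on whitespace-normalized source, plus a perceptual difference hash (dHash) on the rendered raster. Both functions below are generic sketches, not necessarily the paper's exact variants:

```python
# Two-level dedup sketch: exact code-level hashing plus a generic dHash on
# a rendered grayscale grid (here a synthetic 8x9 grid stands in for a
# real rasterization step).
import hashlib

def code_hash(svg_code: str) -> str:
    """Exact-duplicate detection on whitespace-normalized SVG source."""
    return hashlib.sha256(" ".join(svg_code.split()).encode()).hexdigest()

def dhash(gray, size=8):
    """Difference hash over a size x (size+1) grayscale grid:
    one bit per horizontally adjacent pixel pair."""
    bits = 0
    for r in range(size):
        for c in range(size):
            bits = (bits << 1) | (gray[r][c] > gray[r][c + 1])
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

img = [[(r * 37 + c * 53) % 256 for c in range(9)] for r in range(8)]
near = [[v + 1 for v in row] for row in img]   # tiny brightness shift
print(hamming(dhash(img), dhash(near)))        # 0 -> perceptual duplicates
```

Exact hashing catches copy-pasted files; the perceptual hash additionally catches re-exports of the same drawing whose source strings differ.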
Particularly noteworthy is the handling of SVG data: a single image can have countless different SVG representations (the code is not unique). The paper addresses this through canonicalization, viewBox standardization, and complexity control.
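A minimal sketch of one canonicalization step: rounding numeric attributes and reformatting the viewBox so that equivalent SVG strings converge toward a single form. The paper's full pipeline (svgo cleanup, complexity control) does considerably more than this:

```python
# Canonicalization sketch: the same drawing admits many source strings, so
# numbers are rounded and the viewBox reformatted to collapse trivial
# variants. Not the paper's exact pipeline.
import re
import xml.etree.ElementTree as ET

ET.register_namespace("", "http://www.w3.org/2000/svg")

def canonicalize(svg_code: str, precision: int = 2) -> str:
    root = ET.fromstring(svg_code)
    x, y, w, h = map(float, root.get("viewBox", "0 0 100 100").split())
    root.set("viewBox", f"{x:g} {y:g} {w:g} {h:g}")  # uniform number format
    rounder = lambda m: f"{round(float(m.group()), precision):g}"
    for el in root.iter():
        for key, val in el.attrib.items():
            el.set(key, re.sub(r"-?\d+\.\d+", rounder, val))
    return ET.tostring(root, encoding="unicode")

a = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100.0 100.00">'
     '<rect x="10.0001" y="5.00" width="20" height="10"/></svg>')
print(canonicalize(a))
```

After canonicalization, `x="10.0001"` and `x="10.004"` map to the same string, so the model never has to learn that spurious precision differences are meaningless.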
Evaluation Method: OCR Arena
Traditional metrics like WER and NED are too fragile for document parsing—slight format differences lead to excessive penalties. The paper proposes the OCR Arena evaluation framework:
- Using Gemini 3 Flash as the judge.
- Pairwise comparison of model outputs.
- Bi-directional evaluation (AB and BA evaluated separately) to eliminate position bias.
- Using the Elo rating system (similar to chess rankings) to generate the final leaderboard.
- 1000 bootstrap resamplings to ensure statistical robustness.
This evaluation method itself is highly reference-worthy.
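The scoring loop can be sketched as standard Elo updates over position-debiased pairwise verdicts. The K-factor and initial rating below are conventional defaults, not values from the paper:

```python
# Elo sketch for the OCR Arena setup: each match is judged twice with the
# positions swapped (AB and BA), the two verdicts are averaged to cancel
# position bias, and ratings update with the standard Elo formula.

def elo_update(ra, rb, score_a, k=32):
    """score_a: 1 win, 0.5 tie, 0 loss for model A."""
    expect_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    delta = k * (score_a - expect_a)
    return ra + delta, rb - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
matches = [("model_a", "model_b", 1.0, 1.0),   # A wins in both orders
           ("model_a", "model_b", 1.0, 0.0)]   # split verdict -> tie
for a, b, ab, ba in matches:
    score = (ab + ba) / 2                      # debiased verdict
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score)
print(ratings)   # A ends above B; total rating mass is conserved
```

The bootstrap step in the paper then repeats this over 1000 resampled match sets to put confidence intervals on the final rankings.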
Actual Effect Demonstration
Let's look at some actual parsing cases:
▲ Figure 4: Layout analysis results of dots.mocr on various complex documents—accurately recognizing academic papers, newspapers, tables, and multi-language documents.
▲ Figure 5: Graphic parsing results of dots.mocr-svg. Inputting various icon images yields renderable SVG code with extremely high reconstruction quality.
Key Numbers
| Metric | dots.mocr |
| --- | --- |
| olmOCR-Bench | 83.9 (new SOTA) |
| OCR Arena Elo | Second only to Gemini 3 Pro; best open-source |
| Total parameters | 3B (1.2B visual encoder + Qwen2.5-1.5B decoder) |
| Max input resolution | ~11 million pixels |
My Thoughts
Not Just an OCR Improvement, But an Expansion of "Document Understanding"
Previously, discussions on document parsing defaulted to "extracting text." MOCR pushes the boundary to "extracting all structurable information." This has a direct impact on downstream RAG, knowledge base construction, and multimodal pre-training data production.
The Ingenuity of SVG as a Unified Representation
Why choose SVG over other formats? Because SVG is:
- Renderable: Opens directly in browsers.
- Editable: Colors, sizes, and text can be modified.
- Searchable: Text within SVG is real text.
- Composable: Multiple SVGs can be stitched together.
- Self-describing: The code itself is a structured representation.
Using SVG as the unified output format for graphic parsing is a very elegant design choice.
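The "searchable" property is easy to demonstrate: text in SVG output consists of real text nodes that ordinary XML tooling can index, unlike pixels in a cropped PNG (the chart content below is a made-up example):

```python
# Text inside SVG is first-class data: any XML library can extract and
# index it, with no OCR pass needed on the rendered image.
import xml.etree.ElementTree as ET

svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100">
  <rect x="0" y="0" width="200" height="100" fill="#eee"/>
  <text x="10" y="30">Revenue by quarter</text>
  <text x="10" y="60">Q4: 1.2M</text>
</svg>"""

ns = "{http://www.w3.org/2000/svg}"
texts = [t.text for t in ET.fromstring(svg).iter(f"{ns}text")]
print(texts)   # ['Revenue by quarter', 'Q4: 1.2M']
```

For downstream RAG or search, this means labels and values inside parsed charts land directly in the index rather than vanishing into pixels.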
Data Engineering >> Model Scale
The reason a 3B model can defeat general large models far larger than itself lies not in innovative architecture, but in:
- A high-resolution visual encoder trained from scratch.
- A carefully designed multi-stage training strategy.
- An extremely meticulous data engine.
This reaffirms an industry consensus: for specific tasks, if data quality and engineering are done right, small models can completely crush large models.
Limitations
The paper candidly points out that document parsing and SVG parsing currently require two separate passes rather than a single end-to-end inference, though future iterations should unify them into one step. There is also room for improvement on scenarios such as old scans and headers/footers.
Conclusion
MOCR represents a true paradigm shift in the field of document parsing—moving from "looking only at text" to "everything can be parsed." The Xiaohongshu team not only proposed this new paradigm but also proved its feasibility and competitiveness with a small 3B model.
Both code and models are open source. If you are working on document intelligence, this paper and model are worth serious study.
📄 Paper: https://arxiv.org/abs/2603.13032
💻 Code: https://github.com/rednote-hilab/dots.mocr
This article was AI-assisted and has been manually reviewed and proofread.