Xiaohongshu's "Everything is OCR": A 3B Small Model Outperforms Giants, Parsing Charts into Code

Paper: Multimodal OCR: Parse Anything from Documents
Institution: Huazhong University of Science and Technology & Xiaohongshu hi lab
Link: https://arxiv.org/abs/2603.13032
Code: https://github.com/rednote-hilab/dots.mocr

One-Sentence Summary

The Xiaohongshu team has proposed a groundbreaking new paradigm for document parsing: Multimodal OCR (MOCR). The core idea is simple yet far-reaching: charts, icons, UI interfaces, chemical formulas, and other graphics within documents are no longer cropped out as mere "images" and discarded. Instead, they are parsed directly into renderable SVG code.

They trained dots.mocr, a model with just 3B parameters, which achieves astonishing results in both document parsing and graphic structuring.

MOCR Overview

▲ Figure 1: MOCR Overview. Input a document image; output a unified structured representation—text becomes Markdown, charts become SVG code.


Why This Paper Matters

1. A New Paradigm, Not Just a New SOTA

Traditional OCR systems (including the recently popular document LLMs) operate as follows:

  • Text → Recognized as text ✅
  • Tables → Recognized as structured markup ✅
  • Charts/Icons/Flowcharts/UI Screenshots → Cropped as images, end of story ❌

This means a vast amount of informative graphical elements in documents are effectively treated as black boxes and discarded. The "parsing result" you receive is inherently lossy.

MOCR says: No, graphics must be parsed too. And not by generating a text description, but by directly outputting renderable SVG code—which you can open in a browser, edit, and recombine.

Traditional OCR vs MOCR

▲ Figure 2: Traditional OCR only handles text, cropping graphics into pixels and discarding them; MOCR parses graphics into structured SVG code, achieving truly "lossless" document parsing.

This isn't just about "doing better"; it redefines "what document parsing should do."
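To make the paradigm concrete, here is a minimal sketch of what consuming such unified output could look like: Markdown for text, inline SVG for graphics, all in one stream. The output layout and the `extract_svg` helper are illustrative assumptions, not the actual dots.mocr interface:

```python
import xml.etree.ElementTree as ET

# Hypothetical MOCR-style output: Markdown for the text, inline SVG for the
# chart. The exact serialization used by dots.mocr is an assumption here.
model_output = """# Quarterly Report

Revenue grew 12% year over year.

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 60">
  <rect x="10" y="30" width="15" height="30" fill="steelblue"/>
  <rect x="40" y="10" width="15" height="50" fill="steelblue"/>
</svg>
"""

def extract_svg(text: str) -> str:
    """Pull the first inline <svg>...</svg> block out of mixed output."""
    start = text.index("<svg")
    end = text.index("</svg>") + len("</svg>")
    return text[start:end]

svg = extract_svg(model_output)
root = ET.fromstring(svg)  # parses only if the SVG is well-formed XML
bars = [el for el in root if el.tag.endswith("rect")]
print(len(bars))  # 2 bar elements recovered as structure, not pixels
```

The point of the sketch: once graphics come back as code, downstream systems can validate, query, and edit them with ordinary tooling instead of running a second vision model over a cropped bitmap.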

2. A 3B Model Defeats a Host of Large Models

Let's look at the scorecard:

Performance Comparison

▲ Figure 3: Comprehensive performance of dots.mocr in document parsing and graphic parsing.

Document Parsing (Traditional OCR Direction):

  • On the OCR Arena Elo leaderboard, it ranks second only to Gemini 3 Pro, surpassing all open-source models.
  • Achieved a new SOTA of 83.9 on olmOCR-Bench.
  • Secured top scores in categories like ArXiv papers, tables, and multi-column layouts.

Graphic Structure Parsing (SVG Direction):

  • On multiple benchmarks, including Chart→SVG, UI→SVG, Scientific Illustration→SVG, and Chemical Structure→SVG, it comprehensively surpasses Gemini 3 Pro.

What does it mean when a 3B model beats Gemini 3 Pro at graphic parsing? It means that targeted architecture design and data engineering can crush general-purpose large models on specific tasks.

3. From Xiaohongshu—Hardcore Industrial Research

This paper comes from Xiaohongshu's hi lab. The first author and corresponding author are from Professor Bai Xiang's team at Huazhong University of Science and Technology (a top-tier team in OCR/Document Understanding). Both code and models are fully open-sourced.

Given Xiaohongshu's massive volume of image-text content requiring understanding and indexing, MOCR is likely not just a paper, but a core upgrade to their content understanding pipeline.


Technical Deep Dive

Architecture: Large Visual Encoder + Small Language Decoder

The architecture of dots.mocr is quite interesting:

  • Visual Encoder: 1.2B parameters, trained completely from scratch (not fine-tuned from existing ones), supporting native high-resolution input up to ~11 million pixels.
  • Language Decoder: Qwen2.5-1.5B, using the base version rather than the chat version for initialization.
  • Lightweight Connector: Bridges vision and language.

Why is the visual encoder so large? Because document parsing requires simultaneously seeing small font text and precisely locating graphic elements (markers in charts, connecting lines in flowcharts), demanding extremely high resolution.

Why use a base model instead of a chat model? Because MOCR needs to generate highly structured sequences (Markdown, LaTeX, SVG code). This is a completely different output distribution from "conversation," making training from a base model more appropriate.

Training Strategy: Three-Stage Progressive Pre-training

  1. Stage One: General vision-language alignment, teaching the language model to "see images."
  2. Stage Two: Mixed training—general vision data + text document parsing, building robust text OCR capabilities.
  3. Stage Three: Increasing the proportion of MOCR-specific tasks, especially Graphic→SVG parsing.

The three stages progressively increase input resolution to match increasingly difficult task requirements.
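The progressive schedule can be captured as a small configuration sketch. All resolutions and mixture weights below are placeholders for illustration; the paper's actual values differ or are unreported:

```python
# Purely illustrative three-stage schedule; resolutions and data-mix ratios
# are placeholders, not numbers from the paper.
STAGES = [
    {"stage": 1, "goal": "vision-language alignment", "max_pixels": 1_000_000,
     "mix": {"general_vl": 1.0}},
    {"stage": 2, "goal": "text document OCR", "max_pixels": 4_000_000,
     "mix": {"general_vl": 0.4, "doc_parsing": 0.6}},
    {"stage": 3, "goal": "graphic -> SVG", "max_pixels": 11_000_000,
     "mix": {"general_vl": 0.2, "doc_parsing": 0.4, "svg_parsing": 0.4}},
]

# Sanity checks on the progressive design: resolution rises monotonically,
# and each stage's data mix sums to 1.
assert all(a["max_pixels"] < b["max_pixels"] for a, b in zip(STAGES, STAGES[1:]))
assert all(abs(sum(s["mix"].values()) - 1.0) < 1e-9 for s in STAGES)
print("schedule ok")
```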

After pre-training, there is Instruction Fine-Tuning (SFT), divided into two versions:

  • dots.mocr: General version, balanced between document parsing and graphic parsing.
  • dots.mocr-svg: SVG-enhanced version, increasing the proportion of SVG data during the SFT phase.

Data Engine: Four Major Data Sources

This might be the most valuable part of the entire paper—Data Engineering Determines the Model's Ceiling.

  1. PDF Documents: Auto-labeled using their own dots.ocr, with stratified sampling by language/domain/layout complexity.
  2. Web Rendering: Crawling web pages and rendering them into images. HTML/DOM structures naturally provide alignment signals, and the abundance of native SVG elements in web pages serves directly as training data.
  3. SVG Graphic Resources: Collecting native SVG files from the web, cleaned with svgo and deduplicated at both the code level and via perceptual hashing, with complexity-balanced sampling.
  4. General Vision Data: To maintain the model's general visual capabilities.

Particularly noteworthy is the processing of SVG data—a single image can have countless different SVG representations (code is not unique). The paper addresses this through normalization (canonicalization), viewBox standardization, and complexity control.
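The non-uniqueness problem can be illustrated with a toy canonicalization pass. The paper's actual pipeline uses svgo plus perceptual hashing; this sketch only rounds numeric attributes so that visually identical SVGs converge to the same code:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

def canonicalize(svg_text: str, precision: int = 1) -> str:
    """Toy canonicalization: round every numeric attribute so that visually
    identical SVGs with different code collapse to one representation.
    (Illustration only; not the paper's svgo-based pipeline.)"""
    root = ET.fromstring(svg_text)
    for el in root.iter():
        for key, val in list(el.attrib.items()):
            try:
                el.set(key, str(round(float(val), precision)))
            except ValueError:
                pass  # non-numeric attribute (e.g. a color name), leave as-is
    return ET.tostring(root, encoding="unicode")

# Two different code strings that draw the same rectangle:
a = '<svg xmlns="http://www.w3.org/2000/svg"><rect x="10.04" y="5.0001" width="20" height="10"/></svg>'
b = '<svg xmlns="http://www.w3.org/2000/svg"><rect x="10.0" y="5.0" width="20.0" height="10.0"/></svg>'
print(canonicalize(a) == canonicalize(b))  # True: one canonical target
```

Without a step like this, the training target for a given image would be ambiguous, and the loss would penalize the model for producing a correct but differently-formatted answer.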

Evaluation Method: OCR Arena

Traditional metrics like WER and NED are too fragile for document parsing—slight format differences lead to excessive penalties. The paper proposes the OCR Arena evaluation framework:

  • Using Gemini 3 Flash as the judge.
  • Pairwise comparison of model outputs.
  • Bi-directional evaluation (AB and BA evaluated separately) to eliminate position bias.
  • Using the Elo rating system (similar to chess rankings) to generate the final leaderboard.
  • 1000 bootstrap resamplings to ensure statistical robustness.

This evaluation method itself is highly reference-worthy.
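The Elo-plus-bootstrap procedure above can be sketched in a few lines. Initial ratings, the K-factor, and the battle data are illustrative placeholders, not the paper's exact protocol:

```python
import random

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update from a pairwise comparison.
    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

def run_elo(battles, n_models, rng):
    """One Elo pass over (model_a, model_b, score_a) battles."""
    order = battles[:]
    rng.shuffle(order)  # Elo is order-dependent; bootstrapping averages it out
    ratings = [1000.0] * n_models
    for a, b, s in order:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s)
    return ratings

# Toy judge data: model 0 wins 30 of 35 pairwise comparisons against model 1.
battles = [(0, 1, 1.0)] * 30 + [(0, 1, 0.0)] * 5

# Bootstrap: resample the battles many times, average the resulting ratings.
boot = []
for i in range(1000):
    rng = random.Random(i)
    sample = [rng.choice(battles) for _ in battles]
    boot.append(run_elo(sample, 2, rng))

mean0 = sum(r[0] for r in boot) / len(boot)
mean1 = sum(r[1] for r in boot) / len(boot)
print(mean0 > mean1)  # True: the stronger model ends with the higher rating
```

Resampling also yields confidence intervals on each rating for free, which is why Elo-over-bootstraps has become a standard way to rank models from noisy pairwise judgments.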


Actual Effect Demonstration

Let's look at some actual parsing cases:

Layout Analysis Results

▲ Figure 4: Layout analysis results of dots.mocr on various complex documents—accurately recognizing academic papers, newspapers, tables, and multi-language documents.

SVG Parsing Results

▲ Figure 5: Graphic parsing results of dots.mocr-svg. Inputting various icon images yields renderable SVG code with extremely high reconstruction quality.


Key Numbers

  • Model Parameters: 3B (Visual 1.2B + Language 1.5B + Connector)
  • Max Input Resolution: ~11 million pixels
  • olmOCR-Bench: 83.9 (new SOTA)
  • OCR Arena Ranking: #1 open source, #2 overall (behind only Gemini 3 Pro)
  • Graphic→SVG: surpasses Gemini 3 Pro on multiple benchmarks

My Thoughts

Not Just an OCR Improvement, But an Expansion of "Document Understanding"

Previously, discussions on document parsing defaulted to "extracting text." MOCR pushes the boundary to "extracting all structurable information." This has a direct impact on downstream RAG, knowledge base construction, and multimodal pre-training data production.

The Ingenuity of SVG as a Unified Representation

Why choose SVG over other formats? Because SVG is:

  • Renderable: Opens directly in browsers.
  • Editable: Colors, sizes, and text can be modified.
  • Searchable: Text within SVG is real text.
  • Composable: Multiple SVGs can be stitched together.
  • Self-describing: The code itself is a structured representation.

Using SVG as the unified output format for graphic parsing is a very elegant design choice.
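These properties are easy to verify with nothing but the standard library. The chart snippet below is a made-up example, not actual model output:

```python
import xml.etree.ElementTree as ET

NS = {"svg": "http://www.w3.org/2000/svg"}
ET.register_namespace("", NS["svg"])

# A hypothetical parsed chart: one bar plus its label, as SVG.
chart = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100">
  <rect x="20" y="40" width="30" height="50" fill="steelblue"/>
  <text x="20" y="30">Q1 Revenue</text>
</svg>"""

root = ET.fromstring(chart)

# Searchable: the label is real text, not pixels.
labels = [t.text for t in root.findall("svg:text", NS)]
print(labels)  # ['Q1 Revenue']

# Editable: recolor the bar without touching any raster data.
root.find("svg:rect", NS).set("fill", "tomato")
print("tomato" in ET.tostring(root, encoding="unicode"))  # True
```

A cropped PNG of the same chart supports none of these operations; that asymmetry is the whole argument for SVG as the unified target format.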

Data Engineering >> Model Scale

The reason a 3B model can defeat general-purpose models far larger than itself lies not in a novel architecture, but in:

  • A high-resolution visual encoder trained from scratch.
  • A carefully designed multi-stage training strategy.
  • An extremely meticulous data engine.

This reaffirms an industry consensus: for specific tasks, if data quality and engineering are done right, small models can completely crush large models.

Limitations

The paper honestly points out: currently, document parsing and SVG parsing need to be run twice (not end-to-end in one go), though future iterations should unify this into a single inference step. Additionally, there is room for improvement in scenarios like old scans and headers/footers.


Conclusion

MOCR represents a true paradigm shift in the field of document parsing—moving from "looking only at text" to "everything can be parsed." The Xiaohongshu team not only proposed this new paradigm but also proved its feasibility and competitiveness with a small 3B model.

Both code and models are open source. If you are working on document intelligence, this paper and model are worth serious study.

📄 Paper: https://arxiv.org/abs/2603.13032
💻 Code: https://github.com/rednote-hilab/dots.mocr

This article was AI-assisted and has been manually reviewed and proofread.
