Nanonets-OCR: A Fine-Tune of Qwen2.5-VL-3B with Stronger Document Parsing | Hands-On Test Results Included

Published: 2025-06-25

1. Nanonets-OCR

Overview

Nanonets-OCR goes beyond plain text extraction: it intelligently parses formulas, tables, watermarks, signatures, charts, checkboxes, and other complex structures in an image, and outputs cleanly formatted Markdown (a minimal inference sketch follows the feature list below).

Core Features

LaTeX formula recognition: automatically converts mathematical formulas in the text to standard LaTeX

Intelligent image description: recognizes charts, QR codes, and similar content and generates structured descriptions

Signature detection and isolation: precisely locates signature regions in a document

Watermark extraction: reliably detects and extracts watermark text from a document

Checkbox recognition: normalizes checkbox states into standard symbols for downstream processing

Complex table extraction: handles nested table structures and outputs Markdown/HTML
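
To make the list above concrete, here is a minimal inference sketch in Python. It assumes the nanonets/Nanonets-OCR-s checkpoint on Hugging Face and the standard Qwen2.5-VL chat interface exposed through transformers; the prompt wording is illustrative rather than the exact prompt Nanonets recommends.

```python
# Minimal sketch: run Nanonets-OCR-s on a single page image via transformers.
# Assumptions: the nanonets/Nanonets-OCR-s checkpoint exposes the standard
# Qwen2.5-VL chat interface, and the prompt text below is illustrative only.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR-s"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page.png")
prompt = (
    "Extract the text from this document as Markdown. Keep equations in LaTeX, "
    "tables in HTML, image descriptions inside <img></img>, watermarks inside "
    "<watermark></watermark>, and checkboxes as ☐/☑."
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": prompt},
]}]

chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[chat], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```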

Training Process

The model was trained on 250,000 pages of image-text data spanning research, finance, healthcare, legal, invoice, and receipt domains, combining synthetic data with human annotation; it was then fine-tuned on top of Qwen2.5-VL-3B.
Notes:

● The fine-tuning data contains no handwriting, so handwritten text is not yet supported

● A degree of hallucination risk remains (a limitation of the model size)

2. Effect Tests

Online demo: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s

● Paper cover page (with watermark)
[Image: paper cover page with a watermark]

Output Markdown:

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Zhang Li¹, Yuliang Liu¹,†, Qiang Liu², Zhiyin Ma¹, Ziyang Zhang¹, Shuo Zhang¹, Zidun Guo¹, Jiarui Zhang², Xinyu Wang¹, Xiang Bai¹

¹Huazhong University of Science and Technology, ²Kingsoft Office

<img>
A bar chart comparing performance metrics across different datasets and models. The x-axis shows different document types (e.g., Formula (EN), Formula (ZH), Table (EN), Table (ZH), Exam paper, Academic Papers, Newspaper, Overall, Infer Speed) and their corresponding values. The y-axis shows the performance metric (e.g., accuracy, speed). The bars represent different models and their corresponding values.
</img>

Figure 1: Performance comparison of MonkeyOCR and other SOTA models on OmniDocBench [33]. “Overall” represents the comprehensive evaluation across nine document types in OmniDocBench.

Abstract
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU’s modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions – “Where is it?” (structure), “What is it?” (recognition), and “How is it organized?” (relation) – corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.

<watermark>arXiv:2506.05218v1 [cs.CV] 5 Jun 2025</watermark>

Technical Report. †Project lead.

● Tables and figures
[Image: paper page with a table and a chart]

Output Markdown:

<table>
<thead>
<tr>
<th><strong>Model Type</strong></th>
<th><strong>Models</strong></th>
<th><strong>Book</strong></th>
<th><strong>Slides</strong></th>
<th><strong>Financial Report</strong></th>
<th><strong>Textbook</strong></th>
<th><strong>Exam Paper</strong></th>
<th><strong>Magazine</strong></th>
<th><strong>Academic Papers</strong></th>
<th><strong>Notes</strong></th>
<th><strong>Newspaper</strong></th>
<th><strong>Overall</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3"><strong>Pipeline Tools</strong></td>
<td>MinerU <sup>[43]</sup></td>
<td>0.055</td>
<td>0.124</td>
<td>0.033</td>
<td>0.102</td>
<td>0.159</td>
<td><strong>0.072</strong></td>
<td>0.025</td>
<td>0.984</td>
<td>0.171</td>
<td>0.206</td>
</tr>
<tr>
<td>Marker <sup>[35]</sup></td>
<td>0.074</td>
<td>0.340</td>
<td>0.089</td>
<td>0.319</td>
<td>0.452</td>
<td>0.153</td>
<td>0.059</td>
<td>0.651</td>
<td>0.192</td>
<td>0.274</td>
</tr>
<tr>
<td>Mathpix <sup>[26]</sup></td>
<td>0.131</td>
<td>0.220</td>
<td>0.202</td>
<td>0.216</td>
<td>0.278</td>
<td>0.147</td>
<td>0.091</td>
<td>0.634</td>
<td>0.690</td>
<td>0.300</td>
</tr>
<tr>
<td rowspan="3"><strong>Expert VLMs</strong></td>
<td>GOT-OCR <sup>[45]</sup></td>
<td>0.111</td>
<td>0.222</td>
<td>0.067</td>
<td>0.132</td>
<td>0.204</td>
<td>0.198</td>
<td>0.179</td>
<td>0.388</td>
<td>0.771</td>
<td>0.267</td>
</tr>
<tr>
<td>Nougat <sup>[3]</sup></td>
<td>0.734</td>
<td>0.958</td>
<td>1.000</td>
<td>0.820</td>
<td>0.930</td>
<td>0.830</td>
<td>0.214</td>
<td>0.991</td>
<td>0.871</td>
<td>0.806</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="3"><strong>General VLMs</strong></td>
<td>GPT4o <sup>[32]</sup></td>
<td>0.157</td>
<td>0.163</td>
<td>0.348</td>
<td>0.187</td>
<td>0.281</td>
<td>0.173</td>
<td>0.146</td>
<td>0.607</td>
<td>0.751</td>
<td>0.316</td>
</tr>
<tr>
<td>Qwen2.5-VL-7B <sup>[1]</sup></td>
<td>0.148</td>
<td><strong>0.053</strong></td>
<td>0.111</td>
<td>0.137</td>
<td>0.189</td>
<td>0.117</td>
<td>0.134</td>
<td>0.204</td>
<td>0.706</td>
<td>0.205</td>
</tr>
<tr>
<td>InternVL3-8B <sup>[5]</sup></td>
<td>0.163</td>
<td><strong>0.056</strong></td>
<td>0.107</td>
<td>0.109</td>
<td><strong>0.129</strong></td>
<td>0.100</td>
<td>0.159</td>
<td><strong>0.150</strong></td>
<td>0.681</td>
<td>0.188</td>
</tr>
<tr>
<td rowspan="2"><strong>Mix</strong></td>
<td><strong>MonkeyOCR-3B</strong></td>
<td><strong>0.046</strong></td>
<td>0.120</td>
<td><strong>0.024</strong></td>
<td><strong>0.100</strong></td>
<td><strong>0.129</strong></td>
<td><strong>0.086</strong></td>
<td><strong>0.024</strong></td>
<td>0.643</td>
<td><strong>0.131</strong></td>
<td><strong>0.155</strong></td>
</tr>
<tr>
<td><strong>MonkeyOCR-3B*</strong></td>
<td>0.054</td>
<td>0.203</td>
<td>0.038</td>
<td>0.112</td>
<td>0.138</td>
<td>0.032</td>
<td><strong>0.194</strong></td>
<td>0.136</td>
<td><strong>0.120</strong></td>
<td></td>
</tr>
</tbody>
</table>

Table 3: The end-to-end text recognition performance on OmniDocBench across 9 PDF page types.
* represents the use of the layout model trained by us with improved capability for Chinese layout detection.

Specifically, it attained the highest end-to-end recognition accuracy in six categories. The 3B model outperformed InternVL3-8B <sup>[5]</sup> by 5% and surpassed MinerU <sup>[43]</sup> by 3.3% in overall accuracy. Notably, on the newspaper category, MonkeyOCR outperformed the previous state-of-the-art MinerU by 4%, demonstrating its strong capability in parsing dense and complex layouts. These results highlight MonkeyOCR’s superior generalization ability and robustness across various document types. Moreover, benefiting from enhanced Chinese language capabilities, MonkeyOCR* outperforms the original version by 44.9% on the notes category, achieving state-of-the-art overall performance.

5.3 Implement Details

During the training process, we utilize the AdamW optimizer with a learning rate of 2e-5 and a cosine learning rate schedule. We employ a batch size of 64. Our 3B model was trained for 53 hours on 32 A800 GPUs. By integrating with LMDeploy <sup>[7]</sup>, our model can successfully run on RTX 3090 GPUs.

<img>Bar chart comparing performance of MonkeyOCR-3B, Gemini2.0-flash, Gemini2.5-Pro, Qwen2-VL-72B, Qwen2.5-VL-72B, and InternVL2-76B across different document parsing tasks.</img>
**Figure 6:** **End-to-end evaluation on OmniDocBench.** Performance comparison of MonkeyOCR with closed-source and extra-large open-source VLMs across different document parsing tasks.

6 Discussion

As is well-established, increasing model scale generally leads to improved performance. To further explore the potential of MonkeyOCR, we conducted comparative evaluations against both larger open-source models and leading closed-source commercial solutions on OmniDocBench. As illustrated in Figure 6, MonkeyOCR achieves the highest overall performance on English documents, outperforming Qwen2.5-VL-72B by 7.4% and surpassing the current state-of-the-art closed-source model, Gemini 2.5-Pro, by 0.8%. However, Gemini 2.5-Pro demonstrates slightly better performance on Chinese documents, indicating there is still some margin for improvement in MonkeyOCR’s Chinese document parsing capabilities.

Even parts of the bold formatting are recognized correctly.
[Image: recognition result]

● LaTeX formulas
[Image: paper page containing formulas]

Output Markdown:

selecting high-scoring pages for inclusion. This results in a curated set of 951,000 high-quality samples.

**Manual Annotation for Chinese Documents.** To address the limited availability of Chinese region-level reading order annotations, we manually annotate a diverse set of Chinese documents, including research reports, academic papers, user manuals, books, test papers, slides, official documents, newspapers, journals, and contracts. This effort produces an additional 154,000 high-quality samples, substantially enhancing the representation of Chinese document scenarios.

**Expert Model-Based Auto-Annotation.** For datasets that provide only region-level bounding boxes without reading order information, we leverage expert models to generate region-level reading order annotations automatically. Specifically, we utilize PPOCR [17] for line-wise text recognition within each region, obtain text line positions, and then apply LayoutReader [44] to predict the reading order of these lines. The region-level order is determined by aggregating the predicted order of all text lines within each region. Through this approach, we generate 78,000 additional region-level annotations, further enriching the diversity and coverage of our dataset.

## 4 MonkeyOCR

<img>A diagram illustrating the overall architecture of MonkeyOCR. It shows a pipeline with four main stages: Diverse PDF Types, Structure, Recognition, and Relation. Each stage is represented by a box with an arrow pointing to the next stage. The Diverse PDF Types stage shows various types of documents like textbooks, exam papers, academic papers, books, newspapers, slides, notes, financial reports, and magazines. The Structure stage involves cropping the input image, followed by recognition and relation prediction. The Recognition stage extracts structured information from each region in parallel. The Relation stage determines the logical reading order of the detected elements. Finally, the output is serialized as HTML, JSON, or Markdown.</img>

**Figure 5: The overall architecture of MonkeyOCR.** The system adopts a Structure-Recognition-Relation framework, consisting of structure detection, which locates and classifies semantic regions; block-level content recognition, which extracts structured information from each region in parallel; and relation prediction, which determines the logical reading order of the detected elements.

The proposed method, **MonkeyOCR**, addresses the fundamental limitations of both pipeline-based and end-to-end document parsing approaches by introducing a modular yet globally optimized Structure-Recognition-Relation (SRR) framework. As illustrated in Figure 5, we decompose the document parsing process into three relatively independent but tightly integrated stages: *structure detection*, *block-level content recognition*, and *relation prediction*. This design aims to mitigate the cumulative error typically observed in pipeline toolchains, while also improving inference efficiency by reducing the context length compared to monolithic end-to-end models.

In the first stage, a YOLO-based [49] document layout detector processes the input image $I \in \mathbb{R}^{H \times W \times 3}$, producing a set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$ and their corresponding element types $T = \{t_1, t_2, \ldots, t_n\}$. Each bounding box $b_i = (x_{1i}, y_{1i}, x_{2i}, y_{2i})$ represents the spatial coordinates of the $i$-th element, and the element type $t_i \in \{\text{text}, \text{table}, \text{formula}, \text{figure}, \ldots\}$ specifies the category of the detected element.

For the second stage, we perform block-level content recognition in parallel. Each detected region $b_i$ is cropped and, together with a type-specific prompt $p_{t_i}$, is fed into our LMM for type-aware content extraction:

$$C = \text{LMM}( \{I^1_{\text{crop}}, I^2_{\text{crop}}, \ldots, I^n_{\text{crop}}\}, \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}),$$

Pages containing LaTeX are no problem either: the formulas are preserved exactly in LaTeX syntax.
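
Since the display formulas come back wrapped in $$ ... $$, pulling them out for downstream use (rendering, validation, indexing) is straightforward. The helper below is a small sketch based on the delimiter convention visible in the sample output above; inline math ($ ... $) would need its own pattern.

```python
import re

def extract_display_math(markdown: str) -> list[str]:
    """Return the display formulas ($$ ... $$) found in the OCR Markdown output."""
    return [m.strip() for m in re.findall(r"\$\$(.+?)\$\$", markdown, flags=re.DOTALL)]
```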

● Watermarks
[Image: paper page with multiple watermarks]

Output Markdown:

selecting high-scoring pages for inclusion. This results in a curated set of 951,000 high-quality samples.

**Manual Annotation for Chinese Documents.** To address the limited availability of Chinese region-level reading order annotations, we manually annotate a diverse set of Chinese documents, including research reports, academic papers, user manuals, books, test papers, slides, official documents, newspapers, journals, and contracts. This effort produces an additional 154,000 high-quality samples, substantially enhancing the representation of Chinese document scenarios.

**Expert Model-Based Auto-Annotation.** For datasets that provide only region-level bounding boxes without reading order information, we leverage expert models to generate region-level reading order annotations automatically. Specifically, we utilize PPOCR [17] for line-wise text recognition within each region, obtain text line positions, and then apply LayoutReader [44] to predict the reading order of these lines. The region-level order is determined by aggregating the predicted order of all text lines within each region. Through this approach, we generate 78,000 additional region-level annotations, further enriching the diversity and coverage of our dataset.

## 4 MonkeyOCR

<img>A diagram illustrating the overall architecture of MonkeyOCR. It shows the process of converting diverse PDF types into structured data. The steps include structure detection, block-level content recognition, and relation prediction.</img>

**Figure 5: The overall architecture of MonkeyOCR.** The system adopts a Structure-Recognition-Relation framework, consisting of structure detection, which locates and classifies semantic regions; block-level content recognition, which extracts structured information from each region in parallel; and relation prediction, which determines the logical reading order of the detected elements.

The proposed method, **MonkeyOCR**, addresses the fundamental limitations of both pipeline-based and end-to-end document parsing approaches by introducing a modular yet globally optimized Structure-Recognition-Relation (SRR) framework. As illustrated in Figure 5, we decompose the document parsing process into three relatively independent but tightly integrated stages: *structure detection*, *block-level content recognition*, and *relation prediction*. This design aims to mitigate the cumulative error typically observed in pipeline toolchains, while also improving inference efficiency by reducing the context length compared to monolithic end-to-end models.

In the first stage, a YOLO-based [49] document layout detector processes the input image $I \in \mathbb{R}^{H \times W \times 3}$, producing a set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$ and their corresponding element types $T = \{t_1, t_2, \ldots, t_n\}$. Each bounding box $b_i = (x_{1i}, y_{1i}, x_{2i}, y_{2i})$ represents the spatial coordinates of the $i$-th element, and the element type $t_i \in \{\text{text}, \text{table}, \text{formula}, \text{figure}, \ldots\}$ specifies the category of the detected element.

For the second stage, we perform block-level content recognition in parallel. Each detected region $b_i$ is cropped and, together with a type-specific prompt $p_{ti}$, is fed into our LMM for type-aware content extraction:

$$C = \text{LMM}(\{I^1_\text{crop}, I^2_\text{crop}, \ldots, I^n_\text{crop}\}, \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}),$$

The watermarks are removed completely, while the body text, figures, and formulas all come through intact.

3. Summary

Traditional pipeline approaches can only detect that an image is present; they cannot interpret its content. By contrast, Nanonets-OCR does not just see the text: it extracts concrete semantic information from the images themselves, enriching the parsed document.

In advanced RAG scenarios, the multimodal capability of a VLM can be used to summarize images; during the recall stage, vector search over these image descriptions retrieves the relevant images, improving the trustworthiness of the RAG system.
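
As a sketch of that idea: the descriptions Nanonets-OCR emits inside <img>...</img> tags can be embedded and searched directly. The snippet below is a minimal illustration, assuming sentence-transformers for the embeddings (the model name is just an example) and the tag convention shown in the samples above; a production RAG stack would persist the vectors in a real vector index.

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def extract_image_descriptions(ocr_markdown: str) -> list[str]:
    # Nanonets-OCR wraps its semantic image descriptions in <img>...</img> tags.
    return [d.strip() for d in re.findall(r"<img>(.*?)</img>", ocr_markdown, flags=re.DOTALL)]

# Example encoder; swap in whatever embedding model the rest of the RAG stack uses.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_image_index(ocr_markdown: str):
    descriptions = extract_image_descriptions(ocr_markdown)
    vectors = encoder.encode(descriptions, normalize_embeddings=True)
    return descriptions, vectors

def retrieve_images(query: str, descriptions, vectors, top_k: int = 3):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since the vectors are L2-normalized
    best = np.argsort(-scores)[:top_k]
    return [(descriptions[i], float(scores[i])) for i in best]
```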

