Why Use Markdown, and How to Use It

Published: 2025-08-31

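When you work with large volumes of documents, especially when building a knowledge base, analyzing documents, or training large language models (LLMs), converting files in various formats (PDF, Word, Excel, PPT, HTML, and so on) into a single Markdown format can make processing noticeably more efficient and improve compatibility across tools. Microsoft's open-source project MarkItDown was built for exactly this purpose.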

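What is MarkItDown?

MarkItDown is a lightweight Python tool that converts many file formats to Markdown and is aimed at LLMs and related text-analysis pipelines. Compared with tools such as Pandoc, MarkItDown focuses on preserving a document's structure and content, such as headings, lists, tables, and links, so its Markdown output suits both human readers and machine processing.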

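Why choose MarkItDown?

1. Broad format support

MarkItDown converts many formats to Markdown, including:

  • PDF

  • PowerPoint (PPTX)

  • Word (DOCX)

  • Excel (XLSX)

  • Images (EXIF metadata and OCR)

  • Audio (EXIF metadata and speech transcription)

  • HTML

  • Text-based formats (such as CSV, JSON, XML)

  • ZIP files (iterates over the contents)

  • YouTube URLs

  • EPUB

  • and more

This breadth makes MarkItDown a practical tool for handling mixed collections of documents.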

For HTML to Markdown, the conversion is handled internally by markdownify.

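2. Optimized for LLMs

MarkItDown's Markdown output is tuned for use as LLM input and makes efficient use of the model's context window. MarkItDown also provides an MCP (Model Context Protocol) server for live integration with LLM applications, for example Claude Desktop.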

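3. A capable plugin system

MarkItDown has a plugin system, so users can extend it as needed. For example, one user wrote a plugin that processes DOCX files and extracts their images into a specified folder.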

How do you use MarkItDown?

Installing MarkItDown

Install with pip:

pip install markitdown

To install all optional features (support for more formats and plugins):

pip install 'markitdown[all]'

Basic usage

Suppose you have a Word file, example.docx, and want to convert it to Markdown:

markitdown example.docx

The output is printed to the terminal. To save it to a file instead:

markitdown example.docx > example.md

Python code

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("example.docx")
print(result.text_content)

Converting example.docx produces the following example.md:

# Sample Document

This document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should be completely accessible using assistive technologies such as screen readers.

## Headings

There are eight section headings in this document. At the beginning, "Sample Document" is a level 1 heading. The main section headings, such as "Headings" and "Lists" are level 2 headings. The Tables section contains two sub-headings, "Simple Table" and "Complex Table," which are both level 3 headings.

## Lists

The following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, "Tables," contains a nested unordered (bulleted) list with two items.

1. Headings
2. Lists
3. Links
4. Images
5. Tables

* Simple Tables
* Complex Tables

1. Columns

## Links

In web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word documents or PDFs:

[Top of this Page](#_top)
[Sample Document](http://www.dhs.state.il.us/page.aspx?item=67072)
[Sample Document (docx)](http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx)

## Images

![Web Access Symbol](data:image/gif;base64...)Documents may contain images. For example, there is an image of the web accessibility symbol to the left of this paragraph. Its alternate text is "Web Access Symbol".

Alt text should communicate what an image means, not how it looks.

![Chart of Screen Reader Market Share. (Unfortunately, there isn't a way in Word or PDF to include rich formatting, such as a table, in alternate text.)](data:image/png;base64...)Some images, such as charts or graphs, require long descriptions, but not all document types allow that. In web pages, long descriptions may be provided in several ways: on the page below the image, via a link below the image, or via a link on the image.

## Tables

### Simple Tables

Simple tables have a uniform number of columns and rows, without any merged cells:

| **Screen Reader** | **Responses** | **Share** |
| --- | --- | --- |
| JAWS | 853 | 49% |
| NVDA | 238 | 14% |
| Window-Eyes | 214 | 12% |
| System Access | 181 | 10% |
| VoiceOver | 159 | 9% |

### Complex Tables

The following is a complex table, using merged cells as headers for sections within the table. This can't be made accessible in all types of documents:

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
|  | **May 2012** | | **September 2010** | |
| **Screen Reader** | **Responses** | **Share** | **Responses** | **Share** |
| JAWS | 853 | 49% | 727 | 59% |
| NVDA | 238 | 14% | 105 | 9% |
| Window-Eyes | 214 | 12% | 138 | 11% |
| System Access | 181 | 10% | 58 | 5% |
| VoiceOver | 159 | 9% | 120 | 10% |

## Columns

This is an example of columns. With columns, the page is split into two or more horizontal sections. Unlike tables, in which you usually read across a row and then down to the next, in columns, you read down a column and then across to the next.When columns are not created correctly, screen readers may run lines together, reading the first line of the first column, then the first line of the second column, then the second line of the first column, and so on. Obviously, that is not accessible.


Examples for other formats

1. PDF to Markdown

markitdown document.pdf > document.md

MarkItDown tries to preserve headings, lists, tables, and links.
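Because preservation is best effort, it can help to check what actually survived the conversion. The following is a minimal sketch that counts common Markdown structures in a converted file. The function name `structure_summary` and the regular expressions are illustrative assumptions, not part of MarkItDown, and the patterns are line-level heuristics rather than a full Markdown parser.

```python
import re

# Rough line-level patterns for common Markdown structures (heuristics only).
HEADING = re.compile(r"^#{1,6}\s+\S")
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+\S")
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")
# Negative lookbehind skips image syntax such as ![alt](src).
LINK = re.compile(r"(?<!!)\[[^\]]+\]\([^)]+\)")


def structure_summary(markdown_text: str) -> dict:
    """Count headings, list items, table rows, and links in Markdown text.

    table_rows includes the header separator row (| --- | --- |).
    """
    counts = {"headings": 0, "list_items": 0, "table_rows": 0, "links": 0}
    for line in markdown_text.splitlines():
        if HEADING.match(line):
            counts["headings"] += 1
        elif TABLE_ROW.match(line):
            counts["table_rows"] += 1
        elif LIST_ITEM.match(line):
            counts["list_items"] += 1
        # Links can appear on any kind of line, so count them separately.
        counts["links"] += len(LINK.findall(line))
    return counts
```

Running this on document.md and comparing the counts with the original PDF is a quick way to spot a table or heading level that did not survive.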

2. PPTX to Markdown

markitdown slides.pptx > slides.md

The content of each slide is converted into Markdown headings and list structures.

3. Excel to Markdown

markitdown data.xlsx > data.md

Tables are kept as Markdown tables, which makes them easy to read in notes or on GitHub.
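To make the target format concrete, here is a small sketch of how rows of cells map onto a Markdown pipe table. It illustrates the output format only and is not MarkItDown's implementation; `to_markdown_table` is a hypothetical helper name.

```python
def to_markdown_table(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so cell text cannot break the table, then pad short rows.
        cells = [str(c).replace("|", "\\|") for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

For example, `to_markdown_table([["Screen Reader", "Responses", "Share"], ["JAWS", 853, "49%"]])` yields a header row, a `| --- | --- | --- |` separator, and one data row, the same shape as the tables in example.md above.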

4. Batch processing a ZIP file

If you have a ZIP archive containing multiple files:

markitdown archive.zip

MarkItDown automatically walks through every file in the ZIP and generates Markdown output.
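When the files sit in an ordinary folder rather than a ZIP archive, a similar batch run can be scripted. The sketch below is one possible approach, not a MarkItDown feature: `convert_folder`, its parameters, and the default suffix list are assumptions made for illustration. The converter is passed in as a plain function so the helper does not depend on any particular library.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_folder(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], str],
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx", ".html"),
) -> list:
    """Convert every matching file under src_dir and write .md files to out_dir.

    `convert` takes a file path and returns Markdown text.
    Returns the list of output paths that were written.
    """
    src, out = Path(src_dir), Path(out_dir)
    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Mirror the source layout; keep the original name so that
        # report.pdf and report.docx do not overwrite each other.
        target = out / path.relative_to(src).parent / (path.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(path)), encoding="utf-8")
        written.append(target)
    return written


# With MarkItDown installed, the converter could be wired up like this:
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   convert_folder("docs", "docs_md", lambda p: md.convert(p).text_content)
```

Passing the converter as a callable also makes it easy to wrap the call in error handling, so one unreadable file does not stop the whole batch.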

Advanced example: images and audio

MarkItDown can process image and audio files, and will try to extract EXIF metadata or transcribe speech:

# Image
markitdown photo.jpg > photo.md

# Audio
markitdown audio.mp3 > audio.md

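Using MarkItDown in LLM workflows

MarkItDown is especially well suited to working alongside large language models (LLMs). You can:

  • first convert all your documents to Markdown;

  • then feed the Markdown to a model for question answering or summarization;

  • keep structured content (headings, lists, tables) intact, which helps the LLM understand the material.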
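One way to act on the steps above is to split the converted Markdown at its headings, so that each chunk passed to a model carries its section context. The following is a minimal sketch using only the standard library. `split_by_headings` and the chunk layout (a dict with `path` and `text`) are hypothetical choices for illustration, not part of MarkItDown.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown_text: str) -> list:
    """Split Markdown into one chunk per section.

    Each chunk is a dict with the heading path (e.g. "Doc > Tables")
    and the section text, including its own heading line.
    """
    chunks, path, buf = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        buf.clear()

    for line in markdown_text.splitlines():
        # Ignore '#' lines inside fenced code blocks (e.g. shell comments).
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        m = None if in_fence else HEADING.match(line)
        if m:
            flush()  # close the previous section under its own path
            level = len(m.group(1))
            # Drop headings at the same or deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, m.group(2)))
        buf.append(line)
    flush()
    return chunks
```

Each chunk's `path` can be prepended to the prompt so the model knows where the excerpt sits in the document. Long sections may still need further splitting to fit a given context window.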

GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown.

