Add pandoc-based DOCX parsing

2026-02-15 21:54:54 +08:00
parent f167aa2111
commit 4324699a3d
3 changed files with 54 additions and 6 deletions
--- a/temp/scripts/README.md
+++ b/temp/scripts/README.md
@@ -6,10 +6,11 @@

 The parser tries several parsing methods in priority order to maximize compatibility:

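-1. **MarkItDown** (official Microsoft library) - recommended; most consistent formatting
-2. **python-docx / python-pptx / pandas** (mature Python libraries) - most detailed output
-3. **unstructured / pypdf** (mature PDF libraries) - PDF only
-4. **Native XML parsing** (fallback) - no dependencies required
+1. **pypandoc-binary** (DOCX only, bundles Pandoc) - produces structured Markdown
+2. **MarkItDown** (official Microsoft library) - recommended; most consistent formatting
+3. **python-docx / python-pptx / pandas** (mature Python libraries) - most detailed output
+4. **unstructured / pypdf** (mature PDF libraries) - PDF only
+5. **Native XML parsing** (fallback) - no dependencies required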

 ### Features

@@ -44,6 +45,16 @@ scripts/
 uv run parser.py file.docx
 ```

+### Using pypandoc-binary (DOCX)
+
+```bash
+# Install automatically with uv
+uv run --with pypandoc-binary parser.py file.docx
+
+# Or install manually
+pip install pypandoc-binary
+```
+
 ### Using MarkItDown

 ```bash
@@ -81,7 +92,7 @@ pip install pypdf

 ```bash
 # Install all parsing libraries
-uv run --with markitdown --with python-docx --with python-pptx --with pandas --with tabulate --with unstructured --with pypdf parser.py file.pdf
+uv run --with pypandoc-binary --with markitdown --with python-docx --with python-pptx --with pandas --with tabulate --with unstructured --with pypdf parser.py file.pdf
 ```

 ## Command-line usage
@@ -125,6 +136,7 @@ uv run parser.py report.pdf
 uv run parser.py report.docx > output.md

 # Use a specific dependency
+uv run --with pypandoc-binary parser.py report.docx > output.md
 uv run --with python-docx parser.py report.docx > output.md
 uv run --with pypdf parser.py report.pdf > output.md
 ```
@@ -195,8 +207,16 @@ uv run --with "markitdown[pdf]" parser.py report.pdf -s "重要内容" -n 2

 ### DOCX parsers

+DOCX files are parsed by trying the following parsers in priority order:
+
+1. pypandoc-binary
+2. MarkItDown
+3. python-docx
+4. Native XML
+
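 | Parser | Pros | Cons | Best for |
 |---------|------|--------|---------|
+| **pypandoc-binary** | • Bundles Pandoc, works out of the box<br>• Clean Markdown structure<br>• Clear error messages that are easy to troubleshoot | • DOCX only<br>• Large package size | • When standardized Markdown output is needed<br>• Preferred parsing path |
 | **MarkItDown** | • Consistent formatting<br>• Official Microsoft support<br>• Good compatibility | • Requires installation<br>• Fairly terse output | • When standard-format output is needed<br>• Automated document processing |
 | **python-docx** | • Most detailed output<br>• Preserves full structure<br>• Supports complex styles | • Requires installation<br>• May include extra blank lines | • When precise control over output is needed<br>• Analyzing document structure |
 | **Native XML** | • No dependencies<br>• Fast<br>• Outputs raw content | • Formatting may be inconsistent<br>• Limited style handling | • When dependencies are unavailable<br>• Quick content extraction |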
@@ -533,6 +553,7 @@ A: For large files, use native XML parsing (fastest), or process outside the script

 ### Latest version

+- DOCX parsing adds a pypandoc-binary option, set as the highest priority
 - Split the monolithic script into a modular structure (common.py, docx.py, pptx.py, xlsx.py, parser.py)
 - Added XLSX file support
 - Added PDF file support (MarkItDown, unstructured, pypdf)
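
The priority order the README describes for DOCX (pypandoc-binary, then MarkItDown, then python-docx, then native XML) amounts to a fallback chain. The following is a minimal sketch of that idea and is not taken from this commit's `parser.py`: the names `parse_with_fallback` and `parse_with_pypandoc` are hypothetical, and the only third-party call assumed is `pypandoc.convert_file`, imported lazily so that a missing package simply falls through to the next parser.

```python
from typing import Callable, List, Sequence, Tuple

# A parser takes a file path and returns Markdown text (hypothetical type alias).
Parser = Callable[[str], str]


def parse_with_fallback(path: str, parsers: Sequence[Tuple[str, Parser]]) -> Tuple[str, str]:
    """Try each (name, parser) pair in order; return the first non-empty result."""
    errors: List[str] = []
    for name, parse in parsers:
        try:
            text = parse(path)
        except Exception as exc:  # includes ImportError when a library is missing
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


def parse_with_pypandoc(path: str) -> str:
    """Highest-priority DOCX parser in this sketch; requires pypandoc-binary."""
    import pypandoc  # lazy import: absence triggers the fallback, not a crash

    return pypandoc.convert_file(path, "markdown", format="docx")
```

A caller would pass the parsers in the documented order, for example `parse_with_fallback("report.docx", [("pypandoc-binary", parse_with_pypandoc), ...])`, with the remaining entries supplied by the project's own MarkItDown, python-docx, and XML parsers. Collecting each failure message keeps the final error easy to troubleshoot when no parser succeeds.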