Update parser usage instructions

2026-02-17 15:50:25 +08:00
parent 99927c2263
commit 856700fbe0
1 changed files with 145 additions and 124 deletions
--- a/temp/scripts/README.md
+++ b/temp/scripts/README.md
@@ -13,6 +13,8 @@
 5. **unstructured / pypdf** (mature PDF libraries) - PDF only
 6. **Native XML parsing** (fallback) - no dependencies required

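+The script tries the parsers in the priority order above and automatically falls back to the next one when a parser fails, so installing every parser dependency for the document type you handle gives the best compatibility.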
+
 ### Features

 - Supports DOCX, PPTX, XLSX, and PDF formats
@@ -38,85 +40,70 @@ scripts/
 └── README.md     # This document
 ```

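-## Dependency Requirements
+## Installing Dependencies

-### Basic Usage (XML Parsing)
+The script runs in a standard Python environment (Python 3.6+); install dependencies with `pip`.
+
+Because each document type has several parsers that are tried in priority order, install the dependencies for **all** parsers of that type so that the script can fall back to the next parser when a higher-priority one fails.
+
+### DOCX Dependencies
+
+Parsing priority: Docling → pypandoc-binary → MarkItDown → python-docx → native XML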

 ```bash
-# Python 3.6+
-uv run parser.py file.docx
+pip install docling pypandoc-binary "markitdown[docx]" python-docx
 ```

-### Using Docling (Recommended)
+### PPTX Dependencies
+
+Parsing priority: Docling → MarkItDown → python-pptx → native XML

 ```bash
-# General-purpose parser covering DOCX/PPTX/XLSX/PDF
-uv run --with docling parser.py file.docx
-uv run --with docling parser.py file.pptx
-uv run --with docling parser.py file.xlsx
-uv run --with docling parser.py file.pdf
+pip install docling "markitdown[pptx]" python-pptx
 ```

+### XLSX Dependencies
+
+Parsing priority: Docling → MarkItDown → pandas → native XML
+
+```bash
+pip install docling "markitdown[xlsx]" pandas tabulate
+```
+
+### PDF Dependencies
+
+Parsing priority: Docling → MarkItDown → unstructured → pypdf
+
+```bash
+pip install docling "markitdown[pdf]" unstructured pypdf
+```
+
+### Installing All Dependencies
+
+To handle every document type, install all parser dependencies in one step:
+
+```bash
+pip install docling pypandoc-binary "markitdown[docx,pptx,xlsx,pdf]" python-docx python-pptx pandas tabulate unstructured pypdf
+```
+
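+> Note: MarkItDown requires the optional extras for each document type, such as `markitdown[docx]`, `markitdown[pptx]`, `markitdown[xlsx]`, and `markitdown[pdf]`; installing plain `markitdown` does not include parsing support for any format.
+
+### Native XML Parsing Only (No Dependencies Required)
+
+Without any third-party libraries installed, the script still works through its built-in native XML parsing (DOCX/PPTX/XLSX), though output formatting and quality are more limited.
+
+### Notes on Docling
+
 - Docling is currently the default first-priority parser; a single dependency gives unified output.
- On first run it automatically downloads OCR/vision models to the `uv` cache directory, so a network connection is required.
- If you only need Docling, you do not have to install other parser dependencies; the script falls back to the other parsers only when Docling fails.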
-
-### Using pypandoc-binary (DOCX)
-
-```bash
-# Install automatically with uv
-uv run --with pypandoc-binary parser.py file.docx
-
-# Or install manually
-pip install pypandoc-binary
-```
-
-### Using MarkItDown
-
-```bash
-# Install automatically with uv
-uv run --with markitdown parser.py file.docx
-uv run --with markitdown parser.py file.pptx
-uv run --with markitdown parser.py file.xlsx
-uv run --with "markitdown[pdf]" parser.py file.pdf
-
-# Or install manually
-pip install markitdown
-# Note: PDF support requires an extra install
-pip install "markitdown[pdf]"
-```
-
-### Using Dedicated Libraries
-
-```bash
-# Install automatically with uv
-uv run --with python-docx parser.py file.docx
-uv run --with python-pptx parser.py file.pptx
-uv run --with pandas --with tabulate parser.py file.xlsx
-uv run --with unstructured parser.py file.pdf
-uv run --with pypdf parser.py file.pdf
-
-# Or install manually
-pip install python-docx
-pip install python-pptx
-pip install pandas tabulate
-pip install unstructured
-pip install pypdf
-```
-
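-### All Dependencies
-
-```bash
-# Install all parsing libraries
-uv run --with docling --with pypandoc-binary --with markitdown --with python-docx --with python-pptx --with pandas --with tabulate --with unstructured --with pypdf parser.py file.pdf
-```
+- On first run it automatically downloads OCR/vision models to the cache directory, so a network connection is required.
+- The script automatically falls back to the other parsers when Docling fails.

 ## Command-Line Usage

 ### Basic Syntax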

 ```bash
-uv run parser.py <file_path> [options]
+python parser.py <file_path> [options]
 ```

 ### Required Arguments
@@ -142,43 +129,40 @@ uv run parser.py <file_path> [options]
 ### 1. Output the Full Markdown Content

 ```bash
-# Recommended: automatic parsing with Docling
-uv run --with docling parser.py report.docx
-uv run --with docling parser.py report.pdf
+# Parse DOCX
+python parser.py report.docx

-# Use the best available parser (DOCX/PPTX/XLSX)
-uv run parser.py report.docx
+# Parse PDF
+python parser.py report.pdf

-# Use the best available parser (PDF)
-uv run parser.py report.pdf
+# Parse PPTX
+python parser.py presentation.pptx
+
+# Parse XLSX
+python parser.py data.xlsx

 # Write output to a file
-uv run parser.py report.docx > output.md
-
-# Use a specific dependency
-uv run --with pypandoc-binary parser.py report.docx > output.md
-uv run --with python-docx parser.py report.docx > output.md
-uv run --with pypdf parser.py report.pdf > output.md
+python parser.py report.docx > output.md
 ```

 ### 2. Document Statistics

 ```bash
 # Count words
-uv run --with markitdown parser.py report.docx -c
-uv run --with unstructured parser.py report.pdf -c
+python parser.py report.docx -c
+python parser.py report.pdf -c

 # Count lines
-uv run --with markitdown parser.py report.docx -l
-uv run --with pypdf parser.py report.pdf -l
+python parser.py report.docx -l
+python parser.py report.pdf -l
 ```

 ### 3. Extract Headings

 ```bash
 # Extract all headings
-uv run --with python-docx parser.py report.docx -t
-uv run --with unstructured parser.py report.pdf -t
+python parser.py report.docx -t
+python parser.py report.pdf -t

 # Sample output (DOCX):
 # 第一章 概述
@@ -194,8 +178,8 @@ uv run --with unstructured parser.py report.pdf -t

 ```bash
 # Extract a specific section
-uv run --with python-docx parser.py report.docx -tc "第一章"
-uv run --with unstructured parser.py report.pdf -tc "第一章"
+python parser.py report.docx -tc "第一章"
+python parser.py report.pdf -tc "第一章"

 # Outputs the heading and everything nested under it
 ```
@@ -204,16 +188,16 @@ uv run --with unstructured parser.py report.pdf -tc "第一章"

 ```bash
 # Search for a keyword
-uv run --with markitdown parser.py report.docx -s "测试"
-uv run --with unstructured parser.py report.pdf -s "测试"
+python parser.py report.docx -s "测试"
+python parser.py report.pdf -s "测试"

 # Use a regular expression
-uv run --with markitdown parser.py report.docx -s "章节\s+\d+"
-uv run --with pypdf parser.py report.pdf -s "章节\s+\d+"
+python parser.py report.docx -s "章节\s+\d+"
+python parser.py report.pdf -s "章节\s+\d+"

 # Search with context (2 lines before and after)
-uv run --with markitdown parser.py report.docx -s "重要内容" -n 2
-uv run --with "markitdown[pdf]" parser.py report.pdf -s "重要内容" -n 2
+python parser.py report.docx -s "重要内容" -n 2
+python parser.py report.pdf -s "重要内容" -n 2

 # Sample output:
 ---
@@ -223,6 +207,56 @@ uv run --with "markitdown[pdf]" parser.py report.pdf -s "重要内容" -n 2
 ---
 ```

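+## Running with uv
+
+If you use [uv](https://github.com/astral-sh/uv) to manage your Python environment, `uv run --with` installs the dependencies automatically and runs the script, with no manual `pip install` needed.
+
+### Basic Usage
+
+```bash
+# Run without dependencies (native XML parsing only)
+uv run parser.py file.docx
+
+# Run with a specific dependency
+uv run --with docling parser.py file.docx
+```
+
+### Running by Document Type (All Parser Dependencies Installed)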
+
+```bash
+# DOCX - install all DOCX parsers
+uv run --with docling --with pypandoc-binary --with "markitdown[docx]" --with python-docx parser.py report.docx
+
+# PPTX - install all PPTX parsers
+uv run --with docling --with "markitdown[pptx]" --with python-pptx parser.py presentation.pptx
+
+# XLSX - install all XLSX parsers
+uv run --with docling --with "markitdown[xlsx]" --with pandas --with tabulate parser.py data.xlsx
+
+# PDF - install all PDF parsers
+uv run --with docling --with "markitdown[pdf]" --with unstructured --with pypdf parser.py report.pdf
+```
+
+### Running with All Dependencies Installed
+
+```bash
+uv run --with docling --with pypandoc-binary --with "markitdown[docx,pptx,xlsx,pdf]" --with python-docx --with python-pptx --with pandas --with tabulate --with unstructured --with pypdf parser.py file.pdf
+```
+
+### Batch Processing
+
+```bash
+# Linux/Mac
+for file in *.docx; do
+    uv run --with docling --with pypandoc-binary --with "markitdown[docx]" --with python-docx parser.py "$file" > "${file%.docx}.md"
+done
+
+# Windows PowerShell
+Get-ChildItem *.docx | ForEach-Object {
+    uv run --with docling --with pypandoc-binary --with "markitdown[docx]" --with python-docx parser.py $_.FullName > ($_.BaseName + ".md")
+}
+```
+
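 ## Parser Comparison

 ### DOCX Parsers
@@ -237,7 +271,7 @@ DOCX files are parsed by trying the following priority order:

 | Parser | Pros | Cons | Use cases |
 |---------|------|--------|---------|
-| **Docling** | • A single dependency covers all Office/PDF formats<br>• Built-in OCR with high recall on complex documents<br>• Stable Markdown output structure | • Large model download on first run<br>• Relatively higher memory use at runtime | • When you need “one-step” parsing<br>• When you need OCR/multimodal support |
+| **Docling** | • A single dependency covers all Office/PDF formats<br>• Built-in OCR with high recall on complex documents<br>• Stable Markdown output structure | • Large model download on first run<br>• Relatively higher memory use at runtime | • When you need "one-step" parsing<br>• When you need OCR/multimodal support |
 | **pypandoc-binary** | • Bundles Pandoc, works out of the box<br>• Clean Markdown output structure<br>• Clear, easy-to-debug error messages | • DOCX only<br>• Large package size | • When you need standardized Markdown output<br>• First choice when Docling is unavailable |
 | **MarkItDown** | • Well-formed output<br>• Officially supported by Microsoft<br>• Good compatibility | • Requires installation<br>• Relatively terse output | • When you need standard-format output<br>• Automated document processing |
 | **python-docx** | • Most detailed output<br>• Preserves the full structure<br>• Supports complex styles | • Requires installation<br>• May include extra blank lines | • When you need precise control over output<br>• Analyzing document structure |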
@@ -400,6 +434,8 @@ PDF files are parsed by trying the following priority order: Docling → MarkItDown → u

 ```
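 All parsing methods failed:
+- Docling: library not installed
+- pypandoc-binary: library not installed
 - MarkItDown: library not installed
 - python-docx: parsing failed: ...
 - Native XML parsing: document.xml does not exist or is inaccessible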
@@ -409,19 +445,12 @@ PDF files are parsed by trying the following priority order: Docling → MarkItDown → u

 ```
 All parsing methods failed:
+- Docling: library not installed
 - MarkItDown: MarkItDown parsing failed: ...
 - unstructured: unstructured library not installed
 - pypdf: pypdf library not installed
 ```

-All parsing methods failed:
-
- MarkItDown: library not installed
- python-docx: parsing failed: ...
- Native XML parsing: document.xml does not exist or is inaccessible
-
-```
-
 ### Search Errors

 ```bash
@@ -434,27 +463,17 @@ PDF files are parsed by trying the following priority order: Docling → MarkItDown → u

 ## Advanced Usage

-### Using with uv
-
-```bash
-# Install dependencies automatically and run
-uv run --with markitdown --with python-docx parser.py report.docx
-
-# Write output to a file
-uv run --with python-docx parser.py report.docx > output.md
-```
-
 ### Batch Processing

 ```bash
-# Batch-process with find or a glob
+# Linux/Mac
 for file in *.docx; do
-    uv run --with markitdown parser.py "$file" > "${file%.docx}.md"
+    python parser.py "$file" > "${file%.docx}.md"
 done

 # Windows PowerShell
 Get-ChildItem *.docx | ForEach-Object {
-    uv run --with markitdown parser.py $_.FullName > ($_.BaseName + ".md")
+    python parser.py $_.FullName > ($_.BaseName + ".md")
 }
 ```

@@ -462,10 +481,10 @@ Get-ChildItem *.docx | ForEach-Object {

 ```bash
 # Post-process the Markdown output
-uv run --with markitdown parser.py report.docx | grep "重要" > important.md
+python parser.py report.docx | grep "重要" > important.md

 # Process statistics
-uv run --with markitdown parser.py report.docx -l | awk '{print $1}'
+python parser.py report.docx -l | awk '{print $1}'
 ```

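 ## FAQ
@@ -478,7 +497,7 @@ A: Output detail varies by parser:
 - `MarkItDown` produces relatively terse output
 - `Native XML` outputs the raw content

-For complete content, try a dedicated-library parser.
+Install the dependencies for all parsers of that document type; the script then automatically picks the best available parser by priority.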

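 ### Q: Why do PDF files have no heading hierarchy?

@@ -508,19 +527,22 @@ export LANG=en_US.UTF-8

 A: The current version automatically picks the best available parser. To restrict the choice, comment out entries in the parser list in the code, or install/uninstall specific dependencies.

-### Q: MarkItDown reports that the PDF dependency is not installed?
+### Q: MarkItDown reports that a dependency is not installed?

-A: MarkItDown's PDF support is an optional dependency; use `markitdown[pdf]` instead of `markitdown`:
+A: MarkItDown requires the optional extras for each document type; installing plain `markitdown` does not include support for any format: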

 ```bash
-# Wrong
-uv run --with markitdown parser.py file.pdf
+# Wrong - includes no format support
+pip install markitdown

-# Correct
-uv run --with "markitdown[pdf]" parser.py file.pdf
-
-# Or install manually
+# Correct - install the extras for the formats you need
+pip install "markitdown[docx]"
+pip install "markitdown[pptx]"
+pip install "markitdown[xlsx]"
 pip install "markitdown[pdf]"
+
+# Or install all formats at once
+pip install "markitdown[docx,pptx,xlsx,pdf]"
 ```

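 ### Q: Large files are slow to process?
@@ -593,5 +615,4 @@ A: For large files, use native XML parsing (the fastest), or handle processing outside the script
 - Added PDF file support (MarkItDown, unstructured, pypdf)
 - Improved error handling (file existence checks, invalid format detection)
 - Improved documentation and examples
- Use uv for dependency management and running
 - All modules pass syntax checks and functional tests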