Add PDF file reading support

2026-02-14 23:20:47 +08:00
parent 8c27b08fdc
commit b022ac736b
4 changed files with 201 additions and 14 deletions
--- a/temp/scripts/README.md
+++ b/temp/scripts/README.md
@@ -1,6 +1,6 @@
 # Document Parser User Guide

-A modular document parser that converts DOCX, PPTX, and XLSX files to Markdown.
+A modular document parser that converts DOCX, PPTX, XLSX, and PDF files to Markdown.

 ## Overview

@@ -8,11 +8,12 @@

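 1. **MarkItDown** (Microsoft's official library) - recommended; the most consistent formatting
 2. **python-docx / python-pptx / pandas** (mature Python libraries) - the most detailed output
-3. **Native XML parsing** (fallback) - no dependencies required
+3. **unstructured / pypdf** (mature PDF libraries) - PDF only
+4. **Native XML parsing** (fallback) - no dependencies required

 ### Features

-- Supports DOCX, PPTX, and XLSX formats
+- Supports DOCX, PPTX, XLSX, and PDF formats
 - Automatically detects file type and validity
 - Preserves text formatting (bold, italic, underline)
 - Extracts tables and converts them to Markdown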
@@ -29,6 +30,7 @@ scripts/
 ├── docx.py       # DOCX file parsing
 ├── pptx.py       # PPTX file parsing
 ├── xlsx.py       # XLSX file parsing
+├── pdf.py        # PDF file parsing
 ├── parser.py     # Command-line entry point
 └── README.md     # This document
 ```
@@ -49,9 +51,12 @@ uv run parser.py file.docx
 uv run --with markitdown parser.py file.docx
 uv run --with markitdown parser.py file.pptx
 uv run --with markitdown parser.py file.xlsx
+uv run --with "markitdown[pdf]" parser.py file.pdf

 # Or install manually
 pip install markitdown
+# Note: PDF support requires the extra
+pip install "markitdown[pdf]"
 ```

 ### Using dedicated libraries
@@ -61,18 +66,22 @@ pip install markitdown
 uv run --with python-docx parser.py file.docx
 uv run --with python-pptx parser.py file.pptx
 uv run --with pandas --with tabulate parser.py file.xlsx
+uv run --with unstructured parser.py file.pdf
+uv run --with pypdf parser.py file.pdf

 # Or install manually
 pip install python-docx
 pip install python-pptx
 pip install pandas tabulate
+pip install unstructured
+pip install pypdf
 ```

 ### All dependencies

 ```bash
 # Run with every parsing library available
-uv run --with markitdown --with python-docx --with python-pptx --with pandas --with tabulate parser.py file.docx
+uv run --with markitdown --with python-docx --with python-pptx --with pandas --with tabulate --with unstructured --with pypdf parser.py file.pdf
 ```

 ## Command-line usage
@@ -85,7 +94,7 @@ uv run parser.py <file_path> [options]

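 ### Required arguments

-- `file_path`: path to a DOCX, PPTX, or XLSX file (relative or absolute)
+- `file_path`: path to a DOCX, PPTX, XLSX, or PDF file (relative or absolute)

 ### Optional arguments (mutually exclusive; only one may be used at a time)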

@@ -106,14 +115,18 @@ uv run parser.py <file_path> [options]
 ### 1. Output the full Markdown content

 ```bash
-# Use the best available parser
+# Use the best available parser (DOCX/PPTX/XLSX)
 uv run parser.py report.docx

+# Use the best available parser (PDF)
+uv run parser.py report.pdf
+
 # Write the output to a file
 uv run parser.py report.docx > output.md

 # Use a specific dependency
 uv run --with python-docx parser.py report.docx > output.md
+uv run --with pypdf parser.py report.pdf > output.md
 ```

 ### 2. Document statistics
@@ -121,9 +134,11 @@ uv run --with python-docx parser.py report.docx > output.md
 ```bash
 # Count characters
 uv run --with markitdown parser.py report.docx -c
+uv run --with unstructured parser.py report.pdf -c

 # Count lines
 uv run --with markitdown parser.py report.docx -l
+uv run --with pypdf parser.py report.pdf -l
 ```

 ### 3. Extract headings
@@ -131,12 +146,16 @@ uv run --with markitdown parser.py report.docx -l
 ```bash
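 # Extract all headings
 uv run --with python-docx parser.py report.docx -t
+uv run --with unstructured parser.py report.pdf -t

-# Sample output:
+# Sample output (DOCX):
 # 第一章 概述
 ## 1.1 背景
 ## 1.2 目标
 # 第二章 实现
+
+# Sample output (PDF; note that PDFs usually carry no explicit heading levels):
+[Content is extracted successfully, but the PDF may lack a clear heading structure]
 ```

 ### 4. Extract the content under a specific heading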
@@ -144,6 +163,7 @@ uv run --with python-docx parser.py report.docx -t
 ```bash
 # Extract a specific section
 uv run --with python-docx parser.py report.docx -tc "第一章"
+uv run --with unstructured parser.py report.pdf -tc "第一章"

 # Prints the heading together with all of its nested content
 ```
@@ -153,12 +173,15 @@ uv run --with python-docx parser.py report.docx -tc "第一章"
 ```bash
 # Search for a keyword
 uv run --with markitdown parser.py report.docx -s "测试"
+uv run --with unstructured parser.py report.pdf -s "测试"

 # Use a regular expression
 uv run --with markitdown parser.py report.docx -s "章节\s+\d+"
+uv run --with pypdf parser.py report.pdf -s "章节\s+\d+"

 # Search with context (2 lines before and after)
 uv run --with markitdown parser.py report.docx -s "重要内容" -n 2
+uv run --with "markitdown[pdf]" parser.py report.pdf -s "重要内容" -n 2

 # Sample output:
 ---
@@ -194,6 +217,14 @@ uv run --with markitdown parser.py report.docx -s "重要内容" -n 2
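 | **pandas** | • Powerful<br>• Handles complex tables<br>• Flexible data processing | • Requires installation<br>• Many dependencies | • Data analysis<br>• Complex table processing |
 | **Native XML** | • No dependencies<br>• Fast<br>• Supports all cell types | • Formatting may be inconsistent<br>• No data-processing capability | • When dependencies are unavailable<br>• Quick content extraction |

+### PDF parsers
+
+| Parser | Pros | Cons | Best for |
+|---------|------|--------|---------|
+| **MarkItDown** | • Consistent formatting<br>• Maintained by Microsoft<br>• Good compatibility | • Requires `markitdown[pdf]`<br>• Terser output | • Standardized output<br>• Automated document processing |
+| **unstructured** | • Powerful<br>• Supports table extraction<br>• Well-organized text | • Requires installation<br>• Output may include page-number markers | • Complete content<br>• Analyzing document structure |
+| **pypdf** | • Lightweight<br>• Fast<br>• Simple to install | • Requires installation<br>• Relatively basic features | • Quick content extraction<br>• Simple text extraction |
+
 ## Output format

 ### Markdown output structure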
@@ -248,6 +279,18 @@ uv run --with markitdown parser.py report.docx -s "重要内容" -n 2
 | 数据1 | 数据2 | 数据3 |
 ```

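+### PDF-specific format
+
+```markdown
+[Plain-text content of the PDF, extracted paragraph by paragraph]
+
+中电信粤亿迅〔2023〕3号
+
+关于印发关于印发关于印发关于印发《《《《广东亿迅科技有限公司员工
+
+[Note: PDFs usually carry no explicit heading hierarchy; content is emitted as a text stream]
+```
+
 ### Heading format

 - Headings use Markdown hash syntax: `#` through `######` (levels 1-6)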
@@ -297,7 +340,7 @@ uv run --with markitdown parser.py report.docx -s "重要内容" -n 2
 错误: 文件不存在: missing.docx

 # Invalid format
-错误: 不是有效的 DOCX、PPTX 或 XLSX 格式: invalid.txt
+错误: 不是有效的 DOCX、PPTX、XLSX 或 PDF 格式: invalid.txt
 ```

 ### Parser fallback
@@ -311,6 +354,23 @@ uv run --with markitdown parser.py report.docx -s "重要内容" -n 2
 - XML 原生解析: document.xml 不存在或无法访问
 ```

+**PDF fallback example**:
+
+```
+所有解析方法均失败:
+- MarkItDown: MarkItDown 解析失败: ...
+- unstructured: unstructured 库未安装
+- pypdf: pypdf 库未安装
+```
+
 ### Search errors

 ```bash
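@@ -369,6 +429,14 @@ A: Output detail varies by parser:

 For complete content, try one of the dedicated-library parsers.

+### Q: Why does a PDF have no heading levels?
+
+A: PDF is a page-description format and usually carries no semantic heading hierarchy. Unlike DOCX/PPTX, a heading in a PDF is typically just visually styled text, so the parsers cannot reliably identify heading levels. Suggestions:
+
+- Use the search feature to find specific content
+- Use `-l` to count lines and gauge the document's length
+- Use `-c` to count characters and gauge the document's size
+
 ### Q: Table formatting is wrong?

 A: Make sure the tables in the source document are structurally complete. The XML parser may not handle complex tables.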
@@ -389,6 +457,21 @@ export LANG=en_US.UTF-8

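 A: The current version automatically picks the best available parser. To restrict it, comment out entries in the parser list in the code, or install/uninstall specific dependencies.

+### Q: MarkItDown reports that the PDF dependency is missing?
+
+A: PDF support in MarkItDown is an optional extra, so use `markitdown[pdf]` rather than `markitdown`:
+
+```bash
+# Wrong
+uv run --with markitdown parser.py file.pdf
+
+# Right
+uv run --with "markitdown[pdf]" parser.py file.pdf
+
+# Or install manually
+pip install "markitdown[pdf]"
+```
+
 ### Q: Large files are slow to process?

 A: For large files, use native XML parsing (the fastest), or process them outside the script.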
@@ -421,6 +504,14 @@ A: For large files, use native XML parsing (the fastest), or process them outside the script
 | pandas | ~6,000 | ~109 | Medium |
 | Native XML | ~6,000 | ~109 | Fast |

+### PDF (test.pdf)
+
+| Parser | Characters | Lines | Relative speed |
+|---------|--------|------|---------|
+| MarkItDown | ~8,200 | ~1,120 | Fast |
+| unstructured | ~8,400 | ~600 | Medium |
+| pypdf | ~8,400 | ~600 | Fast |
+
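 ## Code style

 The scripts follow this code style:
@@ -444,6 +535,7 @@ A: For large files, use native XML parsing (the fastest), or process them outside the script

 - Split the monolithic script into a modular structure (common.py, docx.py, pptx.py, xlsx.py, parser.py)
 - Added XLSX file support
+- Added PDF file support (MarkItDown, unstructured, pypdf)
 - Improved error handling (file-existence check, invalid-format detection)
 - Improved documentation and examples
 - Uses uv for dependency management and execution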
--- a/temp/scripts/common.py
+++ b/temp/scripts/common.py
@@ -114,6 +114,16 @@ def is_valid_xlsx(file_path: str) -> bool:
        return False


+def is_valid_pdf(file_path: str) -> bool:
+    """Check whether the file is a valid PDF, judged by its %PDF header."""
+    try:
+        with open(file_path, "rb") as f:
+            header = f.read(4)
+            return header == b"%PDF"
+    except (IOError, OSError):
+        return False
+
+
 def remove_markdown_images(markdown_text: str) -> str:
     """Remove image markup from Markdown text."""
    return IMAGE_PATTERN.sub("", markdown_text)
@@ -286,7 +296,7 @@ def search_markdown(


 def detect_file_type(file_path: str) -> Optional[str]:
-    """Detect the file type; returns 'docx', 'pptx', or 'xlsx'."""
+    """Detect the file type; returns 'docx', 'pptx', 'xlsx', or 'pdf'."""
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()

@@ -299,5 +309,8 @@ def detect_file_type(file_path: str) -> Optional[str]:
    elif ext == ".xlsx":
        if is_valid_xlsx(file_path):
            return "xlsx"
+    elif ext == ".pdf":
+        if is_valid_pdf(file_path):
+            return "pdf"

    return None
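
The header check added to `common.py` can be exercised on its own. The sketch below is a standalone approximation, not the project's code: `looks_like_pdf` and `write_temp` are hypothetical names, and the only behavior assumed from the diff is the comparison of the first four bytes against `%PDF`. It also shows why the extension alone is not trusted: a ZIP payload saved under a `.pdf` name is rejected.

```python
import os
import tempfile


def looks_like_pdf(file_path: str) -> bool:
    """Hypothetical helper: True when the file begins with the bytes %PDF."""
    try:
        with open(file_path, "rb") as f:
            return f.read(4) == b"%PDF"
    except OSError:  # missing file, permission error, path is a directory, ...
        return False


def write_temp(data: bytes) -> str:
    """Write bytes to a temporary .pdf file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


real = write_temp(b"%PDF-1.7\n")  # minimal PDF-style header
fake = write_temp(b"PK\x03\x04")  # ZIP signature behind a .pdf name
try:
    print(looks_like_pdf(real))               # True
    print(looks_like_pdf(fake))               # False
    print(looks_like_pdf(real + ".missing"))  # False
finally:
    os.remove(real)
    os.remove(fake)
```

A four-byte check is cheap and catches mislabeled files, but it says nothing about whether the rest of the file is parseable; that is left to the parser fallback chain.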
--- a/temp/scripts/parser.py
+++ b/temp/scripts/parser.py
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""Command-line interface module for the document parser."""
+"""Command-line interface module for the document parser. Supports DOCX, PPTX, XLSX, and PDF files."""

 import argparse
 import os
@@ -7,16 +7,17 @@ import sys

 import common
 import docx
+import pdf
 import pptx
 import xlsx


 def main() -> None:
    parser = argparse.ArgumentParser(
-        description="将 DOCX、PPTX 或 XLSX 文件解析为 Markdown"
+        description="将 DOCX、PPTX、XLSX 或 PDF 文件解析为 Markdown"
    )

-    parser.add_argument("file_path", help="DOCX、PPTX 或 XLSX 文件的绝对路径")
+    parser.add_argument("file_path", help="DOCX、PPTX、XLSX 或 PDF 文件的绝对路径")

    parser.add_argument(
        "-n",
@@ -58,7 +59,7 @@ def main() -> None:

    file_type = common.detect_file_type(args.file_path)
    if not file_type:
-        print(f"错误: 不是有效的 DOCX、PPTX 或 XLSX 格式: {args.file_path}")
+        print(f"错误: 不是有效的 DOCX、PPTX、XLSX 或 PDF 格式: {args.file_path}")
        sys.exit(1)

    if file_type == "docx":
@@ -73,12 +74,18 @@ def main() -> None:
            ("python-pptx", pptx.parse_pptx_with_python_pptx),
            ("XML 原生解析", pptx.parse_pptx_with_xml),
        ]
-    else:
+    elif file_type == "xlsx":
        parsers = [
            ("MarkItDown", xlsx.parse_xlsx_with_markitdown),
            ("pandas", xlsx.parse_xlsx_with_pandas),
            ("XML 原生解析", xlsx.parse_xlsx_with_xml),
        ]
+    else:
+        parsers = [
+            ("MarkItDown", pdf.parse_pdf_with_markitdown),
+            ("unstructured", pdf.parse_pdf_with_unstructured),
+            ("pypdf", pdf.parse_pdf_with_pypdf),
+        ]

    failures = []
    content = None
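
The loop that consumes `parsers`, `failures`, and `content` is outside this hunk, so the following is only an assumed reconstruction of the fallback pattern. It relies on the `(content, error)` return convention visible in `pdf.py`; `run_parsers` and the two stub parsers are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional, Tuple

ParseResult = Tuple[Optional[str], Optional[str]]  # (content, error)
NamedParser = Tuple[str, Callable[[str], ParseResult]]


def run_parsers(
    file_path: str, parsers: List[NamedParser]
) -> Tuple[Optional[str], List[str]]:
    """Try each parser in order; return the first content and the failures so far."""
    failures: List[str] = []
    for name, parse in parsers:
        content, error = parse(file_path)
        if content is not None:
            return content, failures
        failures.append(f"- {name}: {error}")
    return None, failures


# Stub parsers standing in for the real library-backed functions.
def missing_library(_path: str) -> ParseResult:
    return None, "library not installed"


def working_parser(path: str) -> ParseResult:
    return f"text from {path}", None


content, failures = run_parsers(
    "report.pdf", [("first", missing_library), ("second", working_parser)]
)
print(content)   # text from report.pdf
print(failures)  # ['- first: library not installed']
```

Because every parser reports failure through its return value instead of raising, the order of the list is the only thing that decides which library wins.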
--- a/temp/scripts/pdf.py
+++ b/temp/scripts/pdf.py
@@ -0,0 +1,75 @@
+#!/usr/bin/env python3
+"""PDF parsing module providing three parsing methods."""
+
+from typing import Optional, Tuple
+
+
+def parse_pdf_with_markitdown(file_path: str) -> Tuple[Optional[str], Optional[str]]:
+    """Parse a PDF file with the MarkItDown library."""
+    try:
+        from markitdown import MarkItDown
+
+        md = MarkItDown()
+        result = md.convert(file_path)
+        if not result.text_content.strip():
+            return None, "文档为空"
+        return result.text_content, None
+    except ImportError:
+        return None, "MarkItDown 库未安装"
+    except Exception as e:
+        return None, f"MarkItDown 解析失败: {str(e)}"
+
+
+def parse_pdf_with_unstructured(file_path: str) -> Tuple[Optional[str], Optional[str]]:
+    """Parse a PDF file with the unstructured library."""
+    try:
+        from unstructured.partition.pdf import partition_pdf
+    except ImportError:
+        return None, "unstructured 库未安装"
+
+    try:
+        elements = partition_pdf(
+            filename=file_path,
+            strategy="fast",
+            infer_table_structure=True,
+            extract_images_in_pdf=False,
+        )
+
+        md_lines = []
+        for element in elements:
+            if hasattr(element, "text") and element.text and element.text.strip():
+                text = element.text.strip()
+                md_lines.append(text)
+                md_lines.append("")
+
+        content = "\n".join(md_lines).strip()
+        if not content:
+            return None, "文档为空"
+        return content, None
+    except Exception as e:
+        return None, f"unstructured 解析失败: {str(e)}"
+
+
+def parse_pdf_with_pypdf(file_path: str) -> Tuple[Optional[str], Optional[str]]:
+    """Parse a PDF file with the pypdf library."""
+    try:
+        from pypdf import PdfReader
+    except ImportError:
+        return None, "pypdf 库未安装"
+
+    try:
+        reader = PdfReader(file_path)
+        md_content = []
+
+        for page in reader.pages:
+            text = page.extract_text(extraction_mode="plain")
+            if text and text.strip():
+                md_content.append(text.strip())
+                md_content.append("")
+
+        content = "\n".join(md_content).strip()
+        if not content:
+            return None, "文档为空"
+        return content, None
+    except Exception as e:
+        return None, f"pypdf 解析失败: {str(e)}"
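
Both `parse_pdf_with_unstructured` and `parse_pdf_with_pypdf` assemble their output the same way: strip each text block, skip empty ones, and separate the rest with a blank line. As a sketch, that step can be isolated into a pure function and checked without any PDF library installed. `join_text_blocks` is a hypothetical name that does not exist in the commit.

```python
from typing import Iterable, Optional


def join_text_blocks(blocks: Iterable[Optional[str]]) -> Optional[str]:
    """Join non-empty text blocks with one blank line; None if nothing remains."""
    cleaned = [b.strip() for b in blocks if b and b.strip()]
    return "\n\n".join(cleaned) or None


print(repr(join_text_blocks(["  page one \n", None, "   ", "page two"])))
# 'page one\n\npage two'
print(join_text_blocks([None, ""]))
# None
```

Returning `None` for an all-empty input mirrors the "empty document" branch in the parsers, where a scanned PDF without a text layer is treated as a failure so that the next parser gets a chance.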