lyxy-document/scripts/readers/pdf/__init__.py
lanyuanxiaoyao 7eab1dcef1 test: add a comprehensive test suite covering all Reader implementations
- Increase the number of tests from 83 to 193 (+132%)
- Raise code coverage from 48% to 69% (+44% relative)
- Create separate tests for every Reader implementation of each document format
- Add cross-Reader consistency validation tests
- Add 4 new test specifications (cli-testing, exception-testing, reader-testing, test-fixtures)
- Update the test statistics in the README

Test coverage:
- DOCX: python-docx, markitdown, docling, native-xml, pypandoc, unstructured
- PDF: pypdf, markitdown, docling, docling-ocr, unstructured, unstructured-ocr
- HTML: html2text, markitdown, trafilatura, domscribe
- PPTX: python-pptx, markitdown, docling, native-xml, unstructured
- XLSX: pandas, markitdown, docling, native-xml, unstructured
- CLI: all command-line options and error handling

All 193 tests pass.
2026-03-08 22:20:21 +08:00

58 lines
1.5 KiB
Python

"""PDF file reader that supports multiple parsing methods (OCR first)."""

import os
from typing import List, Optional, Tuple

from scripts.readers.base import BaseReader
from scripts.utils import is_valid_pdf

from . import docling_ocr
from . import unstructured_ocr
from . import docling
from . import unstructured
from . import markitdown
from . import pypdf

# Parsers are tried in this order; the OCR-capable backends come first.
PARSERS = [
    ("docling OCR", docling_ocr.parse),
    ("unstructured OCR", unstructured_ocr.parse),
    ("docling", docling.parse),
    ("unstructured", unstructured.parse),
    ("MarkItDown", markitdown.parse),
    ("pypdf", pypdf.parse),
]


class PdfReader(BaseReader):
    """PDF file reader."""

    @property
    def supported_extensions(self) -> List[str]:
        return [".pdf"]

    def supports(self, file_path: str) -> bool:
        return file_path.lower().endswith('.pdf')

    def parse(self, file_path: str) -> Tuple[Optional[str], List[str]]:
        failures = []

        # Check that the file exists.
        if not os.path.exists(file_path):
            return None, ["文件不存在"]  # "File does not exist"

        # Validate the file format.
        if not is_valid_pdf(file_path):
            return None, ["不是有效的 PDF 文件"]  # "Not a valid PDF file"

        # Return the first successful result; record each failure along the way.
        for parser_name, parser_func in PARSERS:
            content, error = parser_func(file_path)
            if content is not None:
                return content, failures
            failures.append(f"- {parser_name}: {error}")

        return None, failures
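
The fallback loop in PdfReader.parse can be tried in isolation, without installing any PDF backend. The following is a minimal sketch under stated assumptions, not the project's own test code: failing_parser, working_parser, and parse_with_fallback are hypothetical names introduced here for illustration, and the only contract assumed is the one visible above, namely that each parser takes a file path and returns a (content, error) pair with content set to None on failure.

```python
from typing import Callable, List, Optional, Tuple

# Assumed contract: a parser returns (content, error); content is None on failure.
ParseResult = Tuple[Optional[str], Optional[str]]
Parser = Callable[[str], ParseResult]


def failing_parser(file_path: str) -> ParseResult:
    """Hypothetical stand-in for a backend whose dependency is missing."""
    return None, "library not installed"


def working_parser(file_path: str) -> ParseResult:
    """Hypothetical stand-in for a backend that succeeds."""
    return f"# Extracted from {file_path}", None


def parse_with_fallback(
    file_path: str, parsers: List[Tuple[str, Parser]]
) -> Tuple[Optional[str], List[str]]:
    """Try each parser in order and return the first non-None content.

    The second element lists the failures recorded before the first success,
    or every failure if no parser succeeded.
    """
    failures: List[str] = []
    for name, func in parsers:
        content, error = func(file_path)
        if content is not None:
            return content, failures
        failures.append(f"- {name}: {error}")
    return None, failures


demo_parsers = [("stub OCR", failing_parser), ("stub text", working_parser)]
content, failures = parse_with_fallback("report.pdf", demo_parsers)
print(content)
print(failures)
```

Because the failure list is returned even on success, a caller can tell which higher-priority backends were skipped and why. This can help when diagnosing a missing optional dependency.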