lyxy-document/scripts/readers/pdf/unstructured_ocr.py
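lanyuanxiaoyao 1aea561277 refactor: move Reader internal utility functions into a separate module
- Add scripts/readers/_utils.py as the shared internal utility module for Readers
- Move parse_with_markitdown and related functions from core/markdown.py to _utils.py
- Rename functions: parse_with_xxx → parse_via_xxx, _unstructured_elements_to_markdown → convert_unstructured_to_markdown
- Update the import paths in 17 Reader implementation files
- Remove the exports of the migrated functions from core/__init__.py
- Add the test file tests/test_readers/test_utils.py
- Add the spec document openspec/specs/reader-internal-utils/spec.md

This refactor makes the module boundaries explicit: core/ provides the public API, and readers/_utils.py provides utilities internal to the Readers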
2026-03-09 00:56:05 +08:00
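
The rename rules in the commit message can be applied mechanically. The following is a minimal sketch, not the tooling the author used: the helper name `apply_renames` is hypothetical, and it rewrites identifiers only, so module paths in import statements would still need to be updated separately.

```python
import re

# Rename rules taken from the commit message:
#   parse_with_xxx                      -> parse_via_xxx
#   _unstructured_elements_to_markdown  -> convert_unstructured_to_markdown
_RENAMES = [
    (re.compile(r"\bparse_with_(\w+)"), r"parse_via_\1"),
    (
        re.compile(r"\b_unstructured_elements_to_markdown\b"),
        "convert_unstructured_to_markdown",
    ),
]


def apply_renames(source: str) -> str:
    """Return `source` with the old function names replaced by the new ones.

    Hypothetical helper: it only rewrites identifiers and leaves module
    paths (for example core/markdown.py vs. readers/_utils.py) untouched.
    """
    for pattern, replacement in _RENAMES:
        source = pattern.sub(replacement, source)
    return source
```

For example, `apply_renames("content = parse_with_markitdown(path)")` yields `"content = parse_via_markitdown(path)"`. The `\b` anchor keeps longer names such as `my_parse_with_x` from being rewritten by accident.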

35 lines
1.2 KiB
Python

"""Parse PDF files with the unstructured library (hi_res strategy + PaddleOCR)."""

from typing import Optional, Tuple

from scripts.readers._utils import convert_unstructured_to_markdown


def parse(file_path: str) -> Tuple[Optional[str], Optional[str]]:
    """Parse a PDF file with the unstructured library (hi_res strategy + PaddleOCR).

    Returns (content, None) on success or (None, error_message) on failure.
    """
    # Import lazily so a missing optional dependency becomes an error message.
    try:
        from unstructured.partition.pdf import partition_pdf
    except ImportError:
        return None, "unstructured library is not installed"

    try:
        from unstructured.partition.utils.constants import OCR_AGENT_PADDLE
    except ImportError:
        return None, "unstructured-paddleocr library is not installed"

    try:
        elements = partition_pdf(
            filename=file_path,
            infer_table_structure=True,
            strategy="hi_res",
            languages=["chi_sim"],
            ocr_agent=OCR_AGENT_PADDLE,
            table_ocr_agent=OCR_AGENT_PADDLE,
        )
        content = convert_unstructured_to_markdown(elements, trust_titles=True)
        if not content.strip():
            return None, "document is empty"
        return content, None
    except Exception as e:
        return None, f"unstructured OCR parsing failed: {e}"