lyxy-document/scripts/readers/pdf/unstructured.py
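lanyuanxiaoyao 9daff73589 refactor: adjust module import paths and simplify the reference structure
- Update the git task notes in openspec/config.yaml
- Change scripts.core.* to core.* and scripts.readers.* to readers.*
- Improve how sys.path is set up in lyxy_document_reader.py
- Update the import paths in all test files to match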
2026-03-09 15:44:51 +08:00

"""使用 unstructured 库解析 PDF 文件fast 策略)"""
from typing import Optional, Tuple
from readers._utils import convert_unstructured_to_markdown
def parse(file_path: str) -> Tuple[Optional[str], Optional[str]]:
"""使用 unstructured 库解析 PDF 文件fast 策略)"""
try:
from unstructured.partition.pdf import partition_pdf
except ImportError:
return None, "unstructured 库未安装"
try:
elements = partition_pdf(
filename=file_path,
infer_table_structure=True,
strategy="fast",
languages=["chi_sim"],
)
# fast 策略不做版面分析Title 类型标注不可靠
content = convert_unstructured_to_markdown(elements, trust_titles=False)
if not content.strip():
return None, "文档为空"
return content, None
except Exception as e:
return None, f"unstructured 解析失败: {str(e)}"
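
The reader above reports its outcome as a (content, error) pair rather than raising. A minimal sketch of how a caller might consume that convention follows, trying several readers in order and keeping the first one that succeeds. The helper `first_successful` and the two stand-in parsers are hypothetical names introduced only for illustration; they are not part of the repository, and the real dispatch logic in `lyxy_document_reader.py` may differ.

```python
from typing import Callable, List, Optional, Tuple

# Same shape as the return type of parse(): (content, error)
ParseResult = Tuple[Optional[str], Optional[str]]
Parser = Callable[[str], ParseResult]


def first_successful(file_path: str, parsers: List[Parser]) -> ParseResult:
    """Try each parser in order and return the first one that yields content.

    Hypothetical helper: if every parser fails, the error messages are joined
    so the caller can see why each attempt was rejected.
    """
    errors: List[str] = []
    for parser in parsers:
        content, error = parser(file_path)
        if content is not None:
            return content, None
        errors.append(error or "unknown error")
    return None, "; ".join(errors)


# Hypothetical stand-ins for real readers, so the sketch runs without
# the unstructured library or an actual PDF on disk.
def missing_library(file_path: str) -> ParseResult:
    return None, "unstructured library is not installed"


def working_parser(file_path: str) -> ParseResult:
    return f"# Parsed {file_path}", None


content, error = first_successful("sample.pdf", [missing_library, working_parser])
```

Returning errors as values lets a caller fall through to another reader without wrapping each call in its own try/except, which is presumably why the reader catches `ImportError` and general exceptions itself.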