Files
lyxy-document/scripts/readers/pdf/docling_ocr.py
lanyuanxiaoyao 15b63800a8 refactor: 将核心代码迁移到 scripts 目录
- 创建 scripts/ 目录作为核心代码根目录
- 移动 core/, readers/, utils/ 到 scripts/ 下
- 移动 config.py, lyxy_document_reader.py 到 scripts/
- 移动 encoding_detection.py 到 scripts/utils/
- 更新 pyproject.toml 中的入口点路径和 pytest 配置
- 更新所有内部导入语句为 scripts.* 模块
- 更新 README.md 目录结构说明
- 更新 openspec/config.yaml 添加目录结构说明
- 删除无用的 main.py

此变更使项目结构更清晰,便于区分核心代码与测试、文档等支撑文件。
2026-03-08 17:41:03 +08:00

22 lines
731 B
Python
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""使用 docling 库解析 PDF 文件(启用 OCR"""
from typing import Optional, Tuple
def parse(file_path: str) -> Tuple[Optional[str], Optional[str]]:
"""使用 docling 库解析 PDF 文件(启用 OCR"""
try:
from docling.document_converter import DocumentConverter
except ImportError:
return None, "docling 库未安装"
try:
converter = DocumentConverter()
result = converter.convert(file_path)
markdown_content = result.document.export_to_markdown()
if not markdown_content.strip():
return None, "文档为空"
return markdown_content, None
except Exception as e:
return None, f"docling OCR 解析失败: {str(e)}"