Files
lyxy-document/scripts/config.py
lanyuanxiaoyao e0c6ed1638 feat: 新增 .doc 格式支持,借助 LibreOffice soffice
- 提取 LibreOffice 解析逻辑为公共工具函数 _utils.parse_via_libreoffice()
- 新增 DocReader 独立 Reader,支持 .doc 格式
- 新增 is_valid_doc() 文件验证函数(复用 OLE2 检测)
- 新增 doc 格式依赖配置(独立配置)
- 新增完整的测试套件,使用静态测试文件
- 更新 README.md 和 SKILL.md,添加 .doc 格式支持说明
- 新增 openspec/specs/doc-reader/spec.md 规范文档

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 10:40:43 +08:00

121 lines
2.9 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""统一配置类,集中管理所有配置项。"""
class Config:
"""统一配置类"""
# 编码检测
# 回退编码列表,当 chardet 检测失败时依次尝试
FALLBACK_ENCODINGS = ['utf-8', 'gbk', 'gb2312', 'latin-1']
# HTML 下载
# 下载超时时间(秒)
DOWNLOAD_TIMEOUT = 30
# HTTP User-Agent 标识
USER_AGENT = "lyxy-document/0.1.0"
# 日志
# 日志等级,默认只输出 ERROR 级别避免干扰 Markdown 输出
LOG_LEVEL = "ERROR"
# 依赖配置:按文件类型和平台组织
# 每个平台配置包含 python 版本要求None 表示使用默认)和依赖列表
DEPENDENCIES = {
"pdf": {
"default": {
"python": None,
"dependencies": [
"docling",
"unstructured[pdf]",
"markitdown[pdf]",
"pypdf",
"markdownify"
]
},
"Darwin-x86_64": {
"python": "3.12",
"dependencies": [
"docling==2.40.0",
"docling-parse==4.0.0",
"numpy<2",
"markitdown[pdf]",
"pypdf",
"markdownify"
]
}
},
"docx": {
"default": {
"python": None,
"dependencies": [
"docling",
"unstructured[docx]",
"markitdown[docx]",
"pypandoc-binary",
"python-docx",
"markdownify"
]
}
},
"xlsx": {
"default": {
"python": None,
"dependencies": [
"docling",
"unstructured[xlsx]",
"markitdown[xlsx]",
"pandas",
"tabulate"
]
}
},
"pptx": {
"default": {
"python": None,
"dependencies": [
"docling",
"unstructured[pptx]",
"markitdown[pptx]",
"python-pptx",
"markdownify"
]
}
},
"html": {
"default": {
"python": None,
"dependencies": [
"trafilatura",
"domscribe",
"markitdown",
"html2text",
"beautifulsoup4",
"httpx",
"chardet",
"pyppeteer",
"selenium"
]
}
},
"xls": {
"default": {
"python": None,
"dependencies": [
"unstructured[xlsx]",
"markitdown[xls]",
"pandas",
"tabulate",
"xlrd",
"olefile"
]
}
},
"doc": {
"default": {
"python": None,
"dependencies": []
}
}
}