lyxy-document/scripts/config.py
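lanyuanxiaoyao e53e64d386 refactor: streamline the chardet dependency configuration, keeping it only in the HTML reader
- Remove chardet from the dependency lists of the pdf/docx/xlsx/pptx readers
- Keep chardet in the html reader's dependency configuration (the only place it is actually used)
- Update the README.md documentation to drop the unnecessary chardet dependency notes
- Simplify the test commands by removing chardet from the non-HTML reader tests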
2026-03-10 12:44:35 +08:00

102 lines
2.5 KiB
Python

"""Unified configuration class that centralizes all configuration items."""


class Config:
    """Unified configuration class."""

    # Encoding detection
    # Fallback encodings, tried in order when chardet detection fails
    FALLBACK_ENCODINGS = ['utf-8', 'gbk', 'gb2312', 'latin-1']

    # HTML download
    # Download timeout in seconds
    DOWNLOAD_TIMEOUT = 30
    # HTTP User-Agent string
    USER_AGENT = "lyxy-document/0.1.0"

    # Logging
    # Log level; defaults to ERROR only so logs do not interfere with the Markdown output
    LOG_LEVEL = "ERROR"

    # Dependency configuration, organized by file type and platform
    # Each platform entry holds a Python version requirement (None means use the default)
    # and a dependency list
    DEPENDENCIES = {
        "pdf": {
            "default": {
                "python": None,
                "dependencies": [
                    "docling",
                    "unstructured[pdf]",
                    "markitdown[pdf]",
                    "pypdf",
                    "markdownify"
                ]
            },
            "Darwin-x86_64": {
                "python": "3.12",
                "dependencies": [
                    "docling==2.40.0",
                    "docling-parse==4.0.0",
                    "numpy<2",
                    "markitdown[pdf]",
                    "pypdf",
                    "markdownify"
                ]
            }
        },
        "docx": {
            "default": {
                "python": None,
                "dependencies": [
                    "docling",
                    "unstructured[docx]",
                    "markitdown[docx]",
                    "pypandoc-binary",
                    "python-docx",
                    "markdownify"
                ]
            }
        },
        "xlsx": {
            "default": {
                "python": None,
                "dependencies": [
                    "docling",
                    "unstructured[xlsx]",
                    "markitdown[xlsx]",
                    "pandas",
                    "tabulate"
                ]
            }
        },
        "pptx": {
            "default": {
                "python": None,
                "dependencies": [
                    "docling",
                    "unstructured[pptx]",
                    "markitdown[pptx]",
                    "python-pptx",
                    "markdownify"
                ]
            }
        },
        "html": {
            "default": {
                "python": None,
                "dependencies": [
                    "trafilatura",
                    "domscribe",
                    "markitdown",
                    "html2text",
                    "beautifulsoup4",
                    "httpx",
                    "chardet",
                    "pyppeteer",
                    "selenium"
                ]
            }
        }
    }
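
The file above only declares the settings; how the rest of the project reads them is not shown here. As a rough sketch of one way they could be consumed, the snippet below looks up a platform-specific dependency entry with a fallback to "default", and tries the fallback encodings in order. The helper names `resolve_dependencies` and `decode_with_fallback` are hypothetical, the `Config` stand-in is a trimmed copy so the example runs on its own, and building the platform key from `platform.system()` and `platform.machine()` is an assumption based on the "Darwin-x86_64" key.

```python
import platform


# Trimmed stand-in for the Config class above, so this sketch is self-contained.
class Config:
    FALLBACK_ENCODINGS = ["utf-8", "gbk", "gb2312", "latin-1"]
    DEPENDENCIES = {
        "pdf": {
            "default": {"python": None, "dependencies": ["docling", "pypdf"]},
            "Darwin-x86_64": {
                "python": "3.12",
                "dependencies": ["docling==2.40.0", "numpy<2"],
            },
        },
        "html": {
            "default": {"python": None, "dependencies": ["trafilatura", "chardet"]},
        },
    }


def resolve_dependencies(file_type, platform_key=None):
    """Hypothetical helper: return (python_version, dependencies) for a file type.

    Assumes platform keys have the form "<system>-<machine>"; falls back to the
    "default" entry when no platform-specific entry exists.
    """
    if platform_key is None:
        platform_key = f"{platform.system()}-{platform.machine()}"
    per_type = Config.DEPENDENCIES[file_type]
    entry = per_type.get(platform_key, per_type["default"])
    return entry["python"], list(entry["dependencies"])


def decode_with_fallback(data):
    """Hypothetical helper: try each fallback encoding in order.

    Returns (text, encoding). Because latin-1 accepts any byte sequence, the
    final raise is only reachable if the list is changed to omit it.
    """
    for encoding in Config.FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the fallback encodings could decode the data")
```

With this layout, a file type that has no platform-specific entry (such as "html") resolves to its "default" entry on every platform, while "pdf" on an Intel Mac would pick up the pinned versions and the Python 3.12 requirement.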