diff --git a/openspec/config.yaml b/openspec/config.yaml
index db8d25c..ada3a83 100644
--- a/openspec/config.yaml
+++ b/openspec/config.yaml
@@ -13,6 +13,7 @@ context: |
   - Code: module files of 150-300 lines; errors require custom exceptions + clear messages + location context
   - Project stage: not yet launched, no users; breaking changes need no migration notes
   - Git commits: Chinese only; format is "type: short description", where type is one of: feat (new feature)/fix (bug fix)/refactor (refactoring)/docs (documentation)/style (formatting)/test (tests)/chore (build/tooling); for multi-line messages, add the details after a blank line
+  - Questions: when asking the user a question, prefer the question tool over text options
   # Project overview
   - Goal: a unified document parsing tool that converts DOCX/XLSX/PPTX/PDF/HTML/URL to Markdown, intended for use by AI skills
   # Project directory structure
diff --git a/openspec/specs/multi-platform-dependencies/spec.md b/openspec/specs/multi-platform-dependencies/spec.md
index 28cb6f1..90b01ce 100644
--- a/openspec/specs/multi-platform-dependencies/spec.md
+++ b/openspec/specs/multi-platform-dependencies/spec.md
@@ -22,12 +22,42 @@
   - Python 3.12 must be used
   - `docling-parse` 5.x has no x86_64 wheel; 4.0.0 must be used
   - Provide a complete `uv run --python 3.12 --with "docling==2.40.0" --with "docling-parse==4.0.0" --with "numpy<2" ...` command example
+  - unstructured is unavailable on the Darwin-x86_64 platform and has been removed from the configuration
 
 #### Scenario: Run commands for each platform
 - **WHEN** the user reads SKILL.md
 - **THEN** the system must provide clear `uv run --with` command examples for each platform (Windows/macOS Intel/macOS ARM/Linux) and each document format
 - **AND** the commands must include all required dependency packages
 
+### Requirement: Dependency configuration structure
+The DEPENDENCIES configuration in config.py uses a dictionary structure, kept simple and direct so that it can be fine-tuned per platform.
+
+#### Scenario: Configuration data format is unchanged
+- **WHEN** code accesses config.DEPENDENCIES["pdf"]["default"]
+- **THEN** the returned data structure is unchanged
+- **AND** it contains the "python" and "dependencies" fields
+
+#### Scenario: All file types have a Darwin-x86_64 configuration
+- **WHEN** inspecting config.DEPENDENCIES
+- **THEN** pdf/docx/xlsx/pptx/xls/ppt each have a "Darwin-x86_64" platform configuration
+- **AND** the Darwin-x86_64 configurations contain no unstructured-related dependencies
+
+### Requirement: Dependency version management
+All dependencies must specify a version; the default platform uses the latest versions as of 2026-03-17, and the Darwin-x86_64 platform uses versions verified to work.
+
+#### Scenario: The default platform uses the latest versions
+- **WHEN** inspecting the dependencies of the default configurations in config.DEPENDENCIES
+- **THEN** every dependency has an explicit version
+- **AND** docling uses 2.80.0
+- **AND** docling-parse uses 5.5.0
+- **AND** markitdown uses 0.1.5
+
+#### Scenario: The Darwin-x86_64 platform uses verified versions
+- **WHEN** inspecting the dependencies of the Darwin-x86_64 configurations in config.DEPENDENCIES
+- **THEN** docling uses 2.40.0
+- **AND** docling-parse uses 4.0.0
+- **AND** numpy uses <2
+
 ### Requirement: Platform detection documentation
 The system must provide platform detection methods and platform-specific installation guides in `SKILL.md`.
 
diff --git a/openspec/specs/test-fixtures/spec.md b/openspec/specs/test-fixtures/spec.md
index 7d18c24..0670170 100644
--- a/openspec/specs/test-fixtures/spec.md
+++ b/openspec/specs/test-fixtures/spec.md
@@ -6,6 +6,23 @@
 
 ## Requirements
 
+### Requirement: The test runner includes fixture dependencies
+run_tests.py must define a TEST_FIXTURE_DEPENDENCIES constant containing all dependencies needed to create temporary test files.
+
+#### Scenario: The TEST_FIXTURE_DEPENDENCIES definition exists
+- **WHEN** inspecting run_tests.py
+- **THEN** a TEST_FIXTURE_DEPENDENCIES constant exists
+- **AND** it includes python-docx (for creating temporary DOCX files)
+- **AND** it includes reportlab (for creating temporary PDF files)
+- **AND** it includes pandas (for creating temporary XLSX files)
+- **AND** it includes openpyxl (required by pandas to write XLSX)
+- **AND** it includes python-pptx (for creating temporary PPTX files)
+
+#### Scenario: Fixture dependencies are merged with file-type dependencies
+- **WHEN** running tests of any type
+- **THEN** the dependencies in TEST_FIXTURE_DEPENDENCIES are automatically merged into the uv run --with arguments
+- **AND** duplicates are removed so that nothing is added twice
+
 ### Requirement: Temporary files are cleaned up automatically
 Temporary files used by tests MUST be cleaned up automatically after the tests complete, using pytest's tmp_path fixture.
 
diff --git a/openspec/specs/test-runner/spec.md b/openspec/specs/test-runner/spec.md
index 0a06f46..be53dd3 100644
--- a/openspec/specs/test-runner/spec.md
+++ b/openspec/specs/test-runner/spec.md
@@ -12,21 +12,26 @@
 #### Scenario: Run PDF tests
 - **WHEN** the user runs `python run_tests.py pdf`
 - **THEN** the dependencies in config.DEPENDENCIES["pdf"] are loaded automatically
+- **AND** the dependencies required by the test fixtures are loaded automatically
 - **AND** the tests under tests/test_readers/test_pdf/ are run
 
 #### Scenario: Run DOCX tests
 - **WHEN** the user runs `python run_tests.py docx`
 - **THEN** the dependencies in config.DEPENDENCIES["docx"] are loaded automatically
+- **AND** the dependencies required by the test fixtures are loaded automatically
 - **AND** the tests under tests/test_readers/test_docx/ are run
 
 #### Scenario: Run CLI tests (no special dependencies)
 - **WHEN** the user runs `python run_tests.py cli`
-- **THEN** only the pytest dependency is loaded
+- **THEN** the pytest dependency is loaded
+- **AND** the dependencies required by the test fixtures are loaded automatically
+- **AND** the dependencies of all types in config.DEPENDENCIES are loaded (deduplicated)
 - **AND** the tests under tests/test_cli/ are run
 
 #### Scenario: Run all tests
 - **WHEN** the user runs `python run_tests.py all`
 - **THEN** the dependencies of all types in config.DEPENDENCIES are loaded (deduplicated)
+- **AND** the dependencies required by the test fixtures are loaded automatically
 - **AND** all tests under tests/ are run
 
 ### Requirement: The test runner supports passing pytest arguments through
diff --git a/run_tests.py b/run_tests.py
index 7ae5776..586d4cf 100644
--- a/run_tests.py
+++ b/run_tests.py
@@ -23,6 +23,24 @@ os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
 os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
 os.environ["TQDM_DISABLE"] = "1"
 
+# Dependencies needed by test fixtures (used to create temporary test files)
+TEST_FIXTURE_DEPENDENCIES = {
+    "default": [
+        "python-docx==1.2.0",  # used to create temporary DOCX files
+        "reportlab==4.2.2",  # used to create temporary PDF files
+        "pandas==3.0.1",  # used to create temporary XLSX files
+        "openpyxl==3.1.5",  # required by pandas to write XLSX
+        "python-pptx==1.0.2",  # used to create temporary PPTX files
+    ],
+    "Darwin-x86_64": [
+        "python-docx==1.2.0",  # used to create temporary DOCX files
+        "reportlab==4.2.2",  # used to create temporary PDF files
+        "pandas<3.0.0",  # used to create temporary XLSX files (compatible with Darwin-x86_64)
+        "openpyxl==3.1.5",  # required by pandas to write XLSX
+        "python-pptx==1.0.2",  # used to create temporary PPTX files
+    ],
+}
+
 # Test type mapping
 _TEST_TYPES = {
     # File-type tests (with dependency configuration)
@@ -34,8 +52,8 @@ _TEST_TYPES = {
     "xls": {"key": "xls", "path": "tests/test_readers/test_xls/"},
     "doc": {"key": "doc", "path": "tests/test_readers/test_doc/"},
     "ppt": {"key": "ppt", "path": "tests/test_readers/test_ppt/"},
-    # Core tests (no special dependencies)
-    "cli": {"key": None, "path": "tests/test_cli/"},
+    # Core tests (the cli tests need all dependencies because they exercise multiple formats)
+    "cli": {"key": "all", "path": "tests/test_cli/"},
     "core": {"key": None, "path": "tests/test_core/"},
     "utils": {"key": None, "path": "tests/test_utils/"},
     # All tests (merge all dependencies)
@@ -43,9 +61,40 @@ _TEST_TYPES = {
 }
 
 
+def _collect_all_dependencies(platform_id: str):
+    """
+    Collect the dependencies of all file types and deduplicate them (internal helper).
+
+    Args:
+        platform_id: platform identifier
+
+    Returns:
+        A (python_version, dependencies) tuple
+    """
+    from config import DEPENDENCIES
+
+    python_version = None
+    all_deps = set()
+    for type_key, type_config in DEPENDENCIES.items():
+        # Try the platform-specific configuration first
+        if platform_id in type_config:
+            cfg = type_config[platform_id]
+        elif "default" in type_config:
+            cfg = type_config["default"]
+        else:
+            continue
+        # Record the python version (prefer one with a specific requirement)
+        if cfg.get("python") and not python_version:
+            python_version = cfg["python"]
+        # Collect dependencies
+        for dep in cfg.get("dependencies", []):
+            all_deps.add(dep)
+    return python_version, list(all_deps)
+
+
 def get_dependencies_for_type(test_type: str, platform_id: str):
     """
-    Get the dependency configuration for the given test type.
+    Get the dependency configuration for the given test type (taken entirely from config.py).
 
     Args:
         test_type: test type (pdf/docx/.../all)
@@ -63,30 +112,14 @@ def get_dependencies_for_type(test_type: str, platform_id: str):
     key = config["key"]
 
     if key is None:
-        # Test types with no special dependencies (cli/core/utils)
+        # core/utils tests need no special dependencies
         return None, []
 
     if key == "all":
-        # Collect the dependencies of all types and deduplicate them
-        python_version = None
-        all_deps = set()
-        for type_key, type_config in DEPENDENCIES.items():
-            # Try the platform-specific configuration first
-            if platform_id in type_config:
-                cfg = type_config[platform_id]
-            elif "default" in type_config:
-                cfg = type_config["default"]
-            else:
-                continue
-            # Record the python version (prefer one with a specific requirement)
-            if cfg.get("python"):
-                python_version = cfg["python"]
-            # Collect dependencies
-            for dep in cfg.get("dependencies", []):
-                all_deps.add(dep)
-        return python_version, list(all_deps)
+        # cli and all both use the collect-all-dependencies logic
+        return _collect_all_dependencies(platform_id)
 
-    # Dependencies for a single type
+    # Dependencies for a single type, taken entirely from config.py
     if key not in DEPENDENCIES:
         return None, []
 
@@ -101,11 +134,30 @@ def get_dependencies_for_type(test_type: str, platform_id: str):
     return cfg.get("python"), cfg.get("dependencies", [])
 
 
+def get_fixture_dependencies(platform_id: str):
+    """
+    Get the fixture dependencies for the given platform.
+
+    Args:
+        platform_id: platform identifier
+
+    Returns:
+        list: list of fixture dependencies
+    """
+    if platform_id in TEST_FIXTURE_DEPENDENCIES:
+        return TEST_FIXTURE_DEPENDENCIES[platform_id]
+    elif "default" in TEST_FIXTURE_DEPENDENCIES:
+        return TEST_FIXTURE_DEPENDENCIES["default"]
+    else:
+        return []
+
+
 def generate_uv_args(
     dependencies: list,
     test_path: str,
     pytest_args: list,
     python_version: str = None,
+    platform_id: str = None,
 ):
     """
     Generate the uv run command argument list (for subprocess.run).
@@ -115,6 +167,7 @@ def generate_uv_args(
         test_path: test path
         pytest_args: arguments passed through to pytest
         python_version: required python version; None means unspecified
+        platform_id: platform identifier, used to select the fixture dependencies
 
     Returns:
         The uv run command argument list
@@ -127,8 +180,18 @@ def generate_uv_args(
     # Add pytest
     args.extend(["--with", "pytest"])
 
-    # Add the other dependencies
+    # Get the fixture dependencies for the current platform
+    fixture_deps = get_fixture_dependencies(platform_id) if platform_id else []
+
+    # Merge the file-type dependencies and the fixture dependencies, deduplicated
+    all_deps = set()
     for dep in dependencies:
+        all_deps.add(dep)
+    for dep in fixture_deps:
+        all_deps.add(dep)
+
+    # Add all dependencies
+    for dep in sorted(all_deps):
         args.extend(["--with", dep])
 
     # Add the pytest command
@@ -205,6 +268,7 @@ def main():
         test_path=test_path,
         pytest_args=pytest_args,
         python_version=python_version,
+        platform_id=platform_id,
     )
 
     # Set environment variables
diff --git a/scripts/config.py b/scripts/config.py
index aa835e6..00e814d 100644
--- a/scripts/config.py
+++ b/scripts/config.py
@@ -24,13 +24,13 @@ class Config:
 DEPENDENCIES = {
     "pdf": {
         "default": {
-            "python": None,
+            "python": "3.12",
             "dependencies": [
-                "docling",
+                "docling==2.80.0",
                 "unstructured[pdf]",
-                "markitdown[pdf]",
-                "pypdf",
-                "markdownify"
+                "markitdown[pdf]==0.1.5",
+                "pypdf==6.9.0",
+                "markdownify==0.13.1"
             ]
         },
         "Darwin-x86_64": {
@@ -39,94 +39,22 @@ DEPENDENCIES = {
                 "docling==2.40.0",
                 "docling-parse==4.0.0",
                 "numpy<2",
-                "markitdown[pdf]",
-                "pypdf",
-                "markdownify"
+                "markitdown[pdf]==0.1.5",
+                "pypdf==6.9.0",
+                "markdownify==0.13.1"
             ]
         }
     },
     "docx": {
         "default": {
-            "python": None,
+            "python": "3.12",
             "dependencies": [
-                "docling",
+                "docling==2.80.0",
                 "unstructured[docx]",
-                "markitdown[docx]",
-                "pypandoc-binary",
-                "python-docx",
-                "markdownify"
-            ]
-        }
-    },
-    "xlsx": {
-        "default": {
-            "python": None,
-            "dependencies": [
-                "docling",
-                "unstructured[xlsx]",
-                "markitdown[xlsx]",
-                "pandas",
-                "tabulate"
-            ]
-        }
-    },
-    "pptx": {
-        "default": {
-            "python": None,
-            "dependencies": [
-                "docling",
-                "unstructured[pptx]",
-                "markitdown[pptx]",
-                "python-pptx",
-                "markdownify"
-            ]
-        }
-    },
-    "html": {
-        "default": {
-            "python": None,
-            "dependencies": [
-                "trafilatura",
-                "domscribe",
-                "markitdown",
-                "html2text",
-                "beautifulsoup4",
-                "httpx",
-                "chardet",
-                "pyppeteer",
-                "selenium"
-            ]
-        }
-    },
-    "xls": {
-        "default": {
-            "python": None,
-            "dependencies": [
-                "unstructured[xlsx]",
-                "markitdown[xls]",
-                "pandas",
-                "tabulate",
-                "xlrd",
-                "olefile"
-            ]
-        }
-    },
-    "doc": {
-        "default": {
-            "python": None,
-            "dependencies": []
-        }
-    },
-    "ppt": {
-        "default": {
-            "python": None,
-            "dependencies": [
-                "docling",
-                "unstructured[pptx]",
-                "markitdown[pptx]",
-                "python-pptx",
-                "markdownify",
-                "olefile"
+                "markitdown[docx]==0.1.5",
+                "pypandoc-binary==1.13",
+                "python-docx==1.2.0",
+                "markdownify==0.13.1"
             ]
         },
         "Darwin-x86_64": {
@@ -135,10 +63,129 @@ DEPENDENCIES = {
                 "docling==2.40.0",
                 "docling-parse==4.0.0",
                 "numpy<2",
-                "markitdown[pptx]",
-                "python-pptx",
-                "markdownify",
-                "olefile"
+                "markitdown[docx]==0.1.5",
+                "pypandoc-binary==1.13",
+                "python-docx==1.2.0",
+                "markdownify==0.13.1"
+            ]
+        }
+    },
+    "xlsx": {
+        "default": {
+            "python": "3.12",
+            "dependencies": [
+                "docling==2.80.0",
+                "unstructured[xlsx]",
+                "markitdown[xlsx]==0.1.5",
+                "pandas==3.0.1",
+                "tabulate==0.9.0",
+                "openpyxl==3.1.5"
+            ]
+        },
+        "Darwin-x86_64": {
+            "python": "3.12",
+            "dependencies": [
+                "docling==2.40.0",
+                "docling-parse==4.0.0",
+                "numpy<2",
+                "markitdown[xlsx]==0.1.5",
+                "pandas<3.0.0",
+                "tabulate==0.9.0",
+                "openpyxl==3.1.5"
+            ]
+        }
+    },
+    "pptx": {
+        "default": {
+            "python": "3.12",
+            "dependencies": [
+                "docling==2.80.0",
+                "unstructured[pptx]",
+                "markitdown[pptx]==0.1.5",
+                "python-pptx==1.0.2",
+                "markdownify==0.13.1"
+            ]
+        },
+        "Darwin-x86_64": {
+            "python": "3.12",
+            "dependencies": [
+                "docling==2.40.0",
+                "docling-parse==4.0.0",
+                "numpy<2",
+                "markitdown[pptx]==0.1.5",
+                "python-pptx==1.0.2",
+                "markdownify==0.13.1"
+            ]
+        }
+    },
+    "html": {
+        "default": {
+            "python": "3.12",
+            "dependencies": [
+                "trafilatura==1.12.2",
+                "domscribe",
+                "markitdown==0.1.5",
+                "html2text==2024.2.26",
+                "beautifulsoup4==4.14.3",
+                "httpx==0.28.1",
+                "chardet==5.2.0",
+                "pyppeteer==2.0.0",
+                "selenium==4.25.0"
+            ]
+        }
+    },
+    "xls": {
+        "default": {
+            "python": "3.12",
+            "dependencies": [
+                "unstructured[xlsx]",
+                "markitdown[xls]==0.1.5",
+                "pandas==3.0.1",
+                "tabulate==0.9.0",
+                "xlrd==2.0.1",
+                "olefile==0.47"
+            ]
+        },
+        "Darwin-x86_64": {
+            "python": "3.12",
+            "dependencies": [
+                "markitdown[xls]==0.1.5",
+                "pandas<3.0.0",
+                "tabulate==0.9.0",
+                "xlrd==2.0.1",
+                "olefile==0.47",
+                "openpyxl==3.1.5"
+            ]
+        }
+    },
+    "doc": {
+        "default": {
+            "python": "3.12",
+            "dependencies": []
+        }
+    },
+    "ppt": {
+        "default": {
+            "python": "3.12",
+            "dependencies": [
+                "docling==2.80.0",
+                "unstructured[pptx]",
+                "markitdown[pptx]==0.1.5",
+                "python-pptx==1.0.2",
+                "markdownify==0.13.1",
+                "olefile==0.47"
+            ]
+        },
+        "Darwin-x86_64": {
+            "python": "3.12",
+            "dependencies": [
+                "docling==2.40.0",
+                "docling-parse==4.0.0",
+                "numpy<2",
+                "markitdown[pptx]==0.1.5",
+                "python-pptx==1.0.2",
+                "markdownify==0.13.1",
+                "olefile==0.47"
             ]
         }
     }
diff --git a/tests/conftest.py b/tests/conftest.py
index 780c563..523d075 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -105,11 +105,29 @@ def temp_pdf(tmp_path):
     c = canvas.Canvas(str(file_path), pagesize=letter)
 
     # Try to register a Chinese font (if available)
+    font_loaded = False
     try:
-        # Use a system font
-        pdfmetrics.registerFont(TTFont('SimSun', 'simsun.ttc'))
-        c.setFont('SimSun', 12)
+        # Try macOS Chinese fonts
+        for font_name, font_path, font_index in [
+            ('PingFangSC', '/System/Library/Fonts/PingFang.ttc', 0),
+            ('STHeiti', '/System/Library/Fonts/STHeiti Light.ttc', 0),
+            ('STHeitiMedium', '/System/Library/Fonts/STHeiti Medium.ttc', 0),
+        ]:
+            try:
+                from reportlab.pdfbase.ttfonts import TTFont
+                import os
+                if os.path.exists(font_path):
+                    # For TTC files, we need to specify the font index
+                    pdfmetrics.registerFont(TTFont(font_name, font_path, subfontIndex=font_index))
+                    c.setFont(font_name, 12)
+                    font_loaded = True
+                    break
+            except Exception as e:
+                continue
     except Exception:
+        pass
+
+    if not font_loaded:
         # Fall back to the default font
         c.setFont('Helvetica', 12)
 
diff --git a/tests/test_core/test_advice_generator.py b/tests/test_core/test_advice_generator.py
index 5f875c7..7e207b7 100644
--- a/tests/test_core/test_advice_generator.py
+++ b/tests/test_core/test_advice_generator.py
@@ -68,21 +68,24 @@ class TestGetDependencies:
     def test_get_default_dependencies(self):
         """Test getting the default dependency configuration."""
         python_ver, deps = get_dependencies(DocxReader, "Unknown-Platform")
-        assert python_ver is None
+        assert python_ver == "3.12"
         assert len(deps) > 0
-        assert "docling" in deps
+        # Check for a docling-related dependency (may carry a version specifier)
+        assert any(dep.startswith("docling") for dep in deps)
 
     def test_get_pdf_dependencies(self):
         """Test getting the PDF dependencies."""
         python_ver, deps = get_dependencies(PdfReader, "Darwin-arm64")
-        assert python_ver is None
-        assert "docling" in deps
+        assert python_ver == "3.12"
+        # Check for a docling-related dependency (may carry a version specifier)
+        assert any(dep.startswith("docling") for dep in deps)
 
     def test_get_html_dependencies(self):
         """Test getting the HTML dependencies."""
         python_ver, deps = get_dependencies(HtmlReader, "Linux-x86_64")
-        assert python_ver is None
-        assert "trafilatura" in deps
+        assert python_ver == "3.12"
+        # Check for a trafilatura-related dependency (may carry a version specifier)
+        assert any(dep.startswith("trafilatura") for dep in deps)
 
 
 class TestGenerateUvCommand:
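
Reviewer note: the platform fallback and the merge-and-deduplicate step that this change adds to `run_tests.py` can be sketched in isolation as below. This is a minimal sketch under stated assumptions, not the project's code: `TYPE_DEPS`, `FIXTURE_DEPS`, `pick_for_platform`, and `build_with_args` are hypothetical names, and the two tables are trimmed-down stand-ins for one entry of `config.DEPENDENCIES` and for `TEST_FIXTURE_DEPENDENCIES`.

```python
# Hypothetical, trimmed-down stand-ins for one config.DEPENDENCIES entry and
# for TEST_FIXTURE_DEPENDENCIES; the real tables in the diff are larger.
TYPE_DEPS = {
    "default": ["markitdown[xlsx]==0.1.5", "pandas==3.0.1", "openpyxl==3.1.5"],
    "Darwin-x86_64": ["markitdown[xlsx]==0.1.5", "pandas<3.0.0", "openpyxl==3.1.5"],
}
FIXTURE_DEPS = {
    "default": ["python-docx==1.2.0", "pandas==3.0.1", "openpyxl==3.1.5"],
    "Darwin-x86_64": ["python-docx==1.2.0", "pandas<3.0.0", "openpyxl==3.1.5"],
}


def pick_for_platform(table: dict, platform_id: str) -> list:
    """Return the platform-specific entry, else "default", else an empty list."""
    return table.get(platform_id, table.get("default", []))


def build_with_args(platform_id: str) -> list:
    """Merge type and fixture dependencies, deduplicate, and emit --with pairs."""
    merged = set(pick_for_platform(TYPE_DEPS, platform_id))
    merged |= set(pick_for_platform(FIXTURE_DEPS, platform_id))
    args = []
    for dep in sorted(merged):  # sorted so the command line is stable across runs
        args.extend(["--with", dep])
    return args


if __name__ == "__main__":
    # An unknown platform falls back to the "default" tables.
    print(build_with_args("Linux-x86_64"))
```

Deduplication here works on exact requirement strings, so `pandas==3.0.1` and `pandas<3.0.0` would both survive if they ever appeared together for one platform. The tables in this diff keep the two sources consistent per platform, which is what makes the set-based merge sufficient.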