docs: restructure README.md and SKILL.md to clarify each document's role

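- README.md targets developers: adds a project overview, core concepts, and a development guide
- SKILL.md targets AI: reinforces --advice as the preferred approach and spells out the priority of the three execution paths
- Update specs: skill-documentation and uv-with-dependency-management
- README adds reportlab as a test dependency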
2026-03-09 21:56:58 +08:00
parent 9abc0a0707
commit 25d748aa17
4 changed files with 211 additions and 209 deletions
--- a/README.md
+++ b/README.md
@@ -2,76 +2,97 @@

 Unified document parsing tool - converts DOCX, XLSX, PPTX, PDF, and HTML/URL to Markdown

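+## Project Overview
+
+A unified document parsing tool for AI Skills. It parses multiple document formats into Markdown and provides full-text output, word count, heading extraction, content search, and related features.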
+
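 ## Development Environment

 - Use uv to run scripts and tests; do not use the host Python
 - Dependency management: use `uv run --with` to load dependencies on demand
-- Quick advice: use the `-a/--advice` flag to see the command to run, with no need to look up dependencies by hand
+- Quick advice: use the `-a/--advice` flag to see the command to run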

-## Project Structure
+## Project Architecture

 ```
-scripts/                    # Core code
-├── core/                   # Core modules
-│   ├── advice_generator.py # Execution advice generator (new)
-│   ├── parser.py           # Parse dispatch
-│   ├── exceptions.py       # Exception definitions
-│   └── markdown.py         # Markdown utilities
-├── readers/                # Format readers
-├── utils/                  # Utility functions
-└── config.py               # Configuration (includes the DEPENDENCIES config)
-tests/                      # Tests
-openspec/                   # Spec documents
-skill/                      # SKILL documents
+scripts/
+├── lyxy_document_reader.py    # CLI entry point
+├── config.py                   # Configuration (includes the DEPENDENCIES config)
+├── core/                       # Core modules
+│   ├── parser.py              # Parse dispatch
+│   ├── advice_generator.py    # --advice execution advice generator
+│   ├── markdown.py            # Markdown utilities
+│   └── exceptions.py          # Exception definitions
+├── readers/                    # Format readers
+│   ├── base.py                # Reader base class
+│   ├── docx/                  # DOCX parsers
+│   ├── xlsx/                  # XLSX parsers
+│   ├── pptx/                  # PPTX parsers
+│   ├── pdf/                   # PDF parsers
+│   └── html/                  # HTML/URL parsers
+└── utils/                      # Utility functions
+    ├── file_detection.py      # File detection
+    └── encoding_detection.py  # Encoding detection
+
+tests/                           # Test suite
+openspec/                        # OpenSpec spec documents
+README.md                        # This document (developer documentation)
+SKILL.md                         # AI Skill document
 ```

-## Development Workflow
+## Core Concepts

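-Run tests and development tools with `uv run --with`:
+### Reader Mechanism

+Each document format has its own Reader package containing several parser implementations. The Reader base class defines the `supports()` and `parse()` methods; parsers are tried in order, and the first successful result is returned.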
+
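The try-in-order behaviour of the Reader mechanism could look roughly like the sketch below. Only the method names `supports()` and `parse()` come from this README; the `BaseReader` signatures, the two example readers, and `parse_with_fallback` are assumptions made for illustration, not the project's actual code.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence


class BaseReader(ABC):
    """Stand-in for the project's Reader base class (signatures are assumed)."""

    @abstractmethod
    def supports(self, file_path: str) -> bool:
        """Return True if this parser can handle the file."""

    @abstractmethod
    def parse(self, file_path: str) -> Optional[str]:
        """Return Markdown text, or return None / raise if parsing fails."""


class BrokenReader(BaseReader):
    """Hypothetical parser whose backend library is unavailable."""

    def supports(self, file_path: str) -> bool:
        return file_path.endswith(".docx")

    def parse(self, file_path: str) -> Optional[str]:
        raise RuntimeError("backend not installed")


class StubReader(BaseReader):
    """Hypothetical parser that always succeeds."""

    def supports(self, file_path: str) -> bool:
        return file_path.endswith(".docx")

    def parse(self, file_path: str) -> Optional[str]:
        return f"# Parsed {file_path}"


def parse_with_fallback(file_path: str, readers: Sequence[BaseReader]) -> str:
    """Try each reader in order and return the first non-empty result."""
    failures = []
    for reader in readers:
        name = type(reader).__name__
        if not reader.supports(file_path):
            failures.append(f"{name}: unsupported")
            continue
        try:
            result = reader.parse(file_path)
        except Exception as exc:  # a failed parser must not stop the chain
            failures.append(f"{name}: {exc}")
            continue
        if result:
            return result
        failures.append(f"{name}: empty result")
    raise RuntimeError("all parsers failed: " + "; ".join(failures))


print(parse_with_fallback("report.docx", [BrokenReader(), StubReader()]))
```

Collecting the per-parser failure reasons means that, when every parser fails, the final error can say why each one was skipped.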
+### Dependency Configuration (config.DEPENDENCIES)
+
+Dependencies are organized by file type and platform:
+
+```python
+DEPENDENCIES = {
+    "pdf": {
+        "default": {
+            "python": None,
+            "dependencies": ["docling", "unstructured[pdf]", ...]
+        },
+        "Darwin-x86_64": {
+            "python": "3.12",
+            "dependencies": ["docling==2.40.0", ...]
+        }
+    },
+    ...
+}
+```
+
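+### How --advice Is Generated
+
+The `--advice` flag identifies the file type from its extension, detects the current platform, reads the matching entry from `config.DEPENDENCIES`, and generates the `uv run --with` and `pip install` commands.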
+
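A minimal sketch of that lookup, under stated assumptions: the platform key is assumed to be `platform.system()` joined to `platform.machine()` (matching the `Darwin-x86_64` key above), the dependency lists are shortened placeholders, and `build_advice` is a hypothetical helper rather than the function in `advice_generator.py`.

```python
import platform
import shlex
from pathlib import Path

# Placeholder config in the README's shape; the real lists are longer.
DEPENDENCIES = {
    "pdf": {
        "default": {"python": None, "dependencies": ["docling", "unstructured[pdf]"]},
        "Darwin-x86_64": {"python": "3.12", "dependencies": ["docling==2.40.0"]},
    },
}


def build_advice(file_path, script="scripts/lyxy_document_reader.py", platform_key=None):
    """Return the suggested uv and pip commands for one file (illustrative only)."""
    # 1. Identify the file type from the extension.
    file_type = Path(file_path).suffix.lower().lstrip(".")
    if file_type not in DEPENDENCIES:
        raise ValueError(f"unsupported file type: {file_type!r}")

    # 2. Detect the platform (assumed key format: "<system>-<machine>").
    if platform_key is None:
        platform_key = f"{platform.system()}-{platform.machine()}"

    # 3. Read the platform-specific entry, falling back to "default".
    per_type = DEPENDENCIES[file_type]
    entry = per_type.get(platform_key, per_type["default"])

    # 4. Assemble both commands; shlex.join quotes extras such as "[pdf]".
    uv_parts = ["uv", "run"]
    if entry["python"]:
        uv_parts += ["--python", entry["python"]]
    for dep in entry["dependencies"]:
        uv_parts += ["--with", dep]
    uv_parts += [script, file_path]
    pip_parts = ["pip", "install", *entry["dependencies"]]
    return {"uv": shlex.join(uv_parts), "pip": shlex.join(pip_parts)}


advice = build_advice("report.pdf", platform_key="Darwin-x86_64")
print(advice["uv"])
print(advice["pip"])
```

Passing `platform_key` explicitly keeps the sketch testable on any machine; leaving it as `None` uses the current platform.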
+## Development Guide
+
+### Adding a New Reader
+
+1. Create a new directory under `scripts/readers/`
+2. Subclass `BaseReader` and implement `supports()` and `parse()`
+3. Register it in `scripts/readers/__init__.py`
+4. Add its dependency configuration to `config.DEPENDENCIES`
+
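Steps 2 to 4 might look like the following for a plain-text format. This is a sketch, not the project's code: the `BaseReader` stand-in, the `TxtReader` class, the `READERS` registry list, and the `txt` config entry are all hypothetical, and the real registration in `scripts/readers/__init__.py` may work differently.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseReader(ABC):
    """Stand-in for the project's base class (signatures are assumed)."""

    @abstractmethod
    def supports(self, file_path: str) -> bool:
        """Return True if this reader handles the file."""

    @abstractmethod
    def parse(self, file_path: str) -> str:
        """Return the file content as Markdown."""


# Step 2: subclass BaseReader and implement supports() and parse().
class TxtReader(BaseReader):
    def supports(self, file_path: str) -> bool:
        return Path(file_path).suffix.lower() == ".txt"

    def parse(self, file_path: str) -> str:
        return Path(file_path).read_text(encoding="utf-8")


# Step 3: register the reader (hypothetical registry list).
READERS = [TxtReader()]

# Step 4: add a dependency entry in the config.DEPENDENCIES shape.
# This reader only needs the standard library, so the list is empty.
DEPENDENCIES = {"txt": {"default": {"python": None, "dependencies": []}}}

# Quick check against a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "note.txt")
    Path(sample).write_text("# Hello", encoding="utf-8")
    reader = next(r for r in READERS if r.supports(sample))
    print(reader.parse(sample))
```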
+### Testing
+
+The project includes a full test suite covering the CLI and all Reader implementations. Use the `uv run --with` command that matches the type of test.
+
+#### Run all tests
 ```bash
-# Run tests (pytest must be installed first)
 uv run \
  --with pytest \
  --with pytest-cov \
  --with chardet \
  pytest
-
-# Run tests with a coverage report
-uv run \
-  --with pytest \
-  --with pytest-cov \
-  --with chardet \
-  pytest --cov=scripts --cov-report=term-missing
-
-# Run a specific test file
-uv run \
-  --with pytest \
-  --with chardet \
-  pytest tests/test_readers/test_docx/
-
-# Run a specific test class or method
-uv run \
-  --with pytest \
-  --with chardet \
-  pytest tests/test_cli/test_main.py::TestCLIDefaultOutput::test_default_output_docx
-
-# Format code
-uv run \
-  --with black \
-  --with isort \
-  --with chardet \
-  bash -c "black . && isort ."
-
-# Type checking
-uv run \
-  --with mypy \
-  --with chardet \
-  mypy .
 ```

-**Test the DOCX reader**:
-
+#### Test the DOCX reader
 ```bash
 uv run \
  --with pytest \
@@ -85,8 +106,33 @@ uv run \
  pytest tests/test_readers/test_docx/
 ```

-**Test the PDF reader**:
+#### Test the XLSX reader
+```bash
+uv run \
+  --with pytest \
+  --with docling \
+  --with "unstructured[xlsx]" \
+  --with "markitdown[xlsx]" \
+  --with pandas \
+  --with tabulate \
+  --with chardet \
+  pytest tests/test_readers/test_xlsx/
+```

+#### Test the PPTX reader
+```bash
+uv run \
+  --with pytest \
+  --with docling \
+  --with "unstructured[pptx]" \
+  --with "markitdown[pptx]" \
+  --with python-pptx \
+  --with markdownify \
+  --with chardet \
+  pytest tests/test_readers/test_pptx/
+```
+
+#### Test the PDF reader
 ```bash
 # Default command (macOS ARM, Linux, Windows)
 uv run \
@@ -97,6 +143,7 @@ uv run \
  --with pypdf \
  --with markdownify \
  --with chardet \
+  --with reportlab \
  pytest tests/test_readers/test_pdf/

 # Special command for macOS x86_64 (Intel)
@@ -110,35 +157,12 @@ uv run \
  --with pypdf \
  --with markdownify \
  --with chardet \
+  --with reportlab \
  pytest tests/test_readers/test_pdf/
 ```

-**Test other formats**:
-
+#### Test the HTML reader
 ```bash
-# XLSX reader
-uv run \
-  --with pytest \
-  --with docling \
-  --with "unstructured[xlsx]" \
-  --with "markitdown[xlsx]" \
-  --with pandas \
-  --with tabulate \
-  --with chardet \
-  pytest tests/test_readers/test_xlsx/
-
-# PPTX reader
-uv run \
-  --with pytest \
-  --with docling \
-  --with "unstructured[pptx]" \
-  --with "markitdown[pptx]" \
-  --with python-pptx \
-  --with markdownify \
-  --with chardet \
-  pytest tests/test_readers/test_pptx/
-
-# HTML reader
 uv run \
  --with pytest \
  --with trafilatura \
@@ -151,74 +175,43 @@ uv run \
  pytest tests/test_readers/test_html/
 ```

-## Testing
+#### Run a specific test file or method
+```bash
+# Run a specific test file
+uv run \
+  --with pytest \
+  --with chardet \
+  pytest tests/test_cli/test_main.py

-The project includes a full test suite covering the CLI and all Reader implementations:
+# Run a specific test class or method
+uv run \
+  --with pytest \
+  --with docling \
+  --with chardet \
+  pytest tests/test_cli/test_main.py::TestCLIDefaultOutput::test_default_output_docx
+```

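-- **Test coverage**: 69%
-- **Number of tests**: 193
-- **Test types**:
-  - CLI feature tests (word count, line count, heading extraction, search, etc.)
-  - Reader parsing tests (DOCX, PDF, HTML, PPTX, XLSX)
-  - Multi-Reader implementation tests (several parsing libraries tested per format)
-  - Error-scenario tests (missing file, empty file, corrupted file, special characters)
-  - Encoding tests (GBK, UTF-8 BOM, etc.)
-  - Consistency tests (verify that different Readers produce consistent results)
+#### View test coverage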
+```bash
+uv run \
+  --with pytest \
+  --with pytest-cov \
+  --with chardet \
+  pytest --cov=scripts --cov-report=term-missing
+```

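-Before running tests, use `uv run --with` to install the dependencies for the type of test. See the "Development Workflow" section above.
-
-
-## Code Conventions
+### Code Conventions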

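 - Language: Chinese only (communication, comments, documentation, code)
 - Module files: 150-300 lines
 - Error handling: custom exceptions + clear messages + location context
-- Git commits: type: short description (feat/fix/refactor/docs/style/test/chore)
+- Git commits: `type: short description` (feat/fix/refactor/docs/style/test/chore)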

-## Skill Documentation Conventions
+## Documentation

-skill/SKILL.md is written for AI users and must follow the best practices in the Claude Skill building guide:
-
-### YAML frontmatter
-
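-- **name**: kebab-case format
-- **description**: covers functionality, trigger words, file types, and typical tasks
-- **license**: MIT
-- **metadata**: includes version and author
-- **compatibility**: states the Python version requirement and dependencies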
-
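-### Document Section Structure
-
-1. **Purpose**: describes the unified entry point and the dual-path execution strategy
-2. **When to Use**: typical scenarios and a trigger-word list (Chinese and English, file extensions)
-3. **Quick Reference**: command parameter table
-4. **Workflow**: 4-step workflow (detect environment, identify type, run parsing, output result)
-5. **Usage Examples**: basic and advanced usage for each document type
-6. **Error Handling**: common errors and solutions
-7. **References**: links to project documentation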
-
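-### Dependency Management
-
-- Load dependencies on demand with `uv run --with`
-- Concrete pip package names must be used
-- The `-a/--advice` flag quickly gives the command for a specific file
-
-## Parser Architecture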
-
-### DOCX
-docling, unstructured, pypandoc-binary, MarkItDown, python-docx, XML
-
-### XLSX
-docling, unstructured, MarkItDown, pandas, XML
-
-### PPTX
-docling, unstructured, MarkItDown, python-pptx, XML
-
-### PDF (OCR first)
-docling OCR, unstructured OCR, docling, unstructured, MarkItDown, pypdf
-
-### HTML/URL
-trafilatura, domscribe, MarkItDown, html2text
+- **README.md** (this document): for project developers
+- **SKILL.md**: the Skill document for AI use
+- **openspec/**: OpenSpec spec documents

 ## License