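docs: separate user documentation from developer documentation

- Restructure README.md as developer documentation covering the development environment, workflow, and code conventions
- Add skill/SKILL.md as user documentation covering quick start and command options
- Update openspec/config.yaml to add a project overview and declare the skill directory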
2026-03-08 18:08:44 +08:00
parent 15b63800a8
commit b98e70383c
3 changed files with 172 additions and 179 deletions
--- a/README.md
+++ b/README.md
@@ -1,200 +1,63 @@
 # lyxy-document

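-A unified tool that helps AI tools read documents and convert them to Markdown.
+Unified document parsing tool: converts DOCX, XLSX, PPTX, PDF, and HTML/URL to Markdown

-## Features
+## Development environment

-Parses multiple document formats and converts them all to Markdown:
-
-- **Office documents**: DOCX, XLSX, PPTX
-- **PDF documents**: OCR-first parsing
-- **HTML/URL**: local HTML files and online URLs (downloaded, then parsed)
-
-## Installation
-
-### Basic installation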
-```bash
-uv add lyxy-document
-```
-
-### Optional dependency groups
-
-Install only what you need, by file type:
-
-```bash
-# Install DOCX support only
-uv add "lyxy-document[docx]"
-
-# Install XLSX support only
-uv add "lyxy-document[xlsx]"
-
-# Install PPTX support only
-uv add "lyxy-document[pptx]"
-
-# Install PDF support only
-uv add "lyxy-document[pdf]"
-
-# Install HTML support only
-uv add "lyxy-document[html]"
-
-# Install HTTP/URL download support only
-uv add "lyxy-document[http]"
-```
-
-Combined groups:
-
-```bash
-# Install support for all Office formats
-uv add "lyxy-document[office]"
-
-# Install web (HTML + HTTP) support
-uv add "lyxy-document[web]"
-
-# Install everything
-uv add "lyxy-document[full]"
-
-# Install development dependencies
-uv add --dev "lyxy-document[dev]"
-```
-
-## Usage
-
-### Command-line usage
-
-```bash
-# Parse a document to Markdown
-uv run lyxy-document-reader document.docx
-
-# Count characters
-uv run lyxy-document-reader document.docx -c
-
-# Count lines
-uv run lyxy-document-reader document.docx -l
-
-# Extract all headings
-uv run lyxy-document-reader document.docx -t
-
-# Extract a given heading and its content
-uv run lyxy-document-reader document.docx -tc "Heading name"
-
-# Search the document content (regular expressions supported)
-uv run lyxy-document-reader document.docx -s "search keyword" -n 2
-```
-
-### Python API usage
-
-```python
-from core import parse_input, process_content
-from readers import READERS
-
-# Instantiate the readers
-readers = [ReaderCls() for ReaderCls in READERS]
-
-# Parse the file
-content, failures = parse_input("document.docx", readers)
-
-if content:
-    # Post-process the content (strip images, normalize whitespace)
-    content = process_content(content)
-    print(content)
-else:
-    print("Parsing failed:")
-    for failure in failures:
-        print(failure)
-```
+- Dependencies are managed with uv; the host Python must not be used
+- Dependency declarations: pyproject.toml
+- Install: uv sync

 ## Project structure

 ```
-lyxy-document/
-├── scripts/                    # Core code directory
-│   ├── lyxy_document_reader.py # Unified CLI entry point
-│   ├── config.py               # Unified configuration class
-│   ├── core/                   # Core modules
-│   │   ├── exceptions.py       # Custom exception hierarchy
-│   │   ├── markdown.py         # Markdown utility functions
-│   │   └── parser.py           # Unified parsing dispatcher
-│   ├── readers/                # Format readers
-│   │   ├── base.py             # Reader base class
-│   │   ├── docx/               # DOCX reader
-│   │   ├── xlsx/               # XLSX reader
-│   │   ├── pptx/               # PPTX reader
-│   │   ├── pdf/                # PDF reader
-│   │   └── html/               # HTML/URL reader
-│   └── utils/                  # Utility functions
-│       ├── file_detection.py   # File type detection
-│       └── encoding_detection.py # Encoding detection
-├── tests/                      # Tests
-├── openspec/                   # Spec documents
-└── README.md                   # Project documentation
+scripts/          # Core code
+├── core/         # Core modules (parsing dispatch, exceptions, Markdown utilities)
+├── readers/      # Format readers
+└── utils/        # Utility functions
+tests/            # Tests
+openspec/         # Spec documents
+skill/            # SKILL documentation
 ```

-## Parser priority
-
-### DOCX
-1. docling
-2. unstructured
-3. pypandoc-binary
-4. MarkItDown
-5. python-docx
-6. Native XML parsing
-
-### XLSX
-1. docling
-2. unstructured
-3. MarkItDown
-4. pandas
-5. Native XML parsing
-
-### PPTX
-1. docling
-2. unstructured
-3. MarkItDown
-4. python-pptx
-5. Native XML parsing
-
-### PDF (OCR first)
-1. docling OCR
-2. unstructured OCR
-3. docling
-4. unstructured
-5. MarkItDown
-6. pypdf
-
-### HTML/URL
-1. trafilatura
-2. domscribe
-3. MarkItDown
-4. html2text
-
-## Development
-
-### Install development dependencies
-
-```bash
-uv sync --dev
-```
-
-### Run tests
+## Development workflow

 ```bash
+# Run tests
 uv run pytest
-```

-### Code formatting
-
-```bash
+# Code formatting
 uv run black .
 uv run isort .
-```

-### Type checking
-
-```bash
+# Type checking
 uv run mypy .
 ```

+## Code conventions
+
+- Language: Chinese only (communication, comments, documentation, code)
+- Module files: 150-300 lines
+- Error handling: custom exceptions + clear messages + location context
+- Git commits: type: short description (feat/fix/refactor/docs/style/test/chore)
+
+## Parser architecture
+
+### DOCX
+docling, unstructured, pypandoc-binary, MarkItDown, python-docx, XML
+
+### XLSX
+docling, unstructured, MarkItDown, pandas, XML
+
+### PPTX
+docling, unstructured, MarkItDown, python-pptx, XML
+
+### PDF (OCR first)
+docling OCR, unstructured OCR, docling, unstructured, MarkItDown, pypdf
+
+### HTML/URL
+trafilatura, domscribe, MarkItDown, html2text
+
 ## License

 MIT License
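
Note on the parser lists in the README hunks: the removed section numbered the parsers for each format as a priority order, and the Python API example unpacks `parse_input` into `(content, failures)`. A minimal sketch of that kind of first-success fallback is given below. It is an illustration under assumptions, not the project's actual dispatcher: `parse_with_fallback`, `broken_parser`, and `simple_parser` are hypothetical names introduced here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a parser takes a path and returns Markdown, or raises.
Parser = Callable[[str], str]


def parse_with_fallback(
    path: str, parsers: List[Tuple[str, Parser]]
) -> Tuple[Optional[str], List[str]]:
    """Try each (name, parser) pair in order; return the first non-empty result."""
    failures: List[str] = []
    for name, parser in parsers:
        try:
            content = parser(path)
        except Exception as exc:  # a real dispatcher would catch narrower errors
            failures.append(f"{name}: {exc}")
            continue
        if content and content.strip():
            return content, failures
        failures.append(f"{name}: empty output")
    # Every parser failed or produced nothing.
    return None, failures


def broken_parser(path: str) -> str:
    """Hypothetical parser that always fails, e.g. a missing optional dependency."""
    raise RuntimeError("library not installed")


def simple_parser(path: str) -> str:
    """Hypothetical parser that always succeeds."""
    return f"# Parsed {path}"


content, failures = parse_with_fallback(
    "file.docx", [("broken", broken_parser), ("simple", simple_parser)]
)
print(content)
print(failures)
```

Collecting failure messages instead of raising on the first error lets the caller report every attempted parser when none of them succeeds.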
--- a/openspec/config.yaml
+++ b/openspec/config.yaml
@@ -1,12 +1,16 @@
 schema: spec-driven

 context: |
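+  # Project overview
+  - Goal: a unified document parsing tool that converts DOCX/XLSX/PPTX/PDF/HTML/URL to Markdown, intended for use as an AI skill
+
  # Project conventions
  - Language: Chinese only (communication/comments/documentation/code)
  - Python: always run through uv (for scripts and one-off commands, use uv run python -c); never use the host python or install packages on the host
  - Dependencies: declare in pyproject.toml, install with uv
  - Host environment: do not pollute its configuration; ask the user before any required change
-  - Docs: README.md, update user and developer documentation as needed each iteration; no emoji or special characters
+  - Developer docs: README.md, update the developer documentation as needed each iteration; no emoji or special characters
+  - Skill docs: skill/SKILL.md, update the skill documentation as needed each iteration
  - Testing: every requirement must have comprehensive tests designed for it
  - Tasks: never create tasks that change git state (push/commit, etc.); read-only git is allowed (status/log/diff, etc.)
  - Code: module files 150-300 lines; errors need custom exceptions + clear messages + location context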
@@ -15,8 +19,9 @@ context: |

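  # Project directory layout
  - scripts/: core code directory
+  - skill/: skill documentation directory
  - tests/: test directory
  - openspec/: spec documents directory
  - temp/: directory for temporary development files
  - pyproject.toml: project configuration
-  - README.md: project documentation
+  - README.md: project developer documentation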
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -0,0 +1,125 @@
+---
+name: lyxy-document-reader
+description: Unified document parsing tool - converts DOCX/XLSX/PPTX/PDF/HTML/URL to Markdown
+license: MIT
+metadata:
+  version: "1.0"
+---
+
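+# Quick start
+
+```bash
+# Basic parsing
+uv run lyxy-document-reader document.docx
+
+# Parse a URL
+uv run lyxy-document-reader https://example.com
+```
+
+# Command options
+
+## Basic parameters
+
+- `input_path`: file path or URL (required)
+
+## Mutually exclusive operations (choose one)
+
+| Option | Description |
+|------|------|
+| None | Output the full Markdown |
+| `-c` / `--count` | Count characters |
+| `-l` / `--lines` | Count lines |
+| `-t` / `--titles` | Extract all headings (levels 1-6) |
+| `-tc <name>` | Extract the given heading and its content |
+| `-s <pattern>` | Regular-expression search |
+
+## Auxiliary options
+
+| Option | Description | Used with |
+|------|------|------|
+| `-n <num>` / `--context <num>` | Number of context lines around each search result (default 2) | `-s` |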
+
+# Usage by document type
+
+## DOCX
+
+```bash
+uv run lyxy-document-reader file.docx
+```
+
+## PDF
+
+```bash
+uv run lyxy-document-reader file.pdf
+```
+
+## HTML/URL
+
+```bash
+# Local file
+uv run lyxy-document-reader page.html
+
+# URL
+uv run lyxy-document-reader https://example.com
+```
+
+## XLSX
+
+```bash
+uv run lyxy-document-reader file.xlsx
+```
+
+## PPTX
+
+```bash
+uv run lyxy-document-reader file.pptx
+```
+
+# Advanced usage
+
+## Searching content
+
+```bash
+# Search for a keyword
+uv run lyxy-document-reader file.docx -s "keyword"
+
+# Set the number of context lines
+uv run lyxy-document-reader file.docx -s "keyword" -n 5
+
+# Regular expression
+uv run lyxy-document-reader file.docx -s "\d{4}-\d{2}-\d{2}"
+```
+
+## Extracting headings
+
+```bash
+# List all headings
+uv run lyxy-document-reader file.docx -t
+
+# Extract the content under a given heading
+uv run lyxy-document-reader file.docx -tc "Chapter 3"
+```
+
+# Python API
+
+```python
+from scripts.core import parse_input, process_content
+from scripts.readers import READERS
+
+readers = [ReaderCls() for ReaderCls in READERS]
+content, failures = parse_input("document.docx", readers)
+
+if content:
+    content = process_content(content)
+    print(content)
+```
+
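+# Error handling
+
+| Error message | Cause | Fix |
+|---------|------|------|
+| `错误: input_path 不能为空` (error: input_path must not be empty) | No input was provided | Provide a file path or URL |
+| `错误: 不支持的文件类型` (error: unsupported file type) | No reader matches the input | Check the file extension |
+| `所有解析方法均失败` (all parsing methods failed) | Every parser failed | Check whether the file is corrupted |
+| `错误: 无效的正则表达式` (error: invalid regular expression) | Regex syntax error | Check the regex syntax |
+| `错误: 未找到匹配` (error: no match found) | The search returned no results | Check the search term or regex |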
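
Note on the `-s` / `-n` options documented in skill/SKILL.md: a regular-expression search that returns each matching line with a number of surrounding lines could look roughly like the sketch below. It only illustrates the documented behaviour and is not the tool's actual code. `search_with_context` is a hypothetical helper, and raising `ValueError` for a bad pattern is an assumption made here.

```python
import re
from typing import List


def search_with_context(markdown: str, pattern: str, context: int = 2) -> List[str]:
    """Return one snippet per matching line, with `context` lines on each side."""
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        # Assumption: surface a bad pattern as a clear, dedicated error.
        raise ValueError(f"invalid regular expression: {exc}") from exc

    lines = markdown.splitlines()
    snippets: List[str] = []
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - context)               # clamp at the first line
            end = min(len(lines), index + context + 1)    # clamp at the last line
            snippets.append("\n".join(lines[start:end]))
    return snippets


sample = "# Log\nintro\nreleased 2026-03-08\noutro\nend"
print(search_with_context(sample, r"\d{4}-\d{2}-\d{2}", context=1))
```

An empty result list corresponds to the "no match found" row in the error table, and the `ValueError` path corresponds to the "invalid regular expression" row.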