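docs: separate user docs from developer docs

- Restructure README.md as the developer documentation, covering the development environment, workflow, and code conventions
- Add skill/SKILL.md as the user documentation, covering quick start and command options
- Update openspec/config.yaml with a project overview and a declaration of the skill directory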
2026-03-08 18:08:44 +08:00
parent 15b63800a8
commit b98e70383c
3 changed files with 172 additions and 179 deletions
--- a/README.md
+++ b/README.md
@@ -1,200 +1,63 @@
 # lyxy-document
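-Unified tool that helps AI tools read documents and convert them to Markdown.
+Unified document parsing tool - converts DOCX, XLSX, PPTX, PDF, and HTML/URL to Markdown
-## Features
+## Development environment
-Parses multiple document formats and converts them all to Markdown:
+- Manage dependencies with uv; do not use the host Python
-
+- Dependency declarations: pyproject.toml
- **Office documents**: DOCX, XLSX, PPTX
+- Install: uv sync
 - **PDF documents**: OCR-first parsing
 - **HTML/URL**: parses local HTML files and downloads and parses online URLs
 ## Installation
 ### Basic installation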
 ```bash
 uv add lyxy-document
 ```
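 ### Optional dependency groups
 Install only what you need, by file type:
 ```bash
 # Install DOCX support only
 uv add "lyxy-document[docx]"
 # Install XLSX support only
 uv add "lyxy-document[xlsx]"
 # Install PPTX support only
 uv add "lyxy-document[pptx]"
 # Install PDF support only
 uv add "lyxy-document[pdf]"
 # Install HTML support only
 uv add "lyxy-document[html]"
 # Install HTTP/URL download support only
 uv add "lyxy-document[http]"
 ```
 Combined groups: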
 ```bash
 # Install support for all Office formats
 uv add "lyxy-document[office]"
 # Install Web (HTML + HTTP) support
 uv add "lyxy-document[web]"
 # Install every feature
 uv add "lyxy-document[full]"
 # Install development dependencies
 uv add --dev "lyxy-document[dev]"
 ```
 ## Usage
 ### Command line
 ```bash
 # Parse a document to Markdown
 uv run lyxy-document-reader document.docx
 # Count characters
 uv run lyxy-document-reader document.docx -c
 # Count lines
 uv run lyxy-document-reader document.docx -l
 # Extract all headings
 uv run lyxy-document-reader document.docx -t
 # Extract the named heading and its content
 uv run lyxy-document-reader document.docx -tc "Heading name"
 # Search the document content (regular expressions supported)
 uv run lyxy-document-reader document.docx -s "search keyword" -n 2
 ```
 ### Python API
 ```python
 from core import parse_input, process_content
 from readers import READERS
 # Instantiate the readers
 readers = [ReaderCls() for ReaderCls in READERS]
 # Parse the file
 content, failures = parse_input("document.docx", readers)
 if content:
     # Post-process the content (strip images, normalize whitespace)
     content = process_content(content)
     print(content)
 else:
     print("Parsing failed:")
     for failure in failures:
         print(failure)
 ```
 ## Project structure
 ```
-lyxy-document/
+scripts/          # Core code
-├── scripts/                    # Core code directory
+├── core/         # Core modules (parse dispatch, exceptions, Markdown utilities)
-│   ├── lyxy_document_reader.py # Unified CLI entry point
+├── readers/      # Format readers
-│   ├── config.py               # Unified configuration class
+└── utils/        # Utility functions
-│   ├── core/                   # Core modules
+tests/            # Tests
-│   │   ├── exceptions.py       # Custom exception hierarchy
+openspec/         # Specification documents
-│   │   ├── markdown.py         # Markdown utility functions
+skill/            # SKILL documentation
 │   │   └── parser.py           # Unified parse dispatcher
 │   ├── readers/                # Format readers
 │   │   ├── base.py             # Reader base class
 │   │   ├── docx/               # DOCX reader
 │   │   ├── xlsx/               # XLSX reader
 │   │   ├── pptx/               # PPTX reader
 │   │   ├── pdf/                # PDF reader
 │   │   └── html/               # HTML/URL reader
 │   └── utils/                  # Utility functions
 │       ├── file_detection.py   # File type detection
 │       └── encoding_detection.py # Encoding detection
 ├── tests/                      # Tests
 ├── openspec/                   # Specification documents
 └── README.md                   # Project documentation
 ```
-## Parser priority
+## Development workflow
 ### DOCX
 1. docling
 2. unstructured
 3. pypandoc-binary
 4. MarkItDown
 5. python-docx
 6. Native XML parsing
 ### XLSX
 1. docling
 2. unstructured
 3. MarkItDown
 4. pandas
 5. Native XML parsing
 ### PPTX
 1. docling
 2. unstructured
 3. MarkItDown
 4. python-pptx
 5. Native XML parsing
 ### PDF (OCR first)
 1. docling OCR
 2. unstructured OCR
 3. docling
 4. unstructured
 5. MarkItDown
 6. pypdf
 ### HTML/URL
 1. trafilatura
 2. domscribe
 3. MarkItDown
 4. html2text
 ## Development
 ### Install development dependencies
 ```bash
 uv sync --dev
 ```
 ### Run tests
 ```bash
 # Run the test suite
 uv run pytest
 ```
-### Code formatting
+# Code formatting
 ```bash
 uv run black .
 uv run isort .
 ```
-### Type checking
+# Type checking
 ```bash
 uv run mypy .
 ```
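 ## Code conventions
 - Language: Chinese only (communication, comments, documentation, code)
 - Module files: 150-300 lines
 - Error handling: custom exceptions + clear messages + location context
 - Git commits: type: short description (feat/fix/refactor/docs/style/test/chore)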
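
The error-handling convention above (custom exceptions, clear messages, location context) could look roughly like the sketch below. The class names `DocumentError` and `UnsupportedFileTypeError` and the `path`/`location` fields are hypothetical illustrations, not the project's actual exception hierarchy in `core/exceptions.py`.

```python
from typing import Optional


class DocumentError(Exception):
    """Hypothetical base class for the tool's custom exceptions."""

    def __init__(self, message: str, *, path: Optional[str] = None,
                 location: Optional[str] = None) -> None:
        self.message = message
        self.path = path          # file or URL being processed
        self.location = location  # e.g. a reader name, sheet, or page
        super().__init__(self._format())

    def _format(self) -> str:
        # Append whichever pieces of context are available.
        parts = [self.message]
        if self.path:
            parts.append(f"path={self.path}")
        if self.location:
            parts.append(f"at {self.location}")
        return " | ".join(parts)


class UnsupportedFileTypeError(DocumentError):
    """Raised when no reader matches the input (hypothetical name)."""


try:
    raise UnsupportedFileTypeError(
        "unsupported file type", path="notes.xyz", location="reader selection"
    )
except DocumentError as exc:
    print(exc)
```

Catching the shared base class lets a caller report any failure with its context in one place.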
 ## Parser architecture
 ### DOCX
 docling, unstructured, pypandoc-binary, MarkItDown, python-docx, XML
 ### XLSX
 docling, unstructured, MarkItDown, pandas, XML
 ### PPTX
 docling, unstructured, MarkItDown, python-pptx, XML
 ### PDF (OCR first)
 docling OCR, unstructured OCR, docling, unstructured, MarkItDown, pypdf
 ### HTML/URL
 trafilatura, domscribe, MarkItDown, html2text
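
The ordered parser lists above read as a fallback chain: try the first parser and move to the next one on failure. A minimal sketch of that idea follows, assuming each parser is a plain callable that takes a path and returns Markdown. `parse_with_fallback` and the stand-in parsers are hypothetical and are not the project's `parse_input`.

```python
from typing import Callable, List, Optional, Tuple

# A parser is a (name, callable) pair; the callable returns Markdown text.
Parser = Tuple[str, Callable[[str], str]]


def parse_with_fallback(
    path: str, parsers: List[Parser]
) -> Tuple[Optional[str], List[str]]:
    """Try parsers in priority order; return (content, failure messages)."""
    failures: List[str] = []
    for name, parse in parsers:
        try:
            content = parse(path)
        except Exception as exc:  # any parser error moves on to the next one
            failures.append(f"{name}: {exc}")
            continue
        if content and content.strip():
            return content, failures
        failures.append(f"{name}: empty output")
    return None, failures


# Stand-in parsers for illustration only.
def _unavailable(path: str) -> str:
    raise RuntimeError("library not installed")


def _simple(path: str) -> str:
    return f"# {path}\n"


content, failures = parse_with_fallback(
    "file.docx", [("docling", _unavailable), ("python-docx", _simple)]
)
print(content)
print(failures)
```

Collecting the failure messages instead of raising on the first error is what allows a caller to report every attempted parser when all of them fail.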
 ## License
 MIT License
--- a/openspec/config.yaml
+++ b/openspec/config.yaml
@@ -1,12 +1,16 @@
 schema: spec-driven
 context: |
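  # Project overview
  - Goal: a unified document parsing tool that converts DOCX/XLSX/PPTX/PDF/HTML/URL to Markdown, intended for use as an AI skill
  # Project conventions
  - Language: Chinese only (communication/comments/documentation/code)
  - Python: always run via uv (for scripts and ad hoc commands use uv run python -c); do not use the host python or install packages on the host
  - Dependencies: declared in pyproject.toml, installed with uv
  - Host environment: do not pollute host configuration; ask the user first when a change is required
-  - Documentation: README.md; update user and developer docs as needed each iteration; no emoji or special characters
+  - Developer docs: README.md; update developer docs as needed each iteration; no emoji or special characters
  - Skill docs: skill/SKILL.md; update the skill docs as needed each iteration
  - Testing: every requirement must have comprehensive tests designed for it
  - Tasks: do not create tasks that modify git state (push/commit, etc.); read-only git commands are allowed (status/log/diff, etc.)
  - Code: module files of 150-300 lines; errors need custom exceptions + clear messages + location context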
@@ -15,8 +19,9 @@ context: |
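  # Project directory structure
  - scripts/: core code
  - skill/: skill documentation
  - tests/: tests
  - openspec/: specification documents
  - temp/: temporary files for development
  - pyproject.toml: project configuration
-  - README.md: project documentation
+  - README.md: project developer documentation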
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -0,0 +1,125 @@
 ---
 name: lyxy-document-reader
 description: Unified document parsing tool - DOCX/XLSX/PPTX/PDF/HTML/URL to Markdown
 license: MIT
 metadata:
  version: "1.0"
 ---
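 # Quick start
 ```bash
 # Basic parsing
 uv run lyxy-document-reader document.docx
 # URL parsing
 uv run lyxy-document-reader https://example.com
 ```
 # Command options
 ## Basic arguments
 - `input_path`: file path or URL (required)
 ## Mutually exclusive operations (choose one)
 | Option | Description |
 |------|------|
 | (none) | Output the full Markdown |
 | `-c` / `--count` | Count characters |
 | `-l` / `--lines` | Count lines |
 | `-t` / `--titles` | Extract all headings (levels 1-6) |
 | `-tc <name>` | Extract the named heading and its content |
 | `-s <pattern>` | Regular-expression search |
 ## Auxiliary options
 | Option | Description | Used with |
 |------|------|------|
 | `-n <num>` / `--context <num>` | Number of context lines around each search result (default 2) | `-s` |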
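
The option tables above map naturally onto an `argparse` mutually exclusive group. The sketch below is only an illustration of that layout: the flags are the ones listed in the tables, while the attribute names `title_content` and `search` and the function `build_parser` are assumptions, and the real CLI entry point may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lyxy-document-reader")
    parser.add_argument("input_path", help="file path or URL")

    # Only one of these operations may be given per invocation.
    ops = parser.add_mutually_exclusive_group()
    ops.add_argument("-c", "--count", action="store_true")
    ops.add_argument("-l", "--lines", action="store_true")
    ops.add_argument("-t", "--titles", action="store_true")
    ops.add_argument("-tc", dest="title_content", metavar="name")
    ops.add_argument("-s", dest="search", metavar="pattern")

    # Auxiliary option, meaningful together with -s.
    parser.add_argument("-n", "--context", type=int, default=2, metavar="num")
    return parser


args = build_parser().parse_args(["file.docx", "-s", "keyword", "-n", "5"])
print(args.search, args.context)
```

With this layout, passing two operations at once (for example `-c -l`) makes `argparse` exit with a usage error, which matches the "choose one" rule in the table.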
 # Usage by document type
 ## DOCX
 ```bash
 uv run lyxy-document-reader file.docx
 ```
 ## PDF
 ```bash
 uv run lyxy-document-reader file.pdf
 ```
 ## HTML/URL
 ```bash
 # Local file
 uv run lyxy-document-reader page.html
 # URL
 uv run lyxy-document-reader https://example.com
 ```
 ## XLSX
 ```bash
 uv run lyxy-document-reader file.xlsx
 ```
 ## PPTX
 ```bash
 uv run lyxy-document-reader file.pptx
 ```
 # Advanced usage
 ## Searching content
 ```bash
 # Search for a keyword
 uv run lyxy-document-reader file.docx -s "keyword"
 # Set the number of context lines
 uv run lyxy-document-reader file.docx -s "keyword" -n 5
 # Regular expression
 uv run lyxy-document-reader file.docx -s "\d{4}-\d{2}-\d{2}"
 ```
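
To make the `-s` / `-n` behavior concrete, here is a rough Python sketch of a line-based regex search that returns each match with surrounding context lines. `search_with_context` is a hypothetical helper written for illustration; how the tool actually merges or formats overlapping results is not specified in this document.

```python
import re
from typing import List


def search_with_context(text: str, pattern: str, context: int = 2) -> List[str]:
    """Return one snippet per matching line, padded with context lines."""
    regex = re.compile(pattern)  # raises re.error on an invalid pattern
    lines = text.splitlines()
    snippets: List[str] = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            snippets.append("\n".join(lines[start:end]))
    return snippets


sample = "intro\nnotes\nReleased 2026-03-08\nchanges\nend"
for snippet in search_with_context(sample, r"\d{4}-\d{2}-\d{2}", context=1):
    print(snippet)
```

An invalid pattern surfaces as `re.error` from `re.compile`, which is the kind of condition the "invalid regular expression" entry in the error table refers to.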
 ## Extracting headings
 ```bash
 # List all headings
 uv run lyxy-document-reader file.docx -t
 # Extract the content under a specific heading
 uv run lyxy-document-reader file.docx -tc "Chapter 3"
 ```
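
The two heading operations can be pictured with the sketch below, which works on ATX-style Markdown headings (`#` through `######`). The helper names `list_headings` and `extract_section` are hypothetical, and the rule that a section ends at the next heading of the same or a higher level is an assumption about the intended behavior. For brevity the sketch does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import List, Optional, Tuple

# Matches Markdown headings of levels 1-6, e.g. "## Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def list_headings(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) for every heading line."""
    found = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            found.append((len(match.group(1)), match.group(2)))
    return found


def extract_section(markdown: str, title: str) -> Optional[str]:
    """Return the named heading plus its content, or None if not found."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        match = HEADING.match(line)
        if match and match.group(2) == title:
            level = len(match.group(1))
            end = len(lines)
            # The section ends at the next heading of equal or higher level.
            for j in range(i + 1, len(lines)):
                nxt = HEADING.match(lines[j])
                if nxt and len(nxt.group(1)) <= level:
                    end = j
                    break
            return "\n".join(lines[i:end])
    return None


doc = "# Guide\nintro\n## Chapter 3\nbody\n### Detail\nmore\n## Chapter 4\nend"
print(list_headings(doc))
print(extract_section(doc, "Chapter 3"))
```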
 # Python API
 ```python
 from scripts.core import parse_input, process_content
 from scripts.readers import READERS
 readers = [ReaderCls() for ReaderCls in READERS]
 content, failures = parse_input("document.docx", readers)
 if content:
    content = process_content(content)
    print(content)
 ```
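 # Error handling
 | Error message | Cause | Fix |
 |---------|------|------|
 | Error: input_path must not be empty | No input was provided | Provide a file path or URL |
 | Error: unsupported file type | No matching reader | Check the file extension |
 | All parsing methods failed | Every parser failed | Check whether the file is corrupted |
 | Error: invalid regular expression | Regex syntax error | Check the regex syntax |
 | Error: no match found | The search returned no results | Check the search term or regex |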