docs: separate user documentation from developer documentation

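- Restructure README.md into developer documentation covering the development environment, workflow, and code conventions
- Add skill/SKILL.md as the user documentation, covering quick start and command options
- Update openspec/config.yaml to add a project overview and declare the skill directory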
2026-03-08 18:08:44 +08:00
parent 15b63800a8
commit b98e70383c
3 changed files with 172 additions and 179 deletions
--- a/README.md
+++ b/README.md
@@ -1,200 +1,63 @@
 # lyxy-document

-A unified tool that helps AI tools read documents and convert them to Markdown.
+Unified document parsing tool - converts DOCX, XLSX, PPTX, PDF, and HTML/URL to Markdown

-## Features
+## Development Environment

-Parses multiple document formats and converts them all to Markdown:
-
- **Office documents**: DOCX, XLSX, PPTX
- **PDF documents**: OCR-first parsing
- **HTML/URL**: parses local HTML files and downloads and parses online URLs
-
-## Installation
-
-### Basic installation
-
-```bash
-uv add lyxy-document
-```
-
-### Optional dependency groups
-
-Install on demand by file type:
-
-```bash
-# Install DOCX support only
-uv add "lyxy-document[docx]"
-
-# Install XLSX support only
-uv add "lyxy-document[xlsx]"
-
-# Install PPTX support only
-uv add "lyxy-document[pptx]"
-
-# Install PDF support only
-uv add "lyxy-document[pdf]"
-
-# Install HTML support only
-uv add "lyxy-document[html]"
-
-# Install HTTP/URL download support only
-uv add "lyxy-document[http]"
-```
-
-Combined groups:
-
-```bash
-# Install support for all Office formats
-uv add "lyxy-document[office]"
-
-# Install Web (HTML + HTTP) support
-uv add "lyxy-document[web]"
-
-# Install all features
-uv add "lyxy-document[full]"
-
-# Install development dependencies
-uv add --dev "lyxy-document[dev]"
-```
-
-## Usage
-
-### Command-line usage
-
-```bash
-# Parse a document to Markdown
-uv run lyxy-document-reader document.docx
-
-# Count characters
-uv run lyxy-document-reader document.docx -c
-
-# Count lines
-uv run lyxy-document-reader document.docx -l
-
-# Extract all headings
-uv run lyxy-document-reader document.docx -t
-
-# Extract a given heading and its content
-uv run lyxy-document-reader document.docx -tc "Heading name"
-
-# Search document content (regular expressions supported)
-uv run lyxy-document-reader document.docx -s "search keyword" -n 2
-```
-
-### Python API usage
-
-```python
-from core import parse_input, process_content
-from readers import READERS
-
-# Instantiate the readers
-readers = [ReaderCls() for ReaderCls in READERS]
-
-# Parse the file
-content, failures = parse_input("document.docx", readers)
-
-if content:
-    # Process the content (remove images, normalize whitespace)
-    content = process_content(content)
-    print(content)
-else:
-    print("Parsing failed:")
-    for failure in failures:
-        print(failure)
-```
+- Use uv to manage dependencies; do not use the host Python
+- Dependency declarations: pyproject.toml
+- Install: uv sync

 ## Project Structure

 ```
-lyxy-document/
-├── scripts/                    # Core code directory
-│   ├── lyxy_document_reader.py # Unified CLI entry point
-│   ├── config.py               # Unified configuration class
-│   ├── core/                   # Core modules
-│   │   ├── exceptions.py       # Custom exception hierarchy
-│   │   ├── markdown.py         # Markdown utility functions
-│   │   └── parser.py           # Unified parsing dispatcher
-│   ├── readers/                # Format readers
-│   │   ├── base.py             # Reader base class
-│   │   ├── docx/               # DOCX reader
-│   │   ├── xlsx/               # XLSX reader
-│   │   ├── pptx/               # PPTX reader
-│   │   ├── pdf/                # PDF reader
-│   │   └── html/               # HTML/URL reader
-│   └── utils/                  # Utility functions
-│       ├── file_detection.py   # File type detection
-│       └── encoding_detection.py # Encoding detection
-├── tests/                      # Tests
-├── openspec/                   # Specification documents
-└── README.md                   # Project documentation
+scripts/          # Core code
+├── core/         # Core modules (parsing dispatch, exceptions, Markdown utilities)
+├── readers/      # Format readers
+└── utils/        # Utility functions
+tests/            # Tests
+openspec/         # Specification documents
+skill/            # SKILL documentation
 ```

-## Parser Priority
-
-### DOCX
-1. docling
-2. unstructured
-3. pypandoc-binary
-4. MarkItDown
-5. python-docx
-6. Native XML parsing
-
-### XLSX
-1. docling
-2. unstructured
-3. MarkItDown
-4. pandas
-5. Native XML parsing
-
-### PPTX
-1. docling
-2. unstructured
-3. MarkItDown
-4. python-pptx
-5. Native XML parsing
-
-### PDF (OCR first)
-1. docling OCR
-2. unstructured OCR
-3. docling
-4. unstructured
-5. MarkItDown
-6. pypdf
-
-### HTML/URL
-1. trafilatura
-2. domscribe
-3. MarkItDown
-4. html2text
-
-## Development
-
-### Install development dependencies
-
-```bash
-uv sync --dev
-```
-
-### Run tests
+## Development Workflow

 ```bash
+# Run tests
 uv run pytest
-```

-### Code formatting
-
-```bash
+# Code formatting
 uv run black .
 uv run isort .
-```

-### Type checking
-
-```bash
+# Type checking
 uv run mypy .
 ```

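+## Code Conventions
+
+- Language: Chinese only (communication, comments, documentation, code)
+- Module files: 150-300 lines
+- Error handling: custom exceptions + clear messages + location context
+- Git commits: type: short description (feat/fix/refactor/docs/style/test/chore)
+
+## Parser Architecture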
+
+### DOCX
+docling, unstructured, pypandoc-binary, MarkItDown, python-docx, XML
+
+### XLSX
+docling, unstructured, MarkItDown, pandas, XML
+
+### PPTX
+docling, unstructured, MarkItDown, python-pptx, XML
+
+### PDF (OCR first)
+docling OCR, unstructured OCR, docling, unstructured, MarkItDown, pypdf
+
+### HTML/URL
+trafilatura, domscribe, MarkItDown, html2text
+
 ## License

 MIT License