Optimize skill prompts

2026-02-25 17:36:42 +08:00
parent ae3b123eeb
commit cd40a58f33
22 changed files with 1264 additions and 889 deletions
--- a/skills/lyxy-reader-office/references/error-handling.md
+++ b/skills/lyxy-reader-office/references/error-handling.md
@@ -0,0 +1,41 @@
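+# Error Handling and Limitations
+
+## Limitations
+
+- Image extraction is not supported (plain text only)
+- Complex formatting (fonts, colors, layout, etc.) is not preserved
+- Document editing or modification is not supported
+- Only .docx, .xlsx, .pptx, and .pdf are supported (legacy formats such as .doc, .xls, and .ppt are not)
+- PDF has no built-in native XML parser; at least pypdf must be installed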
+
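+## Best Practices
+
+1. **Always prefer lyxy-runner-python**: if it exists in the environment, scripts must be run through lyxy-runner-python
+2. **Consult the README**: see `scripts/README.md` for detailed parameters, dependency installation, and a parser comparison
+3. **Large files**: for large documents, use section extraction (`-tc`) or search (`-s`) to limit the scope of processing
+4. **PDF headings**: PDF is a page-layout format and carries no semantic headings by default; use `--high-res` when the heading hierarchy is needed
+5. **No automatic installation**: when falling back to direct Python execution, only suggest the dependencies to the user; never run pip install automatically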
+
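+## Dependency Execution Strategy
+
+### lyxy-runner-python Is Required
+
+If the lyxy-runner-python skill exists in the environment, it **must** be used to run the parser.py script:
+- lyxy-runner-python manages dependencies with uv and installs the required third-party libraries automatically
+- The environment is isolated and does not pollute the system Python
+- Cross-platform (Windows/macOS/Linux)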
+
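+### Falling Back to Direct Execution
+
+Fall back to direct Python execution **only when** the lyxy-runner-python skill does not exist:
+- The user must install dependencies manually
+- DOCX/PPTX/XLSX work without any dependencies through native XML parsing
+- PDF requires at least pypdf
+- **Never run pip install automatically**; only give the user installation suggestions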
+
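+## When Not to Use
+
+- Image content must be extracted (only plain text is supported)
+- Complex formatting information (fonts, colors, layout) must be preserved
+- The document must be edited or modified
+- Legacy formats such as .doc, .xls, or .ppt must be processed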
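Note on the fallback rule in error-handling.md above: the "suggest, never install" policy could be sketched roughly as follows. This is an illustrative sketch only, not code from this commit; the helper names `missing_dependencies` and `install_hint` are hypothetical, and it assumes each dependency can be detected by its top-level import name (for example `pypdf`).

```python
import importlib.util


def missing_dependencies(modules):
    """Return the top-level module names that cannot be imported."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]


def install_hint(missing):
    """Build a message for the user; nothing is installed here."""
    if not missing:
        return ""
    # Only a suggestion is produced. Running pip is left to the user.
    return "Missing dependencies. Please install them manually: pip install " + " ".join(missing)


if __name__ == "__main__":
    hint = install_hint(missing_dependencies(["pypdf"]))
    if hint:
        print(hint)
```

The check only reports what is missing and builds a message, so the decision to install stays with the user, as the policy requires.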
--- a/skills/lyxy-reader-office/references/examples.md
+++ b/skills/lyxy-reader-office/references/examples.md
@@ -0,0 +1,55 @@
+# Examples
+
+## Extract the Full Document Content
+
+```bash
+# DOCX
+uv run --with "markitdown[docx]" skills/lyxy-reader-office/scripts/parser.py /path/to/report.docx
+
+# PPTX
+uv run --with "markitdown[pptx]" skills/lyxy-reader-office/scripts/parser.py /path/to/slides.pptx
+
+# XLSX
+uv run --with "markitdown[xlsx]" skills/lyxy-reader-office/scripts/parser.py /path/to/data.xlsx
+
+# PDF
+uv run --with "markitdown[pdf]" --with pypdf skills/lyxy-reader-office/scripts/parser.py /path/to/doc.pdf
+```
+
+## Get the Document Word Count
+
+```bash
+uv run --with "markitdown[docx]" skills/lyxy-reader-office/scripts/parser.py -c /path/to/report.docx
+```
+
+## Extract All Headings
+
+```bash
+uv run --with "markitdown[docx]" skills/lyxy-reader-office/scripts/parser.py -t /path/to/report.docx
+```
+
+## Extract a Specific Section
+
+```bash
+uv run --with "markitdown[docx]" skills/lyxy-reader-office/scripts/parser.py -tc "Chapter 1" /path/to/report.docx
+```
+
+## Search for a Keyword
+
+```bash
+uv run --with "markitdown[docx]" skills/lyxy-reader-office/scripts/parser.py -s "keyword" -n 3 /path/to/report.docx
+```
+
+## High-Accuracy PDF Parsing with OCR
+
+```bash
+uv run --with docling --with pypdf skills/lyxy-reader-office/scripts/parser.py /path/to/scanned.pdf --high-res
+```
+
+## Falling Back to Direct Python Execution
+
+Use this only when the lyxy-runner-python skill does not exist:
+
+```bash
+python3 skills/lyxy-reader-office/scripts/parser.py /path/to/file.docx
+```
--- a/skills/lyxy-reader-office/references/parsers.md
+++ b/skills/lyxy-reader-office/references/parsers.md
@@ -0,0 +1,58 @@
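+# Parsers and Dependency Installation
+
+## Multi-Strategy Parsing with Fallback
+
+Each file format has several parsers. They are tried in priority order, and when one fails the next is tried automatically.
+
+See `scripts/README.md` for the detailed parser priorities and comparison.
+
+## Dependency Installation
+
+### Using uv (Recommended)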
+
+```bash
+# DOCX - recommended dependencies
+uv run --with "markitdown[docx]" skills/lyxy-reader-office/scripts/parser.py /path/to/file.docx
+
+# PPTX - recommended dependencies
+uv run --with "markitdown[pptx]" skills/lyxy-reader-office/scripts/parser.py /path/to/file.pptx
+
+# XLSX - recommended dependencies
+uv run --with "markitdown[xlsx]" skills/lyxy-reader-office/scripts/parser.py /path/to/file.xlsx
+
+# PDF - recommended dependencies
+uv run --with "markitdown[pdf]" --with pypdf skills/lyxy-reader-office/scripts/parser.py /path/to/file.pdf
+
+# PDF OCR high-accuracy mode
+uv run --with docling --with pypdf skills/lyxy-reader-office/scripts/parser.py /path/to/file.pdf --high-res
+```
+
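+> **Note**: the commands above use the minimum recommended dependencies. For additional parser dependencies and the full installation commands, see the installation section of `scripts/README.md`.
+
+## Output Characteristics by Format
+
+- **DOCX**: standard Markdown document structure
+- **PPTX**: each slide is headed by `## Slide N`, and slides are separated by `---`
+- **XLSX**: worksheets are distinguished by `## SheetName`, and data is rendered as Markdown tables
+- **PDF**: plain text stream; `--high-res` enables OCR layout analysis to recognize headings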
+
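+## Capabilities
+
+### 1. Full-Text Conversion to Markdown
+Parses the complete document into Markdown, removing images but keeping text formatting (headings, lists, tables, bold, italics, etc.).
+
+### 2. Document Metadata
+- Word count (`-c` option)
+- Line count (`-l` option)
+
+### 3. Heading List Extraction
+Extracts all level 1-6 headings in the document (`-t` option) and returns them in their original hierarchy.
+
+### 4. Specific Section Extraction
+Extracts the full content of a specific section by heading name (`-tc` option), including the chain of parent headings and all subordinate content.
+
+### 5. Regular Expression Search
+Searches the document for a keyword or pattern (`-s` option), with a configurable number of context lines (`-n` option, default 2).
+
+### 6. PDF OCR High-Accuracy Mode
+Enables OCR layout analysis for PDF files (`--high-res` option), suited to scanned PDFs or cases where the heading hierarchy must be recognized.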
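Note on capability 5 in parsers.md above: a regular expression search with surrounding context lines (the behavior described for `-s` and `-n`) might look roughly like the sketch below. This is only an illustration of the idea, not the actual implementation in `parser.py`; the function name `search_with_context` and its return shape are assumptions.

```python
import re


def search_with_context(text, pattern, context=2):
    """Return one snippet per matching line, with `context` lines on each side."""
    lines = text.splitlines()
    regex = re.compile(pattern)
    snippets = []
    for index, line in enumerate(lines):
        if regex.search(line):
            # Clamp the window to the document boundaries.
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            snippets.append("\n".join(lines[start:end]))
    return snippets


if __name__ == "__main__":
    sample = "intro\nbackground\nkeyword here\ndetails\nsummary\nappendix"
    for snippet in search_with_context(sample, r"keyword", context=1):
        print(snippet)
```

Overlapping matches each produce their own snippet in this sketch; a real tool might merge adjacent windows instead.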