Create lyxy-reader-html skill

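- Add skill: lyxy-reader-html, for parsing HTML files and web pages fetched from URLs
- Support URL download (fallback priority: pyppeteer → selenium → httpx → urllib)
- Support HTML parsing (fallback priority: trafilatura → domscribe → MarkItDown → html2text)
- Support queries: full-text extraction, word count, line count, heading extraction, section extraction, regex search
- Add spec: html-document-parsing
- Archive change: create-lyxy-reader-html-skill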
2026-03-08 02:02:03 +08:00
parent 0bd9ec8a36
commit 6b4fcf2647
16 changed files with 1827 additions and 3 deletions
--- a/skills/lyxy-reader-html/references/error-handling.md
+++ b/skills/lyxy-reader-html/references/error-handling.md
@@ -0,0 +1,54 @@
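+# Error Handling and Limitations
+
+## Limitations
+
+- Image extraction is not supported (plain text only)
+- Complex formatting (fonts, colors, layout, etc.) is not preserved
+- Editing or modifying documents is not supported
+- Only URLs, .html, and .htm inputs are supported
+- selenium requires extra environment variables; pyppeteer needs one only when it should use an existing Chromium/Chrome binary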
+
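+## Best Practices
+
+1. **Prefer lyxy-runner-python**: if it is available in the environment, you must use lyxy-runner-python to run the script
+2. **Read the README**: see `scripts/README.md` for detailed parameters, dependency installation, and downloader/parser comparisons
+3. **JS-rendered pages**: for pages that need JS rendering, make sure pyppeteer or selenium is installed and the environment variables are configured correctly
+4. **Lightweight use**: if the target page does not need JS rendering, installing only httpx (or relying on the standard-library urllib) gives faster downloads
+5. **No automatic installation**: when falling back to direct Python execution, only prompt the user to install dependencies; never run pip install automatically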
+
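+## Dependency Execution Strategy
+
+### lyxy-runner-python is required
+
+If the lyxy-runner-python skill exists in the environment, you **must** use it to run the parser.py script:
+- lyxy-runner-python manages dependencies with uv and installs the required third-party libraries automatically
+- The environment is isolated and does not pollute the system Python
+- It works across platforms (Windows/macOS/Linux)
+
+### Falling back to direct execution
+
+Fall back to direct Python execution **only when** the lyxy-runner-python skill does not exist:
+- The user must install dependencies manually
+- At minimum, html2text and beautifulsoup4 must be installed
+- **Never run pip install automatically**; only suggest the installation to the user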
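The "prompt, never install" rule for the fallback path can be sketched as follows. This is a minimal illustration, not code from parser.py: the helper names `missing_packages` and `install_hint` are hypothetical, and the only facts taken from the text are the two minimum packages (note that the beautifulsoup4 distribution is imported as `bs4`).

```python
import importlib.util

# Maps import name -> pip package name for the minimum dependencies.
# (Hypothetical table; beautifulsoup4 is imported as "bs4".)
REQUIRED_MODULES = {"html2text": "html2text", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


def install_hint(packages):
    """Build a message for the user; nothing is installed automatically."""
    if not packages:
        return ""
    return (
        "Missing dependencies. Please install them yourself: pip install "
        + " ".join(packages)
    )


if __name__ == "__main__":
    hint = install_hint(missing_packages())
    if hint:
        print(hint)
```

The check only inspects whether the modules are importable and prints a suggestion; it never invokes pip.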
+
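+## JS Rendering Configuration
+
+### pyppeteer configuration
+
+- The first run downloads Chromium automatically (requires a network connection)
+- Alternatively, set the `LYXY_CHROMIUM_BINARY` environment variable to the path of a Chromium/Chrome executable
+
+### selenium configuration
+
+Both environment variables must be set:
+- `LYXY_CHROMIUM_DRIVER` - path to the ChromeDriver executable
+- `LYXY_CHROMIUM_BINARY` - path to the Chromium/Chrome executable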
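A pre-flight check for the selenium requirement might look like the sketch below. Only the two variable names come from the text; the function `selenium_paths` is a hypothetical helper, and how parser.py actually reads these variables is not shown in this commit.

```python
import os

SELENIUM_ENV_VARS = ("LYXY_CHROMIUM_DRIVER", "LYXY_CHROMIUM_BINARY")


def selenium_paths(env=None):
    """Return (paths, missing): paths is None unless both variables are set."""
    env = os.environ if env is None else env
    missing = [name for name in SELENIUM_ENV_VARS if not env.get(name)]
    if missing:
        return None, missing
    return {name: env[name] for name in SELENIUM_ENV_VARS}, []


if __name__ == "__main__":
    paths, missing = selenium_paths()
    if missing:
        print("selenium unavailable; unset variables:", ", ".join(missing))
    else:
        print("selenium can use:", paths)
```

When either variable is unset, a caller would skip selenium and move on to the next downloader.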
+
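+## When Not to Use
+
+- Image content must be extracted (only plain text is supported)
+- Complex formatting (fonts, colors, layout) must be preserved
+- The document must be edited or modified
+- The page requires login or authentication (you must handle cookies/tokens yourself)
+- The page loads content dynamically but no JS-rendering downloader is used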
--- a/skills/lyxy-reader-html/references/examples.md
+++ b/skills/lyxy-reader-html/references/examples.md
@@ -0,0 +1,59 @@
+# Examples
+
+## URL input - extract the full document content
+
+```bash
+# Using uv (recommended)
+uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with beautifulsoup4 scripts/parser.py https://example.com
+
+# Using Python directly
+python scripts/parser.py https://example.com
+```
+
+## HTML file input - extract the full document content
+
+```bash
+# Using uv (recommended)
+uv run --with trafilatura --with domscribe --with markitdown --with html2text --with beautifulsoup4 scripts/parser.py page.html
+
+# Using Python directly
+python scripts/parser.py page.html
+```
+
+## Get the document word count
+
+```bash
+uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -c https://example.com
+```
+
+## Get the document line count
+
+```bash
+uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -l https://example.com
+```
+
+## Extract all headings
+
+```bash
+uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -t https://example.com
+```
+
+## Extract a specific section
+
+```bash
+uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -tc "About Us" https://example.com
+```
+
+## Search for a keyword
+
+```bash
+uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -s "keyword" -n 3 https://example.com
+```
+
+## Falling back to direct Python execution
+
+Use this only when the lyxy-runner-python skill does not exist:
+
+```bash
+python3 scripts/parser.py https://example.com
+```
--- a/skills/lyxy-reader-html/references/parsers.md
+++ b/skills/lyxy-reader-html/references/parsers.md
@@ -0,0 +1,68 @@
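+# Parser Notes and Dependency Installation
+
+## Multi-strategy fallback
+
+URL downloaders are tried in the priority order pyppeteer → selenium → httpx → urllib; HTML parsers are tried in the priority order trafilatura → domscribe → MarkItDown → html2text. When one fails, the next one is tried automatically.
+
+See `scripts/README.md` for detailed priorities and comparisons.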
+
+## Dependency Installation
+
+### Using uv (recommended)
+
+```bash
+# Full setup (all downloaders and parsers)
+uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with pyppeteer --with selenium --with beautifulsoup4 scripts/parser.py https://example.com
+
+# Lightweight setup (httpx + html2text only)
+uv run --with html2text --with httpx --with beautifulsoup4 scripts/parser.py https://example.com
+```
+
+> **Note**: the full command is the recommended one; it includes every component for the best compatibility. See `scripts/README.md` for detailed priorities and comparisons.
+
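+## Downloader Comparison
+
+| Downloader | Pros | Cons | Best for |
+|--------|------|------|---------|
+| **pyppeteer** | Supports JS rendering; good compatibility with modern pages | Heavy dependency; must download Chromium on first run | Modern pages that need JS rendering |
+| **selenium** | Supports JS rendering; mature and stable | Requires configuring the Chromium driver and binary | Modern pages that need JS rendering |
+| **httpx** | Lightweight and fast; modern HTTP client | No JS rendering | Static pages; fast downloads |
+| **urllib** | Python standard library; nothing to install | No JS rendering | Static pages; last-resort fallback |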
+
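+## Parser Comparison
+
+| Parser | Pros | Cons | Best for |
+|--------|------|------|---------|
+| **trafilatura** | Built specifically for extracting main web content; high-quality output | May fail to extract some pages | Main-content extraction for most pages |
+| **domscribe** | Focused on content extraction | Relatively new | Web content extraction |
+| **MarkItDown** | Official Microsoft project; well-structured output | Output is fairly terse | Standard format conversion |
+| **html2text** | Classic library; good compatibility | Used only as the last resort | Fallback parsing |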
+
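+## Capabilities
+
+### 1. URL / HTML file input
+Two input types are supported:
+- URL: the page is downloaded automatically (JS rendering is available when pyppeteer or selenium is used)
+- Local HTML file: read and parsed directly
+
+### 2. Full-text conversion to Markdown
+Converts the complete HTML to Markdown, removing images but keeping text formatting (headings, lists, tables, bold, italics, etc.).
+
+### 3. HTML pre-processing cleanup
+The HTML is cleaned automatically before parsing:
+- Remove script/style/link/svg tags
+- Remove URL attributes such as href/src/srcset/action
+- Remove style attributes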
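The three cleanup rules above can be sketched with the standard library alone. This is an assumption-laden illustration: the skill lists beautifulsoup4 as a dependency and parser.py most likely uses it, whereas this sketch swaps in `html.parser` so it runs without third-party packages. The names `_Cleaner` and `clean_html` are hypothetical, and comments and doctype declarations are silently dropped as a side effect.

```python
from html import escape
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "svg"}  # dropped together with their content
VOID_SKIP_TAGS = {"link"}               # dropped; has no closing tag
SKIP_ATTRS = {"href", "src", "srcset", "action", "style"}


class _Cleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _render(self, tag, attrs, self_closing=False):
        parts = [tag]
        for name, value in attrs:
            if name in SKIP_ATTRS:
                continue
            # The parser unescapes attribute values, so escape them again.
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + (" />" if self_closing else ">")

    def handle_starttag(self, tag, attrs):
        if tag in VOID_SKIP_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag in SKIP_TAGS or tag in VOID_SKIP_TAGS or self.skip_depth:
            return
        self.out.append(self._render(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if tag in VOID_SKIP_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def clean_html(html):
    """Return html without script/style/link/svg elements and URL/style attributes."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Stripping these elements and attributes before parsing keeps scripts, stylesheets, and long URLs out of the Markdown that the parsers produce.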
+
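+### 4. Document metadata
+- Word count (`-c` option)
+- Line count (`-l` option)
+
+### 5. Heading list extraction
+Extracts all level 1-6 headings in the document (`-t` option) and returns them in their original hierarchy.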
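Heading extraction over the converted Markdown can be sketched as follows. This assumes the headings are ATX-style (`#` to `######`) lines in the Markdown output; the function name `extract_headings` is hypothetical and the real `-t` implementation may differ.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def extract_headings(markdown):
    """Return (level, title) pairs in document order, skipping fenced code."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings
```

Keeping the level next to each title preserves the hierarchy without building a tree.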
+
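+### 6. Specific section extraction
+Extracts the complete content of a section by its heading name (`-tc` option), including the chain of parent headings and all nested content.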
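One way to read "parent heading chain plus all nested content" is sketched here. It is an interpretation, not the parser.py implementation: `extract_section` is a hypothetical name, only the first heading with an exactly matching title is returned, and fenced code blocks are not special-cased for brevity.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def extract_section(markdown, title):
    """Return the parent heading lines followed by the section's own lines."""
    lines = markdown.splitlines()
    chain = []  # stack of (level, heading line) for the current ancestors
    for i, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Headings at the same or a deeper level are no longer ancestors.
        while chain and chain[-1][0] >= level:
            chain.pop()
        if match.group(2) == title:
            end = len(lines)
            for j in range(i + 1, len(lines)):
                nxt = HEADING_RE.match(lines[j])
                if nxt and len(nxt.group(1)) <= level:
                    end = j  # the next sibling or higher-level heading ends the section
                    break
            return [heading for _, heading in chain] + lines[i:end]
        chain.append((level, line))
    return []
```

The section ends at the next heading of the same or a higher level, so all deeper subsections stay inside it.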
+
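+### 7. Regular expression search
+Searches the document for a keyword or pattern (`-s` option), with a configurable number of context lines (`-n` option, default 2).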
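Line-based regex search with surrounding context can be sketched like this. The default of 2 context lines matches the text; the function name `search_with_context` and the return shape are assumptions, and the real `-s`/`-n` output format is documented in `scripts/README.md` rather than here.

```python
import re


def search_with_context(text, pattern, context=2):
    """Return (line_number, context_lines) for every line matching pattern."""
    lines = text.splitlines()
    regex = re.compile(pattern)
    results = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            results.append((i + 1, lines[start:end]))  # 1-based line number
    return results
```

Clamping the window at the document boundaries keeps matches near the first or last line from raising index errors.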