Create lyxy-reader-html skill

- Add skill: lyxy-reader-html, for parsing HTML files and URL web page content
- URL downloads with priority fallback: pyppeteer → selenium → httpx → urllib
- HTML parsing with priority fallback: trafilatura → domscribe → MarkItDown → html2text
- Query features: full-text extraction, character count, line count, heading extraction, section extraction, regex search
- Add spec: html-document-parsing
- Archive change: create-lyxy-reader-html-skill
skills/lyxy-reader-html/SKILL.md (new file, 75 lines)

---
name: lyxy-reader-html
description: A skill for parsing HTML files and URL web pages. It converts HTML to Markdown and supports full-text extraction, heading extraction, section extraction, regex search, character counts, and line counts. In URL mode the page is downloaded automatically, with JS rendering support. Read scripts/README.md for detailed usage.
compatibility: Requires Python 3.6+. Recommended to run via the lyxy-runner-python skill, which uses uv to manage dependencies automatically.
---

# HTML Web Page Parsing Skill

Parses HTML files or URL web pages into Markdown, with several query modes.

## Purpose

**Single entry point**: `scripts/parser.py` is the unified command-line entry; it detects the input type (URL or HTML file) automatically and runs the parse.

**Dependency options**: this skill must prefer execution via the lyxy-runner-python skill, falling back to direct Python execution only when it is unavailable.

## When to Use

Use this skill for any task that needs to read or parse HTML files or URL web content.

### Typical Scenarios

- **Web content extraction**: convert a URL or a local HTML file into readable Markdown text
- **Document metadata**: get the document's character count, line count, and similar statistics
- **Heading analysis**: extract the document's heading structure
- **Section extraction**: extract the content of a specific section
- **Content search**: search the document for keywords or patterns

### Trigger Words

- Chinese: "读取/解析/打开 html/htm 网页/URL"
- English: "read/parse/extract html/htm web page url"
- File extensions: `.html`, `.htm`
- URL schemes: `http://`, `https://`

## Quick Reference

| Option | Description |
|------|------|
| (none) | Output the full Markdown content |
| `-c` | Character count |
| `-l` | Line count |
| `-t` | Extract all headings |
| `-tc <name>` | Extract the section under the given heading |
| `-s <pattern>` | Regular-expression search |
| `-n <num>` | Used with `-s`; number of context lines |
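For example, combining the search options (a sketch using the direct-Python fallback; `page.html` is a placeholder file):

```bash
# Regex search with 3 lines of context around each match
python3 scripts/parser.py -s "pattern" -n 3 page.html
```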
## Workflow

1. **Check dependencies**: prefer lyxy-runner-python; otherwise fall back to direct Python execution
2. **Detect the input**: decide automatically whether it is a URL or a local HTML file
3. **Download the content**: in URL mode, try downloaders in priority order pyppeteer → selenium → httpx → urllib
4. **Clean the HTML**: strip script/style/link/svg tags and URL attributes
5. **Parse**: try parsers in priority order trafilatura → domscribe → MarkItDown → html2text
6. **Output**: return the Markdown content or the requested statistics

### Basic Syntax

```bash
# Via lyxy-runner-python (recommended)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with pyppeteer --with selenium --with beautifulsoup4 scripts/parser.py https://example.com

# Fallback: direct execution
python3 scripts/parser.py https://example.com
```

## References

See the `references/` directory for detailed documentation:

| File | Contents |
|------|------|
| `references/examples.md` | Examples for URLs and HTML files: full extraction, character counts, heading extraction, section extraction, search |
| `references/parsers.md` | Parser descriptions, dependency installation, per-parser output characteristics, capabilities |
| `references/error-handling.md` | Limitations, best practices, dependency execution policy |

> **Detailed usage**: read `scripts/README.md` for the full command-line options and the dependency installation guide.
skills/lyxy-reader-html/references/error-handling.md (new file, 54 lines)

# Error Handling and Limitations

## Limitations

- No image extraction (plain text only)
- No preservation of complex formatting (fonts, colors, layout, and so on)
- No document editing or modification
- Only URLs and .html/.htm files are supported
- pyppeteer and selenium need extra environment-variable configuration

## Best Practices

1. **Always prefer lyxy-runner-python**: if it exists in the environment, scripts must be run through lyxy-runner-python
2. **Consult the README**: read `scripts/README.md` for detailed options, dependency installation, and downloader/parser comparisons
3. **JS-rendered pages**: for pages that need JS rendering, make sure pyppeteer or selenium is installed and the environment variables are configured correctly
4. **Lightweight usage**: if the target page does not need JS rendering, installing only httpx/urllib gives faster downloads
5. **No automatic installation**: when falling back to direct Python execution, only suggest dependency installation to the user; never run pip install automatically

## Dependency Execution Policy

### lyxy-runner-python Is Mandatory

If the lyxy-runner-python skill exists in the environment, it **must** be used to run the parser.py script:

- lyxy-runner-python uses uv to manage dependencies and installs the required third-party libraries automatically
- The environment is isolated and does not pollute the system Python
- It is cross-platform (Windows/macOS/Linux)

### Fallback to Direct Execution

**Only when** the lyxy-runner-python skill does not exist, fall back to direct Python execution (see the suggested command below):

- The user must install dependencies manually
- At minimum, html2text and beautifulsoup4 are required
- **Never run pip install automatically**; only suggest the installation to the user
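The concrete suggestion to surface to the user matches the minimal install from `scripts/README.md`:

```bash
# Suggest this to the user; do not run it automatically
pip install html2text beautifulsoup4
```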
## JS Rendering Configuration

### pyppeteer

- The first run downloads Chromium automatically (network access required)
- Alternatively, set the `LYXY_CHROMIUM_BINARY` environment variable to the path of a Chromium/Chrome executable

### selenium

Both environment variables must be set, as in the sketch below:

- `LYXY_CHROMIUM_DRIVER` - path to the ChromeDriver executable
- `LYXY_CHROMIUM_BINARY` - path to the Chromium/Chrome executable
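A minimal shell setup (the paths below are placeholders, not part of this skill):

```bash
# Placeholder paths - point these at your actual ChromeDriver and Chromium/Chrome binaries
export LYXY_CHROMIUM_DRIVER=/usr/local/bin/chromedriver
export LYXY_CHROMIUM_BINARY=/usr/bin/chromium

python3 scripts/parser.py https://example.com
```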
## Not Suitable For

- Extracting image content (plain text only)
- Preserving complex formatting (fonts, colors, layout)
- Editing or modifying documents
- Pages that require login or authentication (Cookie/Token handling is up to you)
- Dynamically loaded content when JS rendering is not used
skills/lyxy-reader-html/references/examples.md (new file, 59 lines)

# Examples

## URL Input - Extract the Full Document

```bash
# With uv (recommended)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with beautifulsoup4 scripts/parser.py https://example.com

# With Python directly
python scripts/parser.py https://example.com
```

## HTML File Input - Extract the Full Document

```bash
# With uv (recommended)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with beautifulsoup4 scripts/parser.py page.html

# With Python directly
python scripts/parser.py page.html
```

## Character Count

```bash
uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -c https://example.com
```

## Line Count

```bash
uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -l https://example.com
```

## Extract All Headings

```bash
uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -t https://example.com
```

## Extract a Specific Section

```bash
uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -tc "About Us" https://example.com
```

## Search for a Keyword

```bash
uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -s "keyword" -n 3 https://example.com
```

## Fallback to Direct Python Execution

Only when the lyxy-runner-python skill does not exist:

```bash
python3 scripts/parser.py https://example.com
```
skills/lyxy-reader-html/references/parsers.md (new file, 68 lines)

# Parsers and Dependency Installation

## Multi-Strategy Fallback

URL downloaders are tried in priority order pyppeteer → selenium → httpx → urllib; HTML parsers are tried in priority order trafilatura → domscribe → MarkItDown → html2text. Whenever one fails, the next is tried automatically.

See `scripts/README.md` for the detailed priorities and comparisons.
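To see the fallback in action, install only the last-resort components; missing downloaders and parsers are skipped automatically:

```bash
# pyppeteer/selenium/httpx absent → urllib downloads;
# trafilatura/domscribe/markitdown absent → html2text parses
uv run --with html2text --with beautifulsoup4 scripts/parser.py https://example.com
```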
## Dependency Installation

### With uv (recommended)

```bash
# Full install (all downloaders and parsers)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with pyppeteer --with selenium --with beautifulsoup4 scripts/parser.py https://example.com

# Lightweight install (urllib + html2text only)
uv run --with html2text --with beautifulsoup4 scripts/parser.py https://example.com
```

> **Note**: the commands above are the recommended installs; including every component gives the best compatibility. See `scripts/README.md` for the detailed priorities and comparisons.

## Downloader Comparison

| Downloader | Pros | Cons | Best For |
|--------|------|------|---------|
| **pyppeteer** | JS rendering; good compatibility with modern pages | Heavy dependency; downloads Chromium on first run | Modern pages that need JS rendering |
| **selenium** | JS rendering; mature and stable | Requires configuring a Chromium driver and binary | Modern pages that need JS rendering |
| **httpx** | Lightweight and fast; modern HTTP client | No JS rendering | Static pages; fast downloads |
| **urllib** | Python standard library; nothing to install | No JS rendering | Static pages; last resort |

## Parser Comparison

| Parser | Pros | Cons | Best For |
|--------|------|------|---------|
| **trafilatura** | Purpose-built for article extraction; high output quality | May fail on some pages | Article extraction from most pages |
| **domscribe** | Focused on content extraction | Relatively new | Web content extraction |
| **MarkItDown** | From Microsoft; well-formed output | Fairly terse output | Standard format conversion |
| **html2text** | Classic library; broad compatibility | Last-resort quality | Fallback parsing |

## Capabilities

### 1. URL / HTML File Input

Two input modes are supported:

- URL: the page is downloaded automatically (with JS rendering support)
- Local HTML file: read and parsed directly

### 2. Full-Text Conversion to Markdown

Parses the complete HTML into Markdown, dropping images but keeping text formatting (headings, lists, tables, bold, italics, and so on).

### 3. HTML Pre-Cleaning

Before parsing, the HTML is cleaned automatically:

- script/style/link/svg tags are removed
- URL attributes such as href/src/srcset/action are removed
- style attributes are removed

### 4. Document Metadata

- Character count (`-c`)
- Line count (`-l`)

### 5. Heading List Extraction

Extracts all level 1-6 headings (`-t`), returned in their original hierarchical order.

### 6. Section Extraction by Heading

Extracts the full content of a section by heading name (`-tc`), including the chain of parent headings and all nested content.

### 7. Regular-Expression Search

Searches the document for keywords or patterns (`-s`), with a configurable number of context lines (`-n`, default 2).
skills/lyxy-reader-html/scripts/README.md (new file, 323 lines)

# HTML Parser Usage

An HTML/URL parser that converts web page content or local HTML files into Markdown.

Two input sources are supported: a URL (the page is downloaded automatically) or a local HTML file. Downloaders are tried in priority order pyppeteer → selenium → httpx → urllib; parsers are tried in priority order trafilatura → domscribe → MarkItDown → html2text.

## Quick Start

```bash
# Minimal run (urllib + html2text only)
python parser.py https://example.com

# Run with the recommended dependencies installed
pip install trafilatura domscribe markitdown html2text httpx beautifulsoup4
python parser.py https://example.com

# One-shot run with uv (dependencies are installed automatically; no manual pip install)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with beautifulsoup4 parser.py https://example.com
```

## Command-Line Usage

### Basic Syntax

```bash
python parser.py <input> [options]
```

`input` can be:

- a URL starting with `http://` or `https://`
- a path to a local `.html` or `.htm` file

With no options, the full Markdown content is printed.

### Options

The following options are mutually exclusive; use at most one at a time:

| Short | Long | Description |
|--------|--------|------|
| `-c` | `--count` | Print the total character count of the parsed document (excluding newlines) |
| `-l` | `--lines` | Print the total line count of the parsed document |
| `-t` | `--titles` | Print all heading lines (levels 1-6, with their `#` prefix) |
| `-tc <name>` | `--title-content <name>` | Extract the given heading and its nested content (`name` without the `#` signs) |
| `-s <pattern>` | `--search <pattern>` | Search the document with a regular expression and print the matches |

Search helper option (used together with `-s`):

| Short | Long | Description |
|--------|--------|------|
| `-n <num>` | `--context <num>` | Number of non-empty lines of context around each match (default: 2) |

### Exit Codes

| Exit code | Meaning |
|--------|------|
| `0` | Parse succeeded |
| `1` | Error (file not found, invalid format, all downloaders failed, all parsers failed, heading not found, invalid regex or no match) |
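A sketch of branching on the exit code in a shell script (note that error messages are printed to stdout):

```bash
# A non-zero exit means the parse failed
if python parser.py page.html > page.md; then
    echo "parsed OK"
else
    echo "parse failed" >&2
fi
```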
### Examples

**URL input:**

```bash
# Print the full Markdown
python parser.py https://example.com
python parser.py https://example.com > output.md
```

**HTML file input:**

```bash
# Print the full Markdown
python parser.py page.html
python parser.py page.html > output.md
```

**Statistics (`-c` / `-l`):**

A single number is printed, which is convenient for pipelines.

```bash
$ python parser.py https://example.com -c
8500

$ python parser.py https://example.com -l
215
```

**Extract headings (`-t`):**

One heading per line, keeping the `#` prefix and level.

```bash
$ python parser.py https://example.com -t
# Welcome to Example
## About Us
## Contact
```

**Extract a heading's content (`-tc`):**

Prints the given heading and its nested content. If the document contains several headings with the same name, the sections are separated by `---`. Each section is preceded by its chain of parent headings.

```bash
$ python parser.py https://example.com -tc "About Us"
# Welcome to Example
## About Us
This is the detailed About Us content...
```

**Search (`-s`):**

Python regular-expression syntax is supported. Multiple matches are separated by `---`. `-n` controls the number of context lines.

```bash
$ python parser.py https://example.com -s "test" -n 1
the line before
a line containing the **test** keyword
the line after
---
the line before another match
more **test** content
the line after another match
```

### Batch Processing

```bash
# Linux/Mac - convert every HTML file
for file in *.html; do
    python parser.py "$file" > "${file%.html}.md"
done

# Windows PowerShell - convert every HTML file
Get-ChildItem *.html | ForEach-Object {
    python parser.py $_.FullName > ($_.BaseName + ".md")
}
```

### Pipelines

```bash
# Keep only the lines containing a keyword
python parser.py https://example.com | grep "important" > important.md

# Count the headings
python parser.py https://example.com -t | wc -l
```

## Installation

The script runs on Python 3.6+. Downloaders and parsers are tried in priority order; installing all dependencies gives the best compatibility. Partial installs also work: the script automatically picks whichever components are available.

### Full Install (recommended)

```bash
# pip
pip install trafilatura domscribe markitdown html2text httpx pyppeteer selenium beautifulsoup4

# uv (one-shot run, no pre-install needed)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with pyppeteer --with selenium --with beautifulsoup4 parser.py https://example.com
```

### Minimal Install

Uses the standard-library urllib downloader plus the html2text fallback parser:

```bash
pip install html2text beautifulsoup4
```

### Components

**Downloaders**:

| Downloader | Pros | Cons | Dependency |
|--------|------|------|------|
| pyppeteer | JS rendering, good compatibility with modern pages | Heavy, downloads Chromium | pyppeteer |
| selenium | JS rendering, mature and stable | Requires configuring a Chromium driver and binary | selenium |
| httpx | Lightweight and fast, modern HTTP client | No JS rendering | httpx |
| urllib | Python standard library, nothing to install | No JS rendering | (none) |

**Parsers**:

| Parser | Pros | Cons | Dependency |
|--------|------|------|------|
| trafilatura | Purpose-built for article extraction, high quality | May fail on some pages | trafilatura |
| domscribe | Focused on content extraction | Relatively new | domscribe |
| MarkItDown | From Microsoft, well-formed output | Fairly terse output | markitdown |
| html2text | Classic library, broad compatibility | Last-resort quality | html2text |

**Other dependencies**:

- `beautifulsoup4` - required for HTML cleaning

### JS Rendering Configuration

pyppeteer and selenium support JS rendering but need extra configuration:

**pyppeteer**:

- The first run downloads Chromium automatically
- Alternatively, set the `LYXY_CHROMIUM_BINARY` environment variable to a Chromium path

**selenium**:

- Both environment variables must be set:
  - `LYXY_CHROMIUM_DRIVER` - ChromeDriver path
  - `LYXY_CHROMIUM_BINARY` - Chromium/Chrome path

## Output Format

### Markdown Document Structure

With no options, the full Markdown is printed, containing elements such as:

```markdown
# Level-1 Heading

Body paragraph

## Level-2 Heading

- Unordered list item
- Unordered list item

1. Ordered list item
2. Ordered list item

| Col1 | Col2 | Col3 |
|------|------|------|
| data1 | data2 | data3 |

**bold** *italic*
```

### Automatic Post-Processing

Before output, the content is processed as follows:

| Step | Description |
|------|------|
| HTML cleaning | script/style/link/svg tags and URL attributes are removed |
| Image removal | Markdown image syntax `![alt](url)` is deleted |
| Blank-line normalization | Runs of blank lines are collapsed into one |

## Error Handling

### Error Messages

```bash
# Input is neither a URL nor an HTML file
$ python parser.py invalid.txt
Error: not a valid HTML file: invalid.txt

# File does not exist
$ python parser.py missing.html
Error: file does not exist: missing.html

# All downloaders failed (URL example)
$ python parser.py https://example.com
All download methods failed:
- pyppeteer: pyppeteer is not installed
- selenium: selenium is not installed
- httpx: httpx is not installed
- urllib: HTTP 404

# All parsers failed
$ python parser.py page.html
All parsing methods failed:
- trafilatura: trafilatura is not installed
- domscribe: domscribe is not installed
- MarkItDown: MarkItDown is not installed
- html2text: html2text is not installed

# Heading not found
$ python parser.py https://example.com -tc "no-such-heading"
Error: heading 'no-such-heading' not found

# Invalid regex or no match
$ python parser.py https://example.com -s "[invalid"
Error: invalid regular expression or no match found: '[invalid'
```

### Fallback Mechanism

The script tries each downloader and parser in priority order. When a component fails, the reason is recorded (library not installed / download failed / parse failed / empty document) and the next one is tried automatically. If everything fails, a summary is printed and the script exits with code 1.

## FAQ

### Why is some content missing from the output?

Different parsers produce different levels of detail. Install all parser dependencies; the script automatically uses the highest-priority parser that is available.

### How do I force a specific downloader/parser?

The current version has no option for this; components are always chosen automatically by priority. You can control the choice indirectly by installing only the dependencies of the components you want - anything not installed is skipped.
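For example, to steer the run to httpx + html2text (assuming a static page):

```bash
# No JS renderers installed, so httpx handles the download;
# no trafilatura/domscribe/markitdown, so html2text does the parsing
uv run --with httpx --with html2text --with beautifulsoup4 parser.py https://example.com
```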
### Slow downloads or anti-bot trouble?

pyppeteer and selenium support JS rendering but are slower. If the target page does not need JS rendering, install only httpx (or rely on the built-in urllib) so the script falls back to these lightweight downloaders.

### Garbled Chinese output?

The script prints UTF-8; make sure the terminal supports it:

```bash
# Linux/Mac
export LANG=en_US.UTF-8

# Windows PowerShell
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
```

## File Layout

```
scripts/
├── common.py       # shared helpers (HTML cleaning, Markdown processing)
├── downloader.py   # URL download module
├── html_parser.py  # HTML parsing module
├── parser.py       # command-line entry point
└── README.md       # this document
```
skills/lyxy-reader-html/scripts/common.py (new file, 225 lines)

#!/usr/bin/env python3
"""Shared helpers for the HTML parser: HTML cleaning, Markdown processing, and input detection."""

import re
from typing import List, Optional

from bs4 import BeautifulSoup

# Matches Markdown image syntax: ![alt](url)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]+\)")
# Matches runs of three or more newlines (two or more consecutive blank lines)
_CONSECUTIVE_BLANK_LINES = re.compile(r"\n{3,}")


def clean_html_content(html_content: str) -> str:
    """Clean HTML content: drop script/style/link/svg tags and URL-bearing attributes."""
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script, style, svg, and link tags entirely
    for tag_name in ("script", "style", "svg", "link"):
        for tag in soup.find_all(tag_name):
            tag.decompose()

    for tag in soup.find_all(True):
        # Drop URL-bearing and presentation attributes
        for attr in ("href", "src", "srcset", "action", "data-href", "style"):
            if attr in tag.attrs:
                del tag[attr]
        # Drop data-* attributes that carry a source URL (e.g. lazy-loading sources)
        data_attrs = [
            attr
            for attr in tag.attrs
            if attr.startswith("data-") and "src" in attr.lower()
        ]
        for attr in data_attrs:
            del tag[attr]
        # Strip URLs out of title attributes
        if "title" in tag.attrs:
            tag["title"] = re.sub(r"https?://\S+", "", tag["title"], flags=re.IGNORECASE)
        # Drop class values that embed URL-like patterns
        if "class" in tag.attrs:
            tag["class"] = [
                c
                for c in tag["class"]
                if not c.startswith("url ") and "hyperlink-href:" not in c
            ]

    return str(soup)


def remove_markdown_images(markdown_text: str) -> str:
    """Remove Markdown image markers from the text."""
    return IMAGE_PATTERN.sub("", markdown_text)


def normalize_markdown_whitespace(content: str) -> str:
    """Normalize Markdown whitespace, collapsing blank-line runs to a single blank line."""
    return _CONSECUTIVE_BLANK_LINES.sub("\n\n", content)


def get_heading_level(line: str) -> int:
    """Return the heading level (1-6) of a Markdown line, or 0 for non-headings."""
    stripped = line.lstrip()
    if not stripped.startswith("#"):
        return 0
    without_hash = stripped.lstrip("#")
    level = len(stripped) - len(without_hash)  # number of leading '#' characters
    if not (1 <= level <= 6):
        return 0
    if len(stripped) == level:
        # The line consists of hashes only, e.g. "##": treat it as an (empty) heading
        return level
    if stripped[level] != " ":
        # ATX headings require a space after the hashes
        return 0
    return level


def extract_titles(markdown_text: str) -> List[str]:
    """Extract every heading line (levels 1-6) from the Markdown text."""
    title_lines = []
    for line in markdown_text.split("\n"):
        if get_heading_level(line) > 0:
            title_lines.append(line.lstrip())
    return title_lines


def extract_title_content(markdown_text: str, title_name: str) -> Optional[str]:
    """Extract every section whose heading matches title_name, each with its parent heading chain."""
    lines = markdown_text.split("\n")
    match_indices = []

    for i, line in enumerate(lines):
        level = get_heading_level(line)
        if level > 0:
            stripped = line.lstrip()
            title_text = stripped[level:].strip()
            if title_text == title_name:
                match_indices.append(i)

    if not match_indices:
        return None

    result_lines = []
    for match_num, idx in enumerate(match_indices):
        if match_num > 0:
            # Separate multiple same-named sections with ---
            result_lines.append("\n---\n")

        target_level = get_heading_level(lines[idx])

        # Walk backwards to collect the chain of parent headings
        parent_titles = []
        current_level = target_level
        for i in range(idx - 1, -1, -1):
            line_level = get_heading_level(lines[i])
            if line_level > 0 and line_level < current_level:
                parent_titles.append(lines[i])
                current_level = line_level
                if current_level == 1:
                    break

        parent_titles.reverse()
        result_lines.extend(parent_titles)

        # Collect the heading itself and everything up to the next heading
        # of the same or a higher level
        result_lines.append(lines[idx])
        for i in range(idx + 1, len(lines)):
            line = lines[i]
            line_level = get_heading_level(line)
            if line_level == 0 or line_level > target_level:
                result_lines.append(line)
            else:
                break

    return "\n".join(result_lines)


def search_markdown(
    content: str, pattern: str, context_lines: int = 0
) -> Optional[str]:
    """Search the Markdown document with a regular expression; return matches with context."""
    try:
        regex = re.compile(pattern)
    except re.error:
        return None

    lines = content.split("\n")

    # Index the non-empty lines; context is counted in non-empty lines only
    non_empty_indices = []
    original_to_non_empty = {}
    for i, line in enumerate(lines):
        if line.strip():
            non_empty_indices.append(i)
            original_to_non_empty[i] = len(non_empty_indices) - 1

    matched_non_empty_indices = []
    for orig_idx in non_empty_indices:
        if regex.search(lines[orig_idx]):
            matched_non_empty_indices.append(original_to_non_empty[orig_idx])

    if not matched_non_empty_indices:
        return None

    # Merge matches whose context windows would overlap
    merged_ranges = []
    current_start = matched_non_empty_indices[0]
    current_end = matched_non_empty_indices[0]

    for idx in matched_non_empty_indices[1:]:
        if idx - current_end <= context_lines * 2:
            current_end = idx
        else:
            merged_ranges.append((current_start, current_end))
            current_start = idx
            current_end = idx
    merged_ranges.append((current_start, current_end))

    results = []
    for start, end in merged_ranges:
        context_start_idx = max(0, start - context_lines)
        context_end_idx = min(len(non_empty_indices) - 1, end + context_lines)

        start_line_idx = non_empty_indices[context_start_idx]
        end_line_idx = non_empty_indices[context_end_idx]

        result_lines = [
            line
            for i, line in enumerate(lines)
            if start_line_idx <= i <= end_line_idx
        ]
        results.append("\n".join(result_lines))

    return "\n---\n".join(results)


def is_url(input_str: str) -> bool:
    """Return True if the input looks like a URL."""
    return input_str.startswith("http://") or input_str.startswith("https://")


def is_html_file(file_path: str) -> bool:
    """Return True if the path looks like an HTML file (extension check only)."""
    ext = file_path.lower()
    return ext.endswith(".html") or ext.endswith(".htm")
skills/lyxy-reader-html/scripts/downloader.py (new file, 263 lines)

#!/usr/bin/env python3
"""URL download module; downloaders are tried in priority order pyppeteer → selenium → httpx → urllib."""

import os
import asyncio
import tempfile
import urllib.request
import urllib.error
from typing import List, Optional, Tuple


# Shared configuration
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
WINDOW_SIZE = "1920,1080"
LANGUAGE_SETTING = "zh-CN,zh"

# Chrome launch arguments (shared by pyppeteer and selenium)
CHROME_ARGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--disable-gpu",
    "--disable-software-rasterizer",
    "--disable-extensions",
    "--disable-background-networking",
    "--disable-default-apps",
    "--disable-sync",
    "--disable-translate",
    "--hide-scrollbars",
    "--metrics-recording-only",
    "--mute-audio",
    "--no-first-run",
    "--safebrowsing-disable-auto-update",
    "--blink-settings=imagesEnabled=false",
    "--disable-plugins",
    "--disable-ipc-flooding-protection",
    "--disable-renderer-backgrounding",
    "--disable-background-timer-throttling",
    "--disable-hang-monitor",
    "--disable-prompt-on-repost",
    "--disable-client-side-phishing-detection",
    "--disable-component-update",
    "--disable-domain-reliability",
    "--disable-features=site-per-process",
    "--disable-features=IsolateOrigins",
    "--disable-features=VizDisplayCompositor",
    "--disable-features=WebRTC",
    f"--window-size={WINDOW_SIZE}",
    f"--lang={LANGUAGE_SETTING}",
    f"--user-agent={USER_AGENT}",
]

# Script that hides automation fingerprints (shared by pyppeteer and selenium)
HIDE_AUTOMATION_SCRIPT = """
() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
    Object.defineProperty(navigator, 'languages', { get: () => ['zh-CN', 'zh'] });
}
"""

# pyppeteer variant that additionally patches the notifications permission query
HIDE_AUTOMATION_SCRIPT_PUPPETEER = """
() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
    Object.defineProperty(navigator, 'languages', { get: () => ['zh-CN', 'zh'] });
    const originalQuery = window.navigator.permissions.query;
    window.navigator.permissions.query = (parameters) => (
        parameters.name === 'notifications' ?
            Promise.resolve({ state: Notification.permission }) :
            originalQuery(parameters)
    );
}
"""


def download_with_pyppeteer(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Download a URL with pyppeteer (supports JS rendering)."""
    try:
        from pyppeteer import launch
    except ImportError:
        return None, "pyppeteer is not installed"

    async def _download():
        pyppeteer_temp_dir = os.path.join(tempfile.gettempdir(), "pyppeteer_home")
        chromium_path = os.environ.get("LYXY_CHROMIUM_BINARY")
        if not chromium_path:
            # No explicit binary: let pyppeteer download Chromium into a temp home
            os.environ["PYPPETEER_HOME"] = pyppeteer_temp_dir
        executable_path = chromium_path if (chromium_path and os.path.exists(chromium_path)) else None

        browser = None
        try:
            browser = await launch(
                headless=True,
                executablePath=executable_path,
                args=CHROME_ARGS
            )
            page = await browser.newPage()

            await page.evaluateOnNewDocument(HIDE_AUTOMATION_SCRIPT_PUPPETEER)

            await page.setJavaScriptEnabled(True)
            await page.goto(url, {"waitUntil": "networkidle2", "timeout": 30000})
            return await page.content()
        finally:
            if browser is not None:
                try:
                    await browser.close()
                except Exception:
                    pass

    try:
        content = asyncio.run(_download())
        if not content or not content.strip():
            return None, "downloaded content is empty"
        return content, None
    except Exception as e:
        return None, f"pyppeteer download failed: {str(e)}"


def download_with_selenium(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Download a URL with selenium (supports JS rendering)."""
    try:
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service
        from selenium.webdriver.chrome.options import Options
        from selenium.webdriver.support.ui import WebDriverWait
    except ImportError:
        return None, "selenium is not installed"

    driver_path = os.environ.get("LYXY_CHROMIUM_DRIVER")
    binary_path = os.environ.get("LYXY_CHROMIUM_BINARY")

    if not driver_path or not os.path.exists(driver_path):
        return None, "LYXY_CHROMIUM_DRIVER is not set or the file does not exist"
    if not binary_path or not os.path.exists(binary_path):
        return None, "LYXY_CHROMIUM_BINARY is not set or the file does not exist"

    chrome_options = Options()
    chrome_options.binary_location = binary_path
    chrome_options.add_argument("--headless=new")
    for arg in CHROME_ARGS:
        chrome_options.add_argument(arg)

    # Hide automation fingerprints
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)

    driver = None
    try:
        import time

        service = Service(driver_path)
        driver = webdriver.Chrome(service=service, options=chrome_options)

        # Hide the webdriver property before any page script runs
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": HIDE_AUTOMATION_SCRIPT
        })

        driver.get(url)

        # Wait for the document to finish loading
        WebDriverWait(driver, 30).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )

        # Then wait until the page source stops growing (dynamic content has settled)
        last_len = 0
        stable_count = 0
        for _ in range(30):
            current_len = len(driver.page_source)
            if current_len == last_len:
                stable_count += 1
                if stable_count >= 2:
                    break
            else:
                stable_count = 0
            last_len = current_len
            time.sleep(0.5)

        content = driver.page_source
        if not content or not content.strip():
            return None, "downloaded content is empty"
        return content, None
    except Exception as e:
        return None, f"selenium download failed: {str(e)}"
    finally:
        if driver is not None:
            try:
                driver.quit()
            except Exception:
                pass


def download_with_httpx(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Download a URL with httpx (lightweight HTTP client)."""
    try:
        import httpx
    except ImportError:
        return None, "httpx is not installed"

    headers = {
        "User-Agent": USER_AGENT
    }

    try:
        with httpx.Client(timeout=30.0) as client:
            response = client.get(url, headers=headers)
            if response.status_code == 200:
                content = response.text
                if not content or not content.strip():
                    return None, "downloaded content is empty"
                return content, None
            return None, f"HTTP {response.status_code}"
    except Exception as e:
        return None, f"httpx download failed: {str(e)}"


def download_with_urllib(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Download a URL with urllib (standard library; last resort)."""
    headers = {
        "User-Agent": USER_AGENT
    }

    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=30) as response:
            if response.status == 200:
                content = response.read().decode("utf-8")
                if not content or not content.strip():
                    return None, "downloaded content is empty"
                return content, None
            return None, f"HTTP {response.status}"
    except Exception as e:
        return None, f"urllib download failed: {str(e)}"


def download_html(url: str) -> Tuple[Optional[str], List[str]]:
    """
    Unified HTML download entry point; tries each downloader in priority order.

    Returns: (content, failures)
    - content: the HTML content on success, None on failure
    - failures: the failure reason recorded for each downloader
    """
    failures = []

    # Try each downloader in priority order
    downloaders = [
        ("pyppeteer", download_with_pyppeteer),
        ("selenium", download_with_selenium),
        ("httpx", download_with_httpx),
        ("urllib", download_with_urllib),
    ]

    for name, func in downloaders:
        content, error = func(url)
        if content is not None:
            return content, failures
        failures.append(f"- {name}: {error}")

    return None, failures
skills/lyxy-reader-html/scripts/html_parser.py (new file, 140 lines)

#!/usr/bin/env python3
"""HTML parsing module; parsers are tried in priority order trafilatura → domscribe → MarkItDown → html2text."""

from typing import List, Optional, Tuple


def parse_with_trafilatura(html_content: str) -> Tuple[Optional[str], Optional[str]]:
    """Parse HTML with trafilatura."""
    try:
        import trafilatura
    except ImportError:
        return None, "trafilatura is not installed"

    try:
        markdown_content = trafilatura.extract(
            html_content,
            output_format="markdown",
            include_formatting=True,
            include_links=True,
            include_images=False,
            include_tables=True,
            favor_recall=True,
            include_comments=True,
        )
        if markdown_content is None:
            return None, "trafilatura returned None"
        if not markdown_content.strip():
            return None, "parsed content is empty"
        return markdown_content, None
    except Exception as e:
        return None, f"trafilatura parse failed: {str(e)}"


def parse_with_domscribe(html_content: str) -> Tuple[Optional[str], Optional[str]]:
    """Parse HTML with domscribe."""
    try:
        from domscribe import html_to_markdown
    except ImportError:
        return None, "domscribe is not installed"

    try:
        options = {
            'extract_main_content': True,
        }
        markdown_content = html_to_markdown(html_content, options)
        if not markdown_content.strip():
            return None, "parsed content is empty"
        return markdown_content, None
    except Exception as e:
        return None, f"domscribe parse failed: {str(e)}"


def parse_with_markitdown(html_content: str, temp_file_path: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]:
    """Parse HTML with MarkItDown."""
    try:
        from markitdown import MarkItDown
    except ImportError:
        return None, "MarkItDown is not installed"

    try:
        import tempfile
        import os

        # MarkItDown converts files, so write the HTML to a temp file if needed
        input_path = temp_file_path
        if not input_path or not os.path.exists(input_path):
            fd, input_path = tempfile.mkstemp(suffix='.html')
            with os.fdopen(fd, 'w', encoding='utf-8') as f:
                f.write(html_content)

        md = MarkItDown()
        result = md.convert(
            input_path,
            heading_style="ATX",
            strip=["img", "script", "style", "noscript"],
        )
        markdown_content = result.text_content

        # Remove the temp file only if we created it ourselves
        if not temp_file_path:
            try:
                os.unlink(input_path)
            except Exception:
                pass

        if not markdown_content.strip():
            return None, "parsed content is empty"
        return markdown_content, None
    except Exception as e:
        return None, f"MarkItDown parse failed: {str(e)}"


def parse_with_html2text(html_content: str) -> Tuple[Optional[str], Optional[str]]:
    """Parse HTML with html2text (last resort)."""
    try:
        import html2text
    except ImportError:
        return None, "html2text is not installed"

    try:
        converter = html2text.HTML2Text()
        converter.ignore_emphasis = False
        converter.ignore_links = False
        converter.ignore_images = True
        converter.body_width = 0  # do not hard-wrap lines
        converter.skip_internal_links = True
        markdown_content = converter.handle(html_content)
        if not markdown_content.strip():
            return None, "parsed content is empty"
        return markdown_content, None
    except Exception as e:
        return None, f"html2text parse failed: {str(e)}"


def parse_html(html_content: str, temp_file_path: Optional[str] = None) -> Tuple[Optional[str], List[str]]:
    """
    Unified HTML parsing entry point; tries each parser in priority order.

    Returns: (content, failures)
    - content: the Markdown content on success, None on failure
    - failures: the failure reason recorded for each parser
    """
    failures = []

    # Try each parser in priority order
    parsers = [
        ("trafilatura", lambda c: parse_with_trafilatura(c)),
        ("domscribe", lambda c: parse_with_domscribe(c)),
        ("MarkItDown", lambda c: parse_with_markitdown(c, temp_file_path)),
        ("html2text", lambda c: parse_with_html2text(c)),
    ]

    for name, func in parsers:
        content, error = func(html_content)
        if content is not None:
            return content, failures
        failures.append(f"- {name}: {error}")

    return None, failures
skills/lyxy-reader-html/scripts/parser.py (new file, 128 lines)

#!/usr/bin/env python3
"""Command-line interface for the HTML parser. Supports URLs and HTML files."""

import argparse
import logging
import os
import sys
import warnings

# Silence third-party progress bars and logging; only the parse result is printed
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
os.environ["TQDM_DISABLE"] = "1"
warnings.filterwarnings("ignore")
logging.disable(logging.WARNING)

# Imported after the suppression setup above (intentional ordering)
import common
import downloader
import html_parser


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Parse a URL or HTML file into Markdown"
    )

    parser.add_argument("input", help="URL or path to an HTML file")

    parser.add_argument(
        "-n",
        "--context",
        type=int,
        default=2,
        help="used with -s; number of non-empty context lines around each match",
    )

    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "-c", "--count", action="store_true", help="print the total character count of the parsed Markdown"
    )
    group.add_argument(
        "-l", "--lines", action="store_true", help="print the total line count of the parsed Markdown"
    )
    group.add_argument(
        "-t",
        "--titles",
        action="store_true",
        help="print the heading lines (levels 1-6) of the parsed Markdown",
    )
    group.add_argument(
        "-tc",
        "--title-content",
        help="heading name (without the # signs); print that heading and its nested content",
    )
    group.add_argument(
        "-s",
        "--search",
        help="search the document with a regular expression; matches are separated by ---",
    )

    args = parser.parse_args()

    # Determine the input type
    html_content = None
    temp_file_path = None

    if common.is_url(args.input):
        # URL mode
        html_content, download_failures = downloader.download_html(args.input)
        if html_content is None:
            print("All download methods failed:")
            for failure in download_failures:
                print(failure)
            sys.exit(1)
    else:
        # HTML file mode
        if not os.path.exists(args.input):
            print(f"Error: file does not exist: {args.input}")
            sys.exit(1)
        if not common.is_html_file(args.input):
            print(f"Error: not a valid HTML file: {args.input}")
            sys.exit(1)
        with open(args.input, "r", encoding="utf-8") as f:
            html_content = f.read()
        temp_file_path = args.input

    # Pre-clean the HTML
    cleaned_html = common.clean_html_content(html_content)

    # Parse the HTML
    markdown_content, parse_failures = html_parser.parse_html(cleaned_html, temp_file_path)
    if markdown_content is None:
        print("All parsing methods failed:")
        for failure in parse_failures:
            print(failure)
        sys.exit(1)

    # Post-process the Markdown
    markdown_content = common.remove_markdown_images(markdown_content)
    markdown_content = common.normalize_markdown_whitespace(markdown_content)

    # Output according to the selected option
    if args.count:
        # Character count, excluding newlines
        print(len(markdown_content.replace("\n", "")))
    elif args.lines:
        print(len(markdown_content.split("\n")))
    elif args.titles:
        for title in common.extract_titles(markdown_content):
            print(title)
    elif args.title_content:
        title_content = common.extract_title_content(markdown_content, args.title_content)
        if title_content is None:
            print(f"Error: heading '{args.title_content}' not found")
            sys.exit(1)
        print(title_content, end="")
    elif args.search:
        search_result = common.search_markdown(markdown_content, args.search, args.context)
        if search_result is None:
            print(f"Error: invalid regular expression or no match found: '{args.search}'")
            sys.exit(1)
        print(search_result, end="")
    else:
        print(markdown_content, end="")


if __name__ == "__main__":
    main()