Create lyxy-reader-html skill

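- Add skill: lyxy-reader-html, for parsing HTML files and web page content fetched from URLs
- Support URL download (fallback in priority order: pyppeteer → selenium → httpx → urllib)
- Support HTML parsing (fallback in priority order: trafilatura → domscribe → MarkItDown → html2text)
- Support query features: full-text extraction, word count, line count, title extraction, section extraction, regex search
- Add spec: html-document-parsing
- Archive change: create-lyxy-reader-html-skill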
2026-03-08 02:02:03 +08:00
parent 0bd9ec8a36
commit 6b4fcf2647
16 changed files with 1827 additions and 3 deletions
--- a/openspec/changes/archive/2026-03-08-create-lyxy-reader-html-skill/tasks.md
+++ b/openspec/changes/archive/2026-03-08-create-lyxy-reader-html-skill/tasks.md
@@ -0,0 +1,58 @@
+## 1. Initialize the Skill directory structure
+
+- [x] 1.1 Create the `skills/lyxy-reader-html/` directory
+- [x] 1.2 Create the `skills/lyxy-reader-html/scripts/` subdirectory
+- [x] 1.3 Create the `skills/lyxy-reader-html/references/` subdirectory
+
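+## 2. Create the main SKILL.md document
+
+- [x] 2.1 Write the YAML front matter (name, description, compatibility)
+- [x] 2.2 Write the Purpose section
+- [x] 2.3 Write the When to Use section (including trigger words)
+- [x] 2.4 Write the Quick Reference section (parameter table)
+- [x] 2.5 Write the Workflow section
+- [x] 2.6 Write the References section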
+
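+## 3. Implement the common.py shared module
+
+- [x] 3.1 Implement the HTML cleanup function `clean_html_content()`
+- [x] 3.2 Implement the Markdown image removal function `remove_markdown_images()`
+- [x] 3.3 Implement the Markdown blank-line normalization function `normalize_markdown_whitespace()`
+- [x] 3.4 Implement the heading level detection function `get_heading_level()`
+- [x] 3.5 Implement the title extraction function `extract_titles()`
+- [x] 3.6 Implement the section content extraction function `extract_title_content()`
+- [x] 3.7 Implement the regex search function `search_markdown()`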
+
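+## 4. Implement the downloader.py URL download module
+
+- [x] 4.1 Implement the `download_with_pyppeteer()` function
+- [x] 4.2 Implement the `download_with_selenium()` function
+- [x] 4.3 Implement the `download_with_httpx()` function
+- [x] 4.4 Implement the `download_with_urllib()` function
+- [x] 4.5 Implement the unified `download_html()` entry function, which tries each downloader in priority order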
+
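+## 5. Implement the html_parser.py HTML parsing module
+
+- [x] 5.1 Implement the `parse_with_trafilatura()` function
+- [x] 5.2 Implement the `parse_with_domscribe()` function
+- [x] 5.3 Implement the `parse_with_markitdown()` function
+- [x] 5.4 Implement the `parse_with_html2text()` function
+- [x] 5.5 Implement the unified `parse_html()` entry function, which tries each parser in priority order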
+
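+## 6. Implement the parser.py command-line entry point
+
+- [x] 6.1 Implement command-line argument parsing (argparse)
+- [x] 6.2 Implement input source detection (URL / HTML file)
+- [x] 6.3 Implement the URL download flow (when needed)
+- [x] 6.4 Implement the HTML cleanup flow
+- [x] 6.5 Implement the HTML parsing flow
+- [x] 6.6 Implement Markdown post-processing (image removal, blank-line normalization)
+- [x] 6.7 Implement each query mode (full text, word count, line count, titles, section, search)
+- [x] 6.8 Implement error handling and exit codes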
+
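+## 7. Create reference documentation
+
+- [x] 7.1 Create `scripts/README.md`, the detailed usage documentation
+- [x] 7.2 Create `references/examples.md`, the usage examples
+- [x] 7.3 Create `references/parsers.md`, the parser descriptions
+- [x] 7.4 Create `references/error-handling.md`, the error-handling guide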
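
For readers without the rest of the commit at hand, here is a rough sketch of what several of the section 3 helpers in `common.py` might look like. The function names come from the task list above; the signatures, regular expressions, and return values are assumptions made for illustration, not the committed implementation. The sketch handles ATX (`#`) headings only and does not skip fenced code blocks.

```python
import re
from typing import List, Optional

IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # inline images: ![alt](src)
HEADING_RE = re.compile(r"^(#{1,6})\s+\S")      # ATX headings: "# Title" .. "###### Title"


def remove_markdown_images(markdown: str) -> str:
    """Drop inline image syntax, leaving surrounding text untouched."""
    return IMAGE_RE.sub("", markdown)


def normalize_markdown_whitespace(markdown: str) -> str:
    """Collapse runs of three or more newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", markdown)


def get_heading_level(line: str) -> int:
    """Return 1-6 for an ATX heading line, 0 for anything else."""
    match = HEADING_RE.match(line)
    return len(match.group(1)) if match else 0


def extract_titles(markdown: str) -> List[str]:
    """Return every heading line, in document order."""
    return [line.strip() for line in markdown.splitlines() if get_heading_level(line) > 0]


def extract_title_content(markdown: str, title: str) -> Optional[str]:
    """Return the section under `title`, up to the next heading of equal or higher rank."""
    lines = markdown.splitlines()
    for start, line in enumerate(lines):
        level = get_heading_level(line)
        if level and line.lstrip("#").strip() == title:
            end = len(lines)
            for i in range(start + 1, len(lines)):
                if 0 < get_heading_level(lines[i]) <= level:
                    end = i
                    break
            return "\n".join(lines[start:end])
    return None
```

Ending a section at the next heading of equal or higher rank keeps nested subsections inside their parent, which is usually what a section query expects.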