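lanyuanxiaoyao 6b4fcf2647 Create the lyxy-reader-html skill

- Add skill: lyxy-reader-html, for parsing HTML files and web pages fetched by URL
- Support URL download (fallback in priority order: pyppeteer → selenium → httpx → urllib)
- Support HTML parsing (fallback in priority order: trafilatura → domscribe → MarkItDown → html2text)
- Support query features: full-text extraction, word count, line count, heading extraction, section extraction, regex search
- Add spec: html-document-parsing
- Archive change: create-lyxy-reader-html-skill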

2026-03-08 02:02:03 +08:00

Examples

URL input - extract the full document content

# Using uv (recommended)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with httpx --with beautifulsoup4 scripts/parser.py https://example.com

# Using Python directly
python scripts/parser.py https://example.com

HTML file input - extract the full document content

# Using uv (recommended)
uv run --with trafilatura --with domscribe --with markitdown --with html2text --with beautifulsoup4 scripts/parser.py page.html

# Using Python directly
python scripts/parser.py page.html
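
The `--with` packages above correspond to the parsers that, according to the commit message, are tried in priority order (trafilatura → domscribe → MarkItDown → html2text). A minimal sketch of such a fallback chain follows. `first_successful` and the three stand-in parsers are hypothetical names used only for illustration; this is not the actual content of `scripts/parser.py`, and no real parsing library is called.

```python
import re
from typing import Callable, List, Optional, Tuple

# A parser takes raw HTML and returns extracted text (or None / "" on failure).
Parser = Callable[[str], Optional[str]]


def first_successful(
    parsers: List[Tuple[str, Parser]], html: str
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, parser) pair in order; return the first non-empty result."""
    for name, parse in parsers:
        try:
            result = parse(html)
        except Exception:
            # Missing dependency or parse error: fall through to the next parser.
            continue
        if result and result.strip():
            return name, result
    return None, None


# Hypothetical stand-ins for real parsers (assumptions, for demonstration only).
def missing_library(html: str) -> Optional[str]:
    raise ImportError("library not installed")


def empty_result(html: str) -> Optional[str]:
    return ""


def strip_tags(html: str) -> Optional[str]:
    # Crude tag removal; a real parser would do far more than this.
    return re.sub(r"<[^>]+>", "", html).strip()


chain = [("first", missing_library), ("second", empty_result), ("third", strip_tags)]

if __name__ == "__main__":
    print(first_successful(chain, "<p>Hello</p>"))
```

Treating an empty result the same as an exception lets a lower-priority parser take over when a higher-priority one runs but extracts nothing.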

Get the document word count

uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -c https://example.com

Get the document line count

uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -l https://example.com

Extract all headings

uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -t https://example.com
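
As a rough illustration of what heading extraction might involve, the sketch below assumes the parsed document is Markdown with ATX-style (`#`) headings. That assumption, and the function name `extract_headings`, are not taken from the source. The sketch ignores Setext headings and does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple

# One to six '#' characters, at least one space or tab, then the heading text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)")


def extract_headings(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs for every ATX heading, in document order."""
    headings = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


if __name__ == "__main__":
    sample = "# Home\nintro\n## About Us\ntext\n### Team\n"
    for level, title in extract_headings(sample):
        print("  " * (level - 1) + title)
```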

Extract a specific section

uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -tc "About Us" https://example.com
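
One plausible way to extract a section by its heading title is to collect lines from the matching heading up to the next heading of the same or a higher level. The sketch below assumes Markdown output with ATX headings and an exact title match. Both are assumptions, and `extract_section` is a hypothetical name rather than the script's actual function.

```python
import re
from typing import Optional

HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)")


def extract_section(markdown: str, title: str) -> Optional[str]:
    """Return the section headed by `title`, including nested subsections."""
    collected = []
    level = None  # Level of the matched heading once it has been found.
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if level is None:
            if match and match.group(2) == title:
                level = len(match.group(1))
                collected.append(line)
        else:
            # A heading at the same or a higher level ends the section.
            if match and len(match.group(1)) <= level:
                break
            collected.append(line)
    return "\n".join(collected).strip() if collected else None


if __name__ == "__main__":
    doc = "# Home\nintro\n## About Us\nWe build tools.\n### Team\nTwo people.\n## Contact\nmail\n"
    print(extract_section(doc, "About Us"))
```

Deeper headings stay inside the section, so a subsection such as `### Team` is returned together with its parent.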

Search for a keyword

uv run --with trafilatura --with html2text --with beautifulsoup4 scripts/parser.py -s "keyword" -n 3 https://example.com
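
The commit message describes the search as a regex search. Assuming `-n` sets the number of context lines shown around each match (an interpretation not confirmed by the source), the behavior could be sketched as follows. `search_with_context` is a hypothetical name.

```python
import re
from typing import List


def search_with_context(text: str, pattern: str, context: int = 2) -> List[str]:
    """Return one snippet per matching line, with `context` lines on each side."""
    lines = text.splitlines()
    regex = re.compile(pattern)
    snippets = []
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            snippets.append("\n".join(lines[start:end]))
    return snippets


if __name__ == "__main__":
    sample = "a\nb\nkeyword here\nc\nd"
    for snippet in search_with_context(sample, r"key\w+", context=1):
        print(snippet)
        print("---")
```

Overlapping snippets are not merged in this sketch; two matches on adjacent lines yield two snippets that share lines.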

Fall back to direct Python execution

Use only when the lyxy-runner-python skill is not available:

python3 scripts/parser.py https://example.com
