lyxy-document/scripts/readers/html/trafilatura.py at master

lanyuanxiaoyao 2b81dd49fe refactor: unify the HTML Reader parse signature to take a file path parameter

Change the function signature of every HTML Parser from accepting an HTML string to accepting a file path,
consistent with the other Readers (PDF, DOCX, etc.).

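Main changes:
- Update the PARSERS list: remove the lambda expressions and pass function references directly
- Manage temporary files (UTF-8 encoded) centrally in HtmlReader.parse()
- Give each Parser its own temporary file copy, cleaned up right after use
- Remove the download_and_parse() method and merge its logic into parse()
- Update the related tests to pass file paths directly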
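
The temporary-file handling described in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the repository's code: `parse_html`, `parser_always_fails`, and `parser_reads_file` are hypothetical names, and taking the HTML as a string is an assumption made only to keep the example self-contained.

```python
import os
import tempfile
from typing import Callable, List, Optional


# Hypothetical stand-in parsers: each takes a file path and returns text or None.
def parser_always_fails(file_path: str) -> Optional[str]:
    return None


def parser_reads_file(file_path: str) -> Optional[str]:
    with open(file_path, encoding="utf-8") as f:
        return f.read().strip() or None


# Function references are listed directly, with no lambda wrappers.
PARSERS: List[Callable[[str], Optional[str]]] = [
    parser_always_fails,
    parser_reads_file,
]


def parse_html(html_content: str) -> Optional[str]:
    """Try each parser in order, giving each its own UTF-8 temporary copy."""
    for parser in PARSERS:
        fd, tmp_path = tempfile.mkstemp(suffix=".html")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.write(html_content)
            result = parser(tmp_path)
        except Exception:
            # A failing parser should not stop the remaining ones (assumed policy).
            result = None
        finally:
            # Clean up this parser's copy as soon as it is done.
            os.remove(tmp_path)
        if result:
            return result
    return None
```

Giving every parser a fresh copy means a parser that modifies or locks its input cannot affect the next one, at the cost of one extra write per attempt.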

Affected Parsers:
- trafilatura.parse(html_content) -> parse(file_path)
- domscribe.parse(html_content) -> parse(file_path)
- markitdown.parse(html_content, temp_file_path) -> parse(file_path)
- html2text.parse(html_content) -> parse(file_path)
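
To illustrate the new `parse(file_path)` shape, here is a hedged sketch of a parser that reads the file itself instead of receiving an HTML string. It uses the standard library's `html.parser` as a stand-in for the real extraction library, so `_TextCollector` and the extraction logic are assumptions, not what trafilatura.py actually does.

```python
import os
import tempfile
from html.parser import HTMLParser
from typing import Optional


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects the non-empty text nodes of a document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def parse(file_path: str) -> Optional[str]:
    """Read the HTML file as UTF-8 and return its text, or None if empty."""
    with open(file_path, encoding="utf-8") as f:
        html_content = f.read()
    collector = _TextCollector()
    collector.feed(html_content)
    collector.close()
    return "\n".join(collector.chunks) or None


# Usage: write a small document to disk, then hand the parser its path.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", encoding="utf-8", delete=False
) as tmp:
    tmp.write("<h1>Title</h1><p>Body</p>")
try:
    extracted = parse(tmp.name)
finally:
    os.remove(tmp.name)
```

With this signature the caller only needs a path, which matches how file-based readers are typically invoked.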

2026-03-09 00:05:23 +08:00

1.2 KiB
