lyxy-document/scripts/readers/html/html2text.py
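lanyuanxiaoyao 2b81dd49fe refactor: unify the HTML Reader parse signature to take a file path argument
Change every HTML Parser function signature to accept a file path instead of an HTML string,
consistent with the other Readers (PDF, DOCX, etc.).

Main changes:
- Update the PARSERS list: remove the lambda expressions and pass function references directly
- Manage temporary files (UTF-8 encoded) centrally in HtmlReader.parse()
- Give each Parser its own temporary file copy, cleaned up right after use
- Remove the download_and_parse() method and merge its logic into parse()
- Update the related tests to pass file paths directly

Affected Parsers: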
- trafilatura.parse(html_content) -> parse(file_path)
- domscribe.parse(html_content) -> parse(file_path)
- markitdown.parse(html_content, temp_file_path) -> parse(file_path)
- html2text.parse(html_content) -> parse(file_path)
2026-03-09 00:05:23 +08:00
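
The temporary-file handling described in the commit message can be sketched as follows. This is a minimal illustration, not the actual HtmlReader.parse() code, which is not shown on this page: the helper name `run_parsers`, the `Parser` alias, and the "first success wins" loop are all assumptions. The only things taken from the commit are that each parser receives a file path, gets its own UTF-8 temporary copy, and that the copy is removed right after use.

```python
import os
import tempfile
from typing import Callable, List, Optional, Tuple

ParseResult = Tuple[Optional[str], Optional[str]]
Parser = Callable[[str], ParseResult]  # parse(file_path) -> (content, error)


def run_parsers(html_content: str, parsers: List[Parser]) -> ParseResult:
    """Hypothetical helper: try each parser on its own UTF-8 temp file."""
    last_error: Optional[str] = None
    for parser in parsers:
        # Each parser gets an independent temporary copy of the HTML.
        fd, path = tempfile.mkstemp(suffix=".html")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.write(html_content)
            content, error = parser(path)
        finally:
            # Clean up as soon as this parser is done, even if it raised.
            os.remove(path)
        if content is not None:
            return content, None
        last_error = error
    return None, last_error
```

Because every parser shares the `parse(file_path)` signature, a list such as PARSERS can hold plain function references, with no lambda needed to adapt the arguments.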

34 lines
1.1 KiB
Python
"""Parse HTML with html2text (fallback option)."""

from typing import Optional, Tuple


def parse(file_path: str) -> Tuple[Optional[str], Optional[str]]:
    """Parse an HTML file with html2text (fallback option).

    Returns a (markdown_content, error_message) pair; exactly one is None.
    """
    # Import lazily so a missing dependency is reported, not raised.
    try:
        import html2text
    except ImportError:
        return None, "html2text library is not installed"

    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            html_content = f.read()
    except FileNotFoundError:
        return None, f"File not found: {file_path}"
    except Exception as e:
        return None, f"Failed to read file: {e}"

    try:
        converter = html2text.HTML2Text()
        converter.ignore_emphasis = False
        converter.ignore_links = False
        converter.ignore_images = True
        converter.body_width = 0  # 0 disables line wrapping
        converter.skip_internal_links = True
        markdown_content = converter.handle(html_content)
        if not markdown_content.strip():
            return None, "Parsed content is empty"
        return markdown_content, None
    except Exception as e:
        return None, f"html2text parsing failed: {e}"
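
To see what the converter settings above do, here is a small standalone sketch. The `to_markdown` helper and the sample HTML are made up for illustration; only the `HTML2Text` attributes are taken from the file. It assumes html2text is installed and returns None otherwise, mirroring the graceful fallback in parse().

```python
from typing import Optional

try:
    import html2text
except ImportError:  # degrade gracefully, as parse() does
    html2text = None


def to_markdown(html: str) -> Optional[str]:
    """Hypothetical helper applying the same settings as parse()."""
    if html2text is None:
        return None
    converter = html2text.HTML2Text()
    converter.ignore_emphasis = False  # keep bold/italic markers
    converter.ignore_links = False     # keep hyperlinks
    converter.ignore_images = True     # drop <img> tags entirely
    converter.body_width = 0           # do not hard-wrap long lines
    converter.skip_internal_links = True
    return converter.handle(html)


sample = (
    '<p>Hello <b>world</b>, see '
    '<a href="https://example.com">the docs</a>.'
    '<img src="logo.png" alt="logo"></p>'
)
print(to_markdown(sample))
```

With these settings the bold text and the link should survive as Markdown while the image is omitted, which suits a text-only fallback parser.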