lyxy-document/tests/test_readers/test_html/test_domscribe_html.py
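lanyuanxiaoyao 2b81dd49fe refactor: unify the HTML Reader parse signature to take a file path argument
Change the function signature of every HTML Parser from accepting an HTML string to accepting a file path,
consistent with the other Readers (PDF, DOCX, etc.).

Main changes:
- Update the PARSERS list: remove the lambda expressions and pass function references directly
- Manage temporary files (UTF-8 encoded) centrally in HtmlReader.parse()
- Each Parser works on its own temporary file copy, which is cleaned up after use
- Remove the download_and_parse() method; its logic is merged into parse()
- Update the related tests to pass file paths directly

Affected Parsers: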
- trafilatura.parse(html_content) -> parse(file_path)
- domscribe.parse(html_content) -> parse(file_path)
- markitdown.parse(html_content, temp_file_path) -> parse(file_path)
- html2text.parse(html_content) -> parse(file_path)
2026-03-09 00:05:23 +08:00
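
The temporary-file handling described in the commit message (one UTF-8 copy per Parser, removed after use) can be sketched roughly as follows. This is an illustration only, not the repository's code: `parse_with_parsers`, the `Parser` alias, and the fallback error string are hypothetical names, and the only thing taken from the source is that each parser accepts a file path and returns a `(content, error)` pair.

```python
import os
import tempfile
from typing import Callable, List, Optional, Tuple

# Assumed contract, inferred from the tests: parse(file_path) -> (content, error)
ParseResult = Tuple[Optional[str], Optional[str]]
Parser = Callable[[str], ParseResult]


def parse_with_parsers(html_content: str, parsers: List[Parser]) -> ParseResult:
    """Try each parser on its own UTF-8 temp-file copy; return the first success."""
    errors = []
    for parser in parsers:
        # A fresh file per parser, so one parser cannot affect the next one's input.
        fd, path = tempfile.mkstemp(suffix=".html")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(html_content)
            content, error = parser(path)
            if content is not None:
                return content, None
            errors.append(error)
        finally:
            # Clean up even when the parser raises or returns early.
            if os.path.exists(path):
                os.remove(path)
    joined = "; ".join(str(e) for e in errors if e)
    return None, joined or "no parser succeeded"
```

Because every parser shares the same `parse(file_path)` shape, the parser list can hold plain function references, which matches the removal of the lambda wrappers mentioned above.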

37 lines
1.4 KiB
Python

"""Tests for the parsing behavior of the Domscribe HTML Reader."""

import pytest
from scripts.readers.html import domscribe


class TestDomscribeHtmlReaderParse:
    """Tests for the parse method of the Domscribe HTML Reader."""

    def test_normal_file(self, temp_html):
        """Parse a normal HTML file."""
        file_path = temp_html(content="<h1>标题</h1><p>段落内容</p>")
        content, error = domscribe.parse(file_path)
        if content is not None:
            assert "标题" in content or "段落" in content

    def test_file_not_exists(self, tmp_path):
        """Handle a file that does not exist."""
        non_existent_path = str(tmp_path / "non_existent.html")
        content, error = domscribe.parse(non_existent_path)
        assert content is None
        # If the library is not installed, None is also returned, but with a different error message
        assert error is not None

    def test_empty_file(self, temp_html):
        """Parse an HTML file with an empty body."""
        file_path = temp_html(content="<html><body></body></html>")
        content, error = domscribe.parse(file_path)
        assert content is None or content.strip() == ""

    def test_special_chars(self, temp_html):
        """Handle special characters (CJK text, emoji, symbols)."""
        file_path = temp_html(content="<p>中文测试 😀 ©®</p>")
        content, error = domscribe.parse(file_path)
        if content is not None:
            assert "中文" in content or "测试" in content
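
The tests above depend on a `temp_html` fixture that this file does not define; it presumably lives in a shared `conftest.py`. A minimal sketch of what such a factory fixture might look like, assuming it writes the given content as UTF-8 and returns the path as a string (the helper `make_html_file` and the default file name are hypothetical):

```python
from pathlib import Path

import pytest


def make_html_file(directory: Path, content: str, name: str = "test.html") -> str:
    """Write `content` to `directory/name` as UTF-8 and return the path as a string."""
    path = directory / name
    path.write_text(content, encoding="utf-8")
    return str(path)


@pytest.fixture
def temp_html(tmp_path):
    """Factory fixture (assumed shape): temp_html(content=...) -> file path."""

    def _create(content: str = "<html></html>", name: str = "test.html") -> str:
        # tmp_path is pytest's per-test temporary directory, so no manual cleanup is needed.
        return make_html_file(tmp_path, content, name)

    return _create
```

Returning a factory rather than a fixed path lets each test supply its own HTML, which is how `temp_html(content=...)` is called in the tests above.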