lyxy-document/scripts/readers/html/downloader/httpx.py
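lanyuanxiaoyao 47038475d4 refactor: split the HTML downloader into a subpackage
Split scripts/readers/html/downloader.py (263 lines) into a downloader/ subpackage so that each downloader is maintained independently:

- Create the downloader/ subpackage, containing __init__.py, common.py, and 4 downloader modules
- common.py centralizes the shared configuration (USER_AGENT, CHROME_ARGS, etc.)
- All downloaders share one interface: download(url: str) -> Tuple[Optional[str], Optional[str]]
- Define a DOWNLOADERS list in __init__.py for explicit registration, following the parser pattern
- Update the import in html/__init__.py to: from .downloader import download_html
- Add complete type annotations to improve maintainability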
2026-03-09 01:13:42 +08:00
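
The explicit registration described in the commit message can be sketched as follows. This is a minimal illustration, not the repository's actual `__init__.py`: the two stand-in downloaders, the `(name, function)` shape of the `DOWNLOADERS` entries, the fallback order, and the signature and return value of `download_html` are all assumptions made for the example. Only the shared `download(url)` interface and the existence of a `DOWNLOADERS` list come from the source.

```python
from typing import Callable, List, Optional, Tuple

# The shared interface from the commit message: (content, error).
DownloadResult = Tuple[Optional[str], Optional[str]]
Downloader = Callable[[str], DownloadResult]


def failing_download(url: str) -> DownloadResult:
    """Hypothetical stand-in for a downloader whose backend is unavailable."""
    return None, "backend not installed"


def static_download(url: str) -> DownloadResult:
    """Hypothetical stand-in that always succeeds with a fixed page."""
    return f"<html><body>{url}</body></html>", None


# Explicit registration; list order is assumed to be the fallback priority.
DOWNLOADERS: List[Tuple[str, Downloader]] = [
    ("failing", failing_download),
    ("static", static_download),
]


def download_html(
    url: str,
    downloaders: Optional[List[Tuple[str, Downloader]]] = None,
) -> Tuple[Optional[str], List[str]]:
    """Try each downloader in order; return the first content plus collected errors."""
    errors: List[str] = []
    for name, fn in (DOWNLOADERS if downloaders is None else downloaders):
        content, error = fn(url)
        if content is not None:
            return content, errors
        errors.append(f"{name}: {error}")
    return None, errors
```

Because every downloader returns the same `(content, error)` pair instead of raising, the loop needs no per-backend exception handling, and adding a backend is a one-line change to the list.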

39 lines
1.1 KiB
Python

"""Download a URL with httpx (a lightweight HTTP client)."""

from typing import Optional, Tuple

from .common import USER_AGENT


def download(url: str) -> Tuple[Optional[str], Optional[str]]:
    """
    Download a URL with httpx (a lightweight HTTP client).

    Args:
        url: Target URL.

    Returns:
        (content, error): content is the HTML text on success and None on failure;
        error is None on success and an error message on failure.
    """
    # Import lazily so a missing optional dependency is reported, not raised.
    try:
        import httpx
    except ImportError:
        return None, "httpx is not installed"

    headers = {
        "User-Agent": USER_AGENT
    }

    try:
        with httpx.Client(timeout=30.0) as client:
            response = client.get(url, headers=headers)
            if response.status_code == 200:
                content = response.text
                # Treat an empty or whitespace-only body as a failure.
                if not content or not content.strip():
                    return None, "downloaded content is empty"
                return content, None
            return None, f"HTTP {response.status_code}"
    except Exception as e:
        return None, f"httpx download failed: {e}"
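
The response handling in `download` reduces to a small decision table, which can be sketched as a pure function that is easy to test without network access. The helper name `interpret_response` is hypothetical and does not exist in the source, and the error strings are English renderings of the module's messages.

```python
from typing import Optional, Tuple


def interpret_response(status_code: int, text: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: map a status code and body to (content, error)."""
    # Anything other than 200 is reported by its status code.
    if status_code != 200:
        return None, f"HTTP {status_code}"
    # A 200 response with an empty or whitespace-only body is also a failure.
    if not text or not text.strip():
        return None, "downloaded content is empty"
    return text, None
```

Exactly one element of the returned pair is non-None in every branch, so callers can branch on `content is None` alone. Note that httpx does not follow redirects by default, so with the client configured as in `download`, a 3xx response falls into the first branch.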