Compare commits
2 commits: e67ec24dfd...c90e1c98be
| Author | SHA1 | Date |
|---|---|---|
| | c90e1c98be | |
| | 229f17bfee | |
22
README.md
@@ -10,17 +10,18 @@

 - Run scripts and tests with uv; do not use the host Python
 - Dependency management: load dependencies on demand with `uv run --with`
-- Quick advice: use the `-a/--advice` flag to see the command to run
+- Self-launch: the script detects its dependencies and runs itself with the correct uv command

 ## Project architecture

 ```
 scripts/
-├── lyxy_document_reader.py # CLI entry point
+├── lyxy_document_reader.py # CLI entry point (self-launching)
+├── bootstrap.py # Actual execution module
 ├── config.py # Configuration (including the DEPENDENCIES table)
 ├── core/ # Core modules
 │ ├── parser.py # Parse dispatch
-│ ├── advice_generator.py # Generates --advice execution advice
+│ ├── advice_generator.py # Dependency detection and config generation
 │ ├── markdown.py # Markdown utilities
 │ └── exceptions.py # Exception definitions
 ├── readers/ # Format readers
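@@ -94,9 +95,9 @@ DEPENDENCIES = {
 }
 ```

-### How --advice output is generated
+### Self-launch mechanism

-The `--advice` flag identifies the file type from the extension, detects the current platform, reads the matching entry from `config.DEPENDENCIES`, and generates the `uv run --with` and `pip install` commands.
+The entry script identifies the file type from the extension, detects the current platform, reads the matching entry from `config.DEPENDENCIES`, then builds and runs the correct `uv run --with` command.

 ## Quick start
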
@@ -105,8 +106,8 @@ DEPENDENCIES = {
 First, verify that the project runs:

 ```bash
-# Test the --advice feature (no extra dependencies needed)
-uv run python scripts/lyxy_document_reader.py test.pdf --advice
+# Test parsing (dependencies are detected and the script is launched automatically)
+python scripts/lyxy_document_reader.py "https://example.com"
 ```

 ### Run the basic tests
@@ -115,7 +116,7 @@ uv run python scripts/lyxy_document_reader.py test.pdf --advice
 # Run the CLI tests (verifies the project's basic functionality)
 uv run \
 --with pytest \
-pytest tests/test_cli/test_main.py::TestCLIAdviceOption -v
+pytest tests/test_cli/ -v
 ```

 ## Development guide
@@ -242,11 +243,6 @@ uv run \
 --with pytest \
 pytest tests/test_cli/test_main.py

-# Run only the --advice tests (no extra dependencies needed)
-uv run \
---with pytest \
-pytest tests/test_cli/test_main.py::TestCLIAdviceOption
-
 # Run a specific test class or method
 uv run \
 --with pytest \
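Reviewer sketch for the README change: the self-launch flow it describes (identify the file type from the extension, detect the platform, look up `config.DEPENDENCIES`, build a `uv run --with` command) could look roughly like the code below. This is a minimal illustration under stated assumptions, not the project's code: `SAMPLE_DEPENDENCIES`, its keys and package names, the `"default"` fallback key, and `build_uv_command` are all hypothetical, since `config.py` is not part of this diff.

```python
"""Hypothetical sketch: turn a dependency-table entry into a `uv run` argv."""
from pathlib import Path

# Assumed shape: file type -> platform key ("default" as fallback) ->
# {"python": version or None, "dependencies": [...]}.
# Package names are placeholders, not the project's real requirements.
SAMPLE_DEPENDENCIES = {
    "pdf": {
        "default": {"python": None, "dependencies": ["pkg-pdf"]},
        "Darwin-x86_64": {"python": "3.12", "dependencies": ["pkg-pdf==1.0"]},
    },
    "docx": {
        "default": {"python": None, "dependencies": ["pkg-docx"]},
    },
}


def build_uv_command(input_path: str, platform_id: str,
                     script: str = "scripts/bootstrap.py") -> list[str]:
    """Build the argv list for `uv run`; raises KeyError for unknown file types."""
    file_type = Path(input_path).suffix.lower().lstrip(".")
    per_platform = SAMPLE_DEPENDENCIES[file_type]
    # Prefer a platform-specific entry, otherwise use the default one.
    entry = per_platform.get(platform_id, per_platform["default"])

    argv = ["uv", "run"]
    if entry["python"]:
        argv += ["--python", entry["python"]]
    for dep in entry["dependencies"]:
        argv += ["--with", dep]
    argv += [script, input_path]
    return argv


print(build_uv_command("report.pdf", "Darwin-x86_64"))
print(build_uv_command("notes.docx", "Linux-x86_64"))
```

Returning an argv list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting concerns.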
37
SKILL.md
@@ -11,16 +11,17 @@ compatibility: Requires Python 3.11+. Prefer the lyxy-runner-python skill,

 ### Choosing an execution path (in priority order)
 1. **lyxy-runner-python skill (preferred)** - manages dependencies automatically
-2. **uv run --with** - loads dependencies on demand
-3. **Host Python + pip install** - install dependencies manually
+2. **python scripts/lyxy_document_reader.py** - self-launching; detects dependencies automatically
+3. **uv run --with** - specify dependencies manually
+4. **Host Python + pip install** - install dependencies manually

-### Step 1: get execution advice
+### Recommended usage
 ```bash
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py --advice <file path or URL>
+# Run directly (dependencies are detected and the script is launched automatically)
+python scripts/lyxy_document_reader.py <file path or URL>
 ```
-This prints the exact command to run, including the required dependency configuration.

-*Alternatively: `python scripts/lyxy_document_reader.py --advice <file path or URL>`*
+The script detects the file type and the current platform automatically and runs itself with the correct uv command.

 ## Purpose

@@ -50,7 +51,6 @@ PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py --advi

 | Parameter | Description |
 |------|------|
-| `-a/--advice` | Show execution advice only (**this command must be run first**) |
 | (none) | Output the full Markdown |
 | `-c/--count` | Character count |
 | `-l/--lines` | Line count |
@@ -62,33 +62,28 @@ PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py --advi
 ## Parameter usage examples

 ```bash
-# Get execution advice
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py --advice document.docx
-
-# Read the full text
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx
+# Read the full text (dependencies detected automatically)
+python scripts/lyxy_document_reader.py document.docx

 # Count characters
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx -c
+python scripts/lyxy_document_reader.py document.docx -c

 # Extract headings
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx -t
+python scripts/lyxy_document_reader.py document.docx -t

 # Extract a specific section
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx -tc "Chapter 3"
+python scripts/lyxy_document_reader.py document.docx -tc "Chapter 3"

 # Search the content
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx -s "keyword"
+python scripts/lyxy_document_reader.py document.docx -s "keyword"

 # Regex search
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx -s "\d{4}-\d{2}-\d{2}"
+python scripts/lyxy_document_reader.py document.docx -s "\d{4}-\d{2}-\d{2}"

 # Set the number of context lines for search results
-PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py document.docx -s "keyword" -n 5
+python scripts/lyxy_document_reader.py document.docx -s "keyword" -n 5
 ```

-*A plain python command also works: `python scripts/lyxy_document_reader.py ...`*
-
 ## Error handling

 | Error | Cause | Fix |
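@@ -98,4 +93,4 @@ PYTHONPATH=. uv run --with pyarmor python scripts/lyxy_document_reader.py docume
 | 所有解析方法均失败 (all parsing methods failed) | Every parser failed | Check whether the file is corrupted |
 | 错误: 无效的正则表达式 (error: invalid regular expression) | Regex syntax error | Check the regex syntax |
 | 错误: 未找到匹配 (error: no match found) | The search returned nothing | Check the search term or regex |
-| ModuleNotFoundError | Missing dependency | Use --advice to get the correct dependency command |
+| ModuleNotFoundError | Missing dependency | The script detects and installs dependencies automatically |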
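Reviewer sketch for the regex example in SKILL.md (`-s "\d{4}-\d{2}-\d{2}"`): one way to check such a pattern before passing it to the CLI is to run it with Python's `re` module over a small sample. The sample text and the helper name `matching_lines` are illustrative only, and the tool's own search may differ in details such as context handling.

```python
import re


def matching_lines(text: str, pattern: str) -> list[str]:
    """Return the lines of `text` that contain a match for `pattern`."""
    compiled = re.compile(pattern)  # raises re.error on invalid syntax
    return [line for line in text.splitlines() if compiled.search(line)]


sample = "# Changelog\nReleased 2024-03-15\nNo date here\nPatched 2024-04-01"
print(matching_lines(sample, r"\d{4}-\d{2}-\d{2}"))
```

An invalid pattern such as `[invalid` fails at `re.compile`, which corresponds to the "invalid regular expression" row in the error table.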
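@@ -1,11 +1,11 @@
 ## Purpose

-CLI execution-advice generation: returns uv and python commands based on the file type, so that an AI can get accurate execution advice quickly without reading through the docs.
+CLI self-launch mechanism: detects the file type, platform, and dependencies automatically and runs the script with the correct uv command.

 ## Requirements

 ### Requirement: Dependency configuration structure
-The dependency configuration must include both the Python version requirement and the dependency package list, organized by file type and platform.
+The dependency configuration must include both the Python version requirement and the dependency package list, organized by file type and platform, for internal use by the self-launch logic.

 #### Scenario: The configuration structure contains python and dependencies
 - **WHEN** `config.DEPENDENCIES` is accessed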
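@@ -19,17 +19,8 @@ CLI execution-advice generation: returns uv and python commands based on the file type,

 ---

-### Requirement: The CLI supports the --advice flag
-The command-line tool must support the `-a/--advice` flag; when it is given, the tool performs no actual parsing and only prints execution advice.
-
-#### Scenario: The user passes --advice
-- **WHEN** the user runs `scripts/lyxy_document_reader.py --advice <input_path>`
-- **THEN** the tool prints execution advice and does not parse the file content
-
----
-
 ### Requirement: Lightweight file-type detection
-The `--advice` flag must reuse the Reader instances' supports method to identify the file type, without opening the file.
+Self-launch must reuse the Reader instances' supports method to identify the file type, without opening the file.

 #### Scenario: Reuse Reader instances
 - **WHEN** the file type is being detected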
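@@ -69,72 +60,70 @@ CLI execution-advice generation: returns uv and python commands based on the file type,

 #### Scenario: File existence is not checked
 - **WHEN** the input path points to a file that does not exist
-- **THEN** advice is still returned based on reader.supports(), without raising an error
+- **THEN** the type is still identified based on reader.supports(), without raising an error

 ---

 ### Requirement: Platform detection
-The current platform must be detected and a matching command returned.
+The current platform must be detected and the matching dependency configuration selected.

 #### Scenario: Platform format
 - **WHEN** the tool runs
 - **THEN** the returned format is `{system}-{machine}`, for example `Darwin-arm64`, `Linux-x86_64`, `Windows-AMD64`

-#### Scenario: Special command for PDF on macOS x86_64
+#### Scenario: Special configuration for PDF on macOS x86_64
 - **WHEN** the platform is `Darwin-x86_64` and the file type is PDF
-- **THEN** a command containing `--python 3.12` and version-pinned dependencies is returned
+- **THEN** a configuration containing `--python 3.12` and version-pinned dependencies is used

 ---
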
-### Requirement: Output a uv command
-A command in the `uv run --with ...` format must be output.
+### Requirement: Self-launch detection
+The script must automatically detect the file type, the current platform, and whether uv is available; if uv is available, it launches bootstrap.py with the correct uv command.
+
+#### Scenario: Detect the file type
+- **WHEN** the script starts
+- **THEN** the Reader's supports() method is reused to identify the file type
+- **AND** the file is not opened; only lightweight detection is performed
+
+#### Scenario: Detect the platform
+- **WHEN** the script starts
+- **THEN** the current platform is detected, in the format `{system}-{machine}`
+- **AND** the correct dependency configuration is selected for that platform
+
+#### Scenario: Detect whether uv is available
+- **WHEN** the script is about to self-launch
+- **THEN** `shutil.which("uv")` is used to check whether uv is on PATH
+- **AND** if uv is unavailable, the script falls back to running bootstrap.py directly
+
+---
+
+### Requirement: Self-launch execution
+The script must use `subprocess.run()` to start a child process that launches bootstrap.py with the correct uv command.

 #### Scenario: Generate the uv command
-- **WHEN** the file type has been detected
-- **THEN** the output format is: `uv run [--python X.Y] --with <dep1> --with <dep2> ... scripts/lyxy_document_reader.py <input_path>`
+- **WHEN** the script determines that it needs to self-launch
+- **THEN** the dependency configuration is looked up by file type and platform
+- **AND** the command `uv run [--python X.Y] --with <dep1> --with <dep2> ... scripts/bootstrap.py <input_path>` is generated
+- **AND** the target script is bootstrap.py, not lyxy_document_reader.py
+
+#### Scenario: Self-launch sets environment variables
+- **WHEN** `subprocess.run()` is executed to self-launch
+- **THEN** `PYTHONPATH=.` must be set
+- **AND** `LYXY_IN_UV` does not need to be set (self-launch calls bootstrap.py directly)
+- **AND** the exit code must be passed back to the parent process
+
+#### Scenario: Silent self-launch
+- **WHEN** the script self-launches
+- **THEN** no extra messages are printed
+- **AND** the normal Markdown output is not disturbed

 ---

-### Requirement: Output a python command
-A command that uses python directly, plus a pip install command, must be output.
+### Requirement: Fallback execution
+When uv is unavailable, the script must fall back to importing and running bootstrap.py directly.

-#### Scenario: Generate the python command
-- **WHEN** the file type has been detected
-- **THEN** the python command is output: `python scripts/lyxy_document_reader.py <input_path>`
-- **AND** the pip install command is output: `pip install <dep1> <dep2> ...`
-
----
-
-### Requirement: Output format
-The output must include the file type, input path, platform (if needed), uv command, python command, and pip install command.
-
-#### Scenario: Output format on ordinary platforms
-- **WHEN** the platform has no special configuration
-- **THEN** the output format is (literal labels: 文件类型 = file type, 输入路径 = input path, 命令 = command):
-```
-文件类型: <type>
-输入路径: <input>
-
-[uv 命令]
-<uv_command>
-
-[python 命令]
-python scripts/lyxy_document_reader.py <input>
-pip install <deps>
-```
-
-#### Scenario: Output format on special platforms
-- **WHEN** the platform has a special configuration
-- **THEN** the output format is (literal labels as above; 平台 = platform):
-```
-文件类型: <type>
-输入路径: <input>
-平台: <system-machine>
-
-[uv 命令]
-<uv_command>
-
-[python 命令]
-python scripts/lyxy_document_reader.py <input>
-pip install <deps>
-```
+#### Scenario: Fallback when uv is unavailable
+- **WHEN** uv is not on PATH
+- **THEN** the script imports the bootstrap module directly
+- **AND** calls bootstrap.run_normal() to execute
+- **AND** if a dependency is missing, the ordinary `ModuleNotFoundError` is reported
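Reviewer sketch for the platform and uv checks required by the spec: the `{system}-{machine}` identifier and the `shutil.which("uv")` probe can be produced with the standard library alone. The function names `get_platform_id` and `uv_available` are hypothetical; the project's own helper is `get_platform` in `core/advice_generator.py`, whose body is not shown in this diff.

```python
import platform
import shutil


def get_platform_id() -> str:
    """Return '{system}-{machine}', e.g. 'Linux-x86_64' or 'Darwin-arm64'."""
    return f"{platform.system()}-{platform.machine()}"


def uv_available() -> bool:
    """True when a `uv` executable can be found on PATH."""
    return shutil.which("uv") is not None


print(get_platform_id(), uv_available())
```

Both calls are cheap and touch neither the input file nor the network, which fits the spec's requirement that detection stays lightweight.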
111
scripts/bootstrap.py
Normal file
@@ -0,0 +1,111 @@
+#!/usr/bin/env python3
+"""Actual execution module of the document parser; holds the business logic."""
+
+import argparse
+import logging
+import os
+import sys
+import warnings
+from pathlib import Path
+
+# Add the scripts/ directory to sys.path so the script can be run from any location
+scripts_dir = Path(__file__).resolve().parent
+if str(scripts_dir) not in sys.path:
+    sys.path.append(str(scripts_dir))
+
+# Suppress third-party progress bars and logs; keep only the parse result output
+os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
+os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
+os.environ["TQDM_DISABLE"] = "1"
+warnings.filterwarnings("ignore")
+
+# Configure logging to emit ERROR level only
+logging.basicConfig(level=logging.ERROR, format='%(levelname)s: %(message)s')
+
+# Set the log level of third-party libraries
+logging.getLogger('docling').setLevel(logging.ERROR)
+logging.getLogger('unstructured').setLevel(logging.ERROR)
+
+from core import (
+    FileDetectionError,
+    ReaderNotFoundError,
+    output_result,
+    parse_input,
+    process_content,
+)
+from readers import READERS
+
+
+def run_normal(args) -> None:
+    """Normal execution mode: parse the file and output the result."""
+    # Instantiate all readers
+    readers = [ReaderCls() for ReaderCls in READERS]
+
+    try:
+        content, failures = parse_input(args.input_path, readers)
+    except FileDetectionError as e:
+        print(f"错误: {e}")  # "错误" means "Error"
+        sys.exit(1)
+    except ReaderNotFoundError as e:
+        print(f"错误: {e}")
+        sys.exit(1)
+
+    if content is None:
+        print("所有解析方法均失败:")  # "All parsing methods failed:"
+        for failure in failures:
+            print(failure)
+        sys.exit(1)
+
+    # Process the content
+    content = process_content(content)
+
+    # Output the result
+    output_result(content, args)
+
+
+def main() -> None:
+    """Main function: parse command-line arguments and run."""
+    parser = argparse.ArgumentParser(
+        description="Parse DOCX, XLS, XLSX, PPTX, PDF, HTML files or URLs into Markdown"
+    )
+
+    parser.add_argument("input_path", help="DOCX, XLS, XLSX, PPTX, PDF, HTML file or URL")
+
+    parser.add_argument(
+        "-n",
+        "--context",
+        type=int,
+        default=2,
+        help="Used with -s: number of context lines before and after each match (blank lines not counted)",
+    )
+
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument(
+        "-c", "--count", action="store_true", help="Return the total character count of the parsed Markdown document"
+    )
+    group.add_argument(
+        "-l", "--lines", action="store_true", help="Return the total line count of the parsed Markdown document"
+    )
+    group.add_argument(
+        "-t",
+        "--titles",
+        action="store_true",
+        help="Return the heading lines (levels 1-6) of the parsed Markdown document",
+    )
+    group.add_argument(
+        "-tc",
+        "--title-content",
+        help="Heading name (without the # marks); output that heading and the content beneath it",
+    )
+    group.add_argument(
+        "-s",
+        "--search",
+        help="Search the document with a regular expression; return all matches (separated by ---)",
+    )
+
+    args = parser.parse_args()
+    run_normal(args)
+
+
+if __name__ == "__main__":
+    main()
@@ -1,56 +1,31 @@
 #!/usr/bin/env python3
-"""Command-line interface module of the document parser. Supports DOCX, XLS, XLSX, PPTX, PDF, HTML, and URLs."""
+"""Document parser entry point: environment detection and self-launch."""

 import argparse
-import logging
 import os
+import shutil
+import subprocess
 import sys
-import warnings
 from pathlib import Path

-# Add the scripts/ directory to sys.path so the script can be run from any location
+# Add the scripts/ directory to sys.path
 scripts_dir = Path(__file__).resolve().parent
 if str(scripts_dir) not in sys.path:
     sys.path.append(str(scripts_dir))

-# Suppress third-party progress bars and logs; keep only the parse result output
+# Suppress third-party logging
 os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
 os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
 os.environ["TQDM_DISABLE"] = "1"
-warnings.filterwarnings("ignore")
-
-# Configure logging to emit ERROR level only
-logging.basicConfig(level=logging.ERROR, format='%(levelname)s: %(message)s')
-
-# Set the log level of third-party libraries
-logging.getLogger('docling').setLevel(logging.ERROR)
-logging.getLogger('unstructured').setLevel(logging.ERROR)
-
-from core import (
-    FileDetectionError,
-    ReaderNotFoundError,
-    output_result,
-    parse_input,
-    process_content,
-    generate_advice,
-)
-from readers import READERS
-
-
-def main() -> None:
+def main():
+    """Main function: environment detection and dispatch."""
+    # Parse command-line arguments (lightweight; only the arguments needed here)
     parser = argparse.ArgumentParser(
         description="Parse DOCX, XLS, XLSX, PPTX, PDF, HTML files or URLs into Markdown"
     )

     parser.add_argument("input_path", help="DOCX, XLS, XLSX, PPTX, PDF, HTML file or URL")

-    parser.add_argument(
-        "-a",
-        "--advice",
-        action="store_true",
-        help="Show execution advice only; do not actually parse the file",
-    )
-
     parser.add_argument(
         "-n",
         "--context",
@@ -58,7 +33,6 @@ def main() -> None:
         default=2,
         help="Used with -s: number of context lines before and after each match (blank lines not counted)",
     )
-
     group = parser.add_mutually_exclusive_group()
     group.add_argument(
         "-c", "--count", action="store_true", help="Return the total character count of the parsed Markdown document"
@@ -85,39 +59,64 @@ def main() -> None:

     args = parser.parse_args()

-    # Instantiate all readers
-    readers = [ReaderCls() for ReaderCls in READERS]
+    # Check whether uv is available
+    uv_path = shutil.which("uv")

-    # --advice mode: show advice only, do not parse
-    if args.advice:
-        advice = generate_advice(args.input_path, readers, "scripts/lyxy_document_reader.py")
-        if advice:
-            print(advice)
-        else:
-            print(f"错误: 无法识别文件类型: {args.input_path}")  # "Error: unrecognized file type"
-            sys.exit(1)
+    if not uv_path:
+        # uv is unavailable: fall back to running bootstrap.py directly
+        import bootstrap
+        bootstrap.run_normal(args)
         return

-    try:
-        content, failures = parse_input(args.input_path, readers)
-    except FileDetectionError as e:
-        print(f"错误: {e}")  # "错误" means "Error"
-        sys.exit(1)
-    except ReaderNotFoundError as e:
-        print(f"错误: {e}")
-        sys.exit(1)
+    # uv is available: self-launch is needed
+    # Import the dependency-detection module
+    from config import DEPENDENCIES
+    from core.advice_generator import (
+        detect_file_type_light,
+        get_platform,
+        get_dependencies,
+    )
+    from readers import READERS

-    if content is None:
-        print("所有解析方法均失败:")  # "All parsing methods failed:"
-        for failure in failures:
-            print(failure)
-        sys.exit(1)
+    # Detect the file type
+    readers = [ReaderCls() for ReaderCls in READERS]
+    reader_cls = detect_file_type_light(args.input_path, readers)

-    # Process the content
-    content = process_content(content)
+    if not reader_cls:
+        # Unrecognized file type: fall back and let the normal path report the error
+        import bootstrap
+        bootstrap.run_normal(args)
+        return

-    # Output the result
-    output_result(content, args)
+    # Get the platform and the dependency configuration
+    platform_id = get_platform()
+    python_version, dependencies = get_dependencies(reader_cls, platform_id)
+
+    # Build the uv command argument list
+    uv_args = ["uv", "run"]
+
+    if python_version:
+        uv_args.extend(["--python", python_version])
+
+    # Always add the pyarmor dependency (the obfuscated script needs it)
+    uv_args.extend(["--with", "pyarmor"])
+
+    for dep in dependencies:
+        uv_args.extend(["--with", dep])
+
+    # The target script is bootstrap.py
+    uv_args.append("scripts/bootstrap.py")
+
+    # Forward all command-line arguments
+    uv_args.extend(sys.argv[1:])
+
+    # Set environment variables
+    env = os.environ.copy()
+    env["PYTHONPATH"] = "."
+
+    # Self-launch: use subprocess instead of execvpe (Windows compatibility)
+    result = subprocess.run(uv_args, env=env)
+    sys.exit(result.returncode)


 if __name__ == "__main__":
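Reviewer sketch for the last step of the entry script: copying the environment, setting `PYTHONPATH`, running a child process with `subprocess.run`, and handing its return code back. To keep the sketch runnable without uv, the child here is the current Python interpreter rather than `uv run ...`; the helper name `run_child` and the inline child programs are hypothetical.

```python
import os
import subprocess
import sys


def run_child(argv: list[str], extra_env: dict[str, str]) -> int:
    """Run argv with os.environ plus extra_env; return the child's exit code."""
    env = os.environ.copy()   # never mutate the parent's environment
    env.update(extra_env)
    result = subprocess.run(argv, env=env)
    return result.returncode


# Child exits 0 only if it sees PYTHONPATH="." in its environment.
check_env = [
    sys.executable, "-c",
    "import os, sys; sys.exit(0 if os.environ.get('PYTHONPATH') == '.' else 3)",
]
print(run_child(check_env, {"PYTHONPATH": "."}))

# A failing child: its exit code comes back unchanged.
failing = [sys.executable, "-c", "import sys; sys.exit(7)"]
print(run_child(failing, {}))
```

In the entry script the returned code is passed straight to `sys.exit`, so callers of `lyxy_document_reader.py` see the same status that `bootstrap.py` produced.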
@@ -51,10 +51,7 @@ def temp_docx(tmp_path):
         str: path to the temporary file
     """
     def _create_docx(paragraphs=None, headings=None, table_data=None, list_items=None):
-        try:
-            from docx import Document
-        except ImportError:
-            pytest.skip("python-docx is not installed")
+        from docx import Document

         doc = Document()

@@ -99,13 +96,10 @@ def temp_pdf(tmp_path):
         str: path to the temporary file
     """
     def _create_pdf(text=None, lines=None):
-        try:
-            from reportlab.pdfgen import canvas
-            from reportlab.lib.pagesizes import letter
-            from reportlab.pdfbase import pdfmetrics
-            from reportlab.pdfbase.ttfonts import TTFont
-        except ImportError:
-            pytest.skip("reportlab is not installed")
+        from reportlab.pdfgen import canvas
+        from reportlab.lib.pagesizes import letter
+        from reportlab.pdfbase import pdfmetrics
+        from reportlab.pdfbase.ttfonts import TTFont

         file_path = tmp_path / "test.pdf"
         c = canvas.Canvas(str(file_path), pagesize=letter)
@@ -176,10 +170,7 @@ def temp_pptx(tmp_path):
         str: path to the temporary file
     """
     def _create_pptx(slides=None):
-        try:
-            from pptx import Presentation
-        except ImportError:
-            pytest.skip("python-pptx is not installed")
+        from pptx import Presentation

         prs = Presentation()

@@ -209,10 +200,7 @@ def temp_xlsx(tmp_path):
         str: path to the temporary file
     """
     def _create_xlsx(data=None):
-        try:
-            import pandas as pd
-        except ImportError:
-            pytest.skip("pandas is not installed")
+        import pandas as pd

         file_path = tmp_path / "test.xlsx"

@@ -29,7 +29,9 @@ def cli_runner():
         if str(scripts_dir) not in sys.path:
             sys.path.insert(0, str(scripts_dir))

-        from lyxy_document_reader import main
+        # Call bootstrap.main() directly instead of lyxy_document_reader.main(),
+        # because lyxy_document_reader spawns a subprocess whose output cannot be captured
+        from bootstrap import main

         # Save the original sys.argv and sys.exit
         original_argv = sys.argv
@@ -46,7 +48,7 @@ def cli_runner():

         try:
             # Set the command-line arguments
-            sys.argv = ['lyxy_document_reader'] + args
+            sys.argv = ['bootstrap'] + args
             sys.exit = mock_exit

             # Capture output
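Reviewer sketch for the `cli_runner` change: the fixture calls `main()` in-process so that stdout and the exit code can be captured, which is not possible once the entry script hands off to a subprocess. The version below illustrates the same idea with one deliberate difference: it catches `SystemExit` instead of replacing `sys.exit` with a mock as the fixture does. `run_cli` and `demo_main` are hypothetical stand-ins, not code from the repository.

```python
import contextlib
import io
import sys


def run_cli(main, args: list[str], prog: str = "bootstrap"):
    """Call main() in-process; return (stdout, stderr, exit_code)."""
    out, err = io.StringIO(), io.StringIO()
    original_argv = sys.argv
    exit_code = 0
    sys.argv = [prog] + args
    try:
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            main()
    except SystemExit as exc:
        code = exc.code
        # sys.exit() / sys.exit(None) mean success; non-int payloads mean failure.
        exit_code = code if isinstance(code, int) else (0 if code is None else 1)
    finally:
        sys.argv = original_argv  # always restore the caller's argv
    return out.getvalue(), err.getvalue(), exit_code


def demo_main() -> None:
    """Stand-in for bootstrap.main(): echo the first argument or fail."""
    if len(sys.argv) < 2:
        print("error: missing input", file=sys.stderr)
        sys.exit(2)
    print(f"parsed {sys.argv[1]}")


print(run_cli(demo_main, ["test.docx"]))
print(run_cli(demo_main, []))
```

Catching `SystemExit` keeps the global `sys.exit` untouched, at the cost of not seeing calls that are swallowed by a broad `except` inside the code under test.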
@@ -4,48 +4,6 @@ import pytest
 import os


-class TestCLIAdviceOption:
-    """Tests for the CLI --advice flag."""
-
-    def test_advice_option_pdf(self, cli_runner):
-        """Test the -a/--advice option with a PDF file."""
-        stdout, stderr, exit_code = cli_runner(["test.pdf", "-a"])
-
-        assert exit_code == 0
-        assert "文件类型: PDF" in stdout
-        assert "[uv 命令]" in stdout
-        assert "[python 命令]" in stdout
-
-    def test_advice_option_docx(self, cli_runner):
-        """Test the --advice option with a DOCX file."""
-        stdout, stderr, exit_code = cli_runner(["test.docx", "--advice"])
-
-        assert exit_code == 0
-        assert "文件类型: DOCX" in stdout
-
-    def test_advice_option_url(self, cli_runner):
-        """Test the --advice option with a URL."""
-        stdout, stderr, exit_code = cli_runner(["https://example.com", "--advice"])
-
-        assert exit_code == 0
-        assert "文件类型: HTML" in stdout
-
-    def test_advice_option_unknown(self, cli_runner):
-        """Test the --advice option with an unknown file type."""
-        stdout, stderr, exit_code = cli_runner(["test.xyz", "--advice"])
-
-        assert exit_code != 0
-        output = stdout + stderr
-        assert "无法识别" in output or "错误" in output
-
-    def test_advice_option_xls(self, cli_runner):
-        """Test the --advice option with an XLS file."""
-        stdout, stderr, exit_code = cli_runner(["test.xls", "--advice"])
-
-        assert exit_code == 0
-        assert "文件类型: XLS" in stdout
-
-
 class TestCLIDefaultOutput:
     """Tests for the CLI default output."""

@@ -131,7 +131,7 @@ class TestGeneratePythonCommand:
             script_path="scripts/lyxy_document_reader.py"
         )
         assert python_cmd == "python scripts/lyxy_document_reader.py input.pdf"
-        assert pip_cmd == "pip install pkg1 pkg2"
+        assert pip_cmd == "pip install pyarmor pkg1 pkg2"


 class TestFormatAdvice:
233
tests/test_core/test_markdown_extra.py
Normal file
@@ -0,0 +1,233 @@
|
||||
"""测试 markdown 模块的高级功能(extract_title_content, search_markdown)。"""
|
||||
|
||||
import pytest
|
||||
|
||||
from core.markdown import extract_title_content, search_markdown
|
||||
|
||||
|
||||
class TestExtractTitleContent:
|
||||
"""测试 extract_title_content 函数。"""
|
||||
|
||||
def test_extract_simple_title(self):
|
||||
"""测试提取简单标题。"""
|
||||
markdown = """# 目标标题
|
||||
|
||||
这是标题下的内容。
|
||||
第二段内容。"""
|
||||
|
||||
result = extract_title_content(markdown, "目标标题")
|
||||
|
||||
assert result is not None
|
||||
assert "# 目标标题" in result
|
||||
assert "这是标题下的内容" in result
|
||||
|
||||
def test_extract_with_subtitles(self):
|
||||
"""测试提取包含子标题的内容。"""
|
||||
markdown = """# 目标标题
|
||||
|
||||
这是标题下的内容。
|
||||
|
||||
## 子标题
|
||||
|
||||
子标题下的内容。
|
||||
|
||||
### 孙子标题
|
||||
|
||||
更深层的内容。"""
|
||||
|
||||
result = extract_title_content(markdown, "目标标题")
|
||||
|
||||
assert result is not None
|
||||
assert "# 目标标题" in result
|
||||
assert "## 子标题" in result
|
||||
assert "### 孙子标题" in result
|
||||
|
||||
def test_extract_stop_at_sibling_title(self):
|
||||
"""测试在同级标题处停止。"""
|
||||
markdown = """# 目标标题
|
||||
|
||||
目标内容。
|
||||
|
||||
# 另一个标题
|
||||
|
||||
另一个内容。"""
|
||||
|
||||
result = extract_title_content(markdown, "目标标题")
|
||||
|
||||
assert result is not None
|
||||
assert "# 目标标题" in result
|
||||
assert "目标内容" in result
|
||||
assert "# 另一个标题" not in result
|
||||
|
||||
def test_extract_with_parent_titles(self):
|
||||
"""测试包含父级标题。"""
|
||||
markdown = """# 父级标题
|
||||
|
||||
父级内容。
|
||||
|
||||
## 目标标题
|
||||
|
||||
目标内容。
|
||||
|
||||
### 子标题
|
||||
|
||||
子内容。"""
|
||||
|
||||
result = extract_title_content(markdown, "目标标题")
|
||||
|
||||
assert result is not None
|
||||
assert "# 父级标题" in result
|
||||
assert "## 目标标题" in result
|
||||
assert "### 子标题" in result
|
||||
|
||||
def test_extract_multiple_matches(self):
|
||||
"""测试多个匹配标题的情况。"""
|
||||
markdown = """# 第一章
|
||||
|
||||
## 目标标题
|
||||
|
||||
第一章的目标内容。
|
||||
|
||||
# 第二章
|
||||
|
||||
## 目标标题
|
||||
|
||||
第二章的目标内容。"""
|
||||
|
||||
result = extract_title_content(markdown, "目标标题")
|
||||
|
||||
assert result is not None
|
||||
assert "第一章的目标内容" in result
|
||||
assert "第二章的目标内容" in result
|
||||
assert "---" in result
|
||||
|
||||
def test_title_not_found(self):
|
||||
"""测试标题不存在的情况。"""
|
||||
markdown = "# 其他标题\n内容"
|
||||
|
||||
result = extract_title_content(markdown, "不存在的标题")
|
||||
|
||||
assert result is None
|
||||
|
||||
def test_deep_nested_title(self):
|
||||
"""测试深层嵌套标题。"""
|
||||
markdown = """# H1
|
||||
|
||||
## H2
|
||||
|
||||
### H3
|
||||
|
||||
#### 目标标题
|
||||
|
||||
目标内容。"""
|
||||
|
||||
result = extract_title_content(markdown, "目标标题")
|
||||
|
||||
assert result is not None
|
||||
assert "# H1" in result
|
||||
assert "## H2" in result
|
||||
assert "### H3" in result
|
||||
assert "#### 目标标题" in result
|
||||
|
||||
|
||||
class TestSearchMarkdown:
|
||||
"""测试 search_markdown 函数。"""
|
||||
|
||||
def test_search_simple_pattern(self):
|
||||
"""测试简单搜索模式。"""
|
||||
content = """第一行
|
||||
第二行
|
||||
包含关键词的行
|
||||
第四行"""
|
||||
|
||||
result = search_markdown(content, "关键词", context_lines=0)
|
||||
|
||||
assert result is not None
|
||||
assert "关键词" in result
|
||||
|
||||
def test_search_with_context(self):
|
||||
"""测试带上下文的搜索。"""
|
||||
content = """行1
|
||||
行2
|
||||
关键词行
|
||||
行4
|
||||
行5"""
|
||||
|
||||
result = search_markdown(content, "关键词", context_lines=1)
|
||||
|
||||
assert result is not None
|
||||
assert "关键词" in result
|
||||
assert "行2" in result or "行4" in result
|
||||
|
||||
def test_search_no_match(self):
|
||||
"""测试无匹配的情况。"""
|
||||
content = "普通内容"
|
||||
|
||||
result = search_markdown(content, "不存在的内容", context_lines=0)
|
||||
|
||||
assert result is None
|
||||
|
||||
def test_search_empty_content(self):
|
||||
"""测试空内容。"""
|
||||
result = search_markdown("", "关键词", context_lines=0)
|
||||
|
||||
assert result is None
|
||||
|
||||
def test_search_invalid_regex(self):
|
||||
"""测试无效正则表达式。"""
|
||||
content = "内容"
|
||||
|
||||
result = search_markdown(content, "[invalid", context_lines=0)
|
||||
|
||||
assert result is None
|
||||
|
||||
def test_search_negative_context(self):
|
||||
"""测试负的上下文行数。"""
|
||||
content = "内容"
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
search_markdown(content, "内容", context_lines=-1)
|
||||
|
||||
def test_search_multiple_matches_merged(self):
|
||||
"""测试多个匹配合并。"""
|
||||
content = """行1
|
||||
行2
|
||||
匹配1
|
||||
行4
|
||||
行5
|
||||
匹配2
|
||||
行7
|
||||
行8"""
|
||||
|
||||
result = search_markdown(content, "匹配", context_lines=1)
|
||||
|
||||
assert result is not None
|
||||
assert "匹配1" in result
|
||||
assert "匹配2" in result
|
||||
|
||||
def test_search_ignore_blank_lines_in_context(self):
|
||||
"""测试上下文计算忽略空行。"""
|
||||
content = """行1
|
||||
|
||||
行2
|
||||
关键词
|
||||
|
||||
行4
|
||||
行5"""
|
||||
|
||||
result = search_markdown(content, "关键词", context_lines=1)
|
||||
|
||||
assert result is not None
|
||||
assert "关键词" in result
|
||||
|
||||
def test_search_with_regex(self):
|
||||
"""测试使用正则表达式搜索。"""
|
||||
content = """apple
|
||||
banana
|
||||
cherry
|
||||
date"""
|
||||
|
||||
result = search_markdown(content, "^b", context_lines=0)
|
||||
|
||||
assert result is not None
|
||||
assert "banana" in result
|
||||
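Reviewer sketch of the behaviour these `search_markdown` tests exercise: `None` for no match, empty input, or an invalid regex; `ValueError` for a negative context; context counted in non-blank lines; nearby matches merged. The implementation in `core/markdown.py` is not part of this diff, so the function below (`search_with_context`, a hypothetical name) is only one way to satisfy those expectations, and the `---` separator is taken from the `-s` help text rather than from the tests.

```python
import re
from typing import Optional


def search_with_context(text: str, pattern: str,
                        context_lines: int = 2) -> Optional[str]:
    """Return matching lines with context, or None when nothing matches."""
    if context_lines < 0:
        raise ValueError("context_lines must be >= 0")
    try:
        regex = re.compile(pattern)
    except re.error:
        return None  # invalid pattern is treated as "no result"

    lines = text.splitlines()
    # Context is counted over non-blank lines only.
    non_blank = [i for i, line in enumerate(lines) if line.strip()]
    hits = [pos for pos, i in enumerate(non_blank) if regex.search(lines[i])]
    if not hits:
        return None

    # Windows are [start, end] positions in non_blank; merge overlapping
    # or directly adjacent windows into one block.
    windows: list[list[int]] = []
    for pos in hits:
        start = max(0, pos - context_lines)
        end = min(len(non_blank) - 1, pos + context_lines)
        if windows and start <= windows[-1][1] + 1:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    blocks = ["\n".join(lines[non_blank[s]:non_blank[e] + 1]) for s, e in windows]
    return "\n---\n".join(blocks)


print(search_with_context("line 1\nline 2\nkeyword here\nline 4\nline 5", "keyword", 1))
```

Matching line by line means an anchor such as `^b` applies to the start of each line, which is what `test_search_with_regex` relies on.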
256
tests/test_core/test_parser.py
Normal file
@@ -0,0 +1,256 @@
|
||||
"""测试 parser 模块的解析调度功能。"""
|
||||
|
||||
import pytest
|
||||
from unittest.mock import patch, MagicMock
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
from core.parser import parse_input, process_content, output_result
|
||||
from core.exceptions import FileDetectionError, ReaderNotFoundError
|
||||
|
||||
|
||||
class MockReader:
|
||||
"""模拟 Reader 类用于测试。"""
|
||||
|
||||
def __init__(self, supports=True, content=None, failures=None):
|
||||
self._supports = supports
|
||||
self._content = content
|
||||
self._failures = failures or []
|
||||
|
||||
def supports(self, file_path):
|
||||
return self._supports
|
||||
|
||||
def parse(self, file_path):
|
||||
return self._content, self._failures
|
||||
|
||||
|
||||
class TestParseInput:
|
||||
"""测试 parse_input 函数。"""
|
||||
|
||||
def test_parse_input_success(self):
|
||||
"""测试成功解析的情况。"""
|
||||
reader = MockReader(supports=True, content="测试内容", failures=[])
|
||||
readers = [reader]
|
||||
|
||||
content, failures = parse_input("test.docx", readers)
|
||||
|
||||
assert content == "测试内容"
|
||||
assert failures == []
|
||||
|
||||
def test_parse_input_reader_not_found(self):
|
||||
"""测试没有找到支持的 reader。"""
|
||||
reader = MockReader(supports=False)
|
||||
readers = [reader]
|
||||
|
||||
with pytest.raises(ReaderNotFoundError):
|
||||
parse_input("test.docx", readers)
|
||||
|
||||
def test_parse_input_empty_path(self):
|
||||
"""测试空输入路径。"""
|
||||
readers = [MockReader()]
|
||||
|
||||
with pytest.raises(FileDetectionError):
|
||||
parse_input("", readers)
|
||||
|
||||
def test_parse_input_multiple_readers_first_succeeds(self):
|
||||
"""测试多个 reader,第一个成功。"""
|
||||
reader1 = MockReader(supports=True, content="第一个结果", failures=[])
|
||||
reader2 = MockReader(supports=True, content="第二个结果", failures=[])
|
||||
readers = [reader1, reader2]
|
||||
|
||||
content, failures = parse_input("test.docx", readers)
|
||||
|
||||
assert content == "第一个结果"
|
||||
|
||||
def test_parse_input_with_failures(self):
|
||||
"""测试解析返回失败信息。"""
|
||||
reader = MockReader(
|
||||
supports=True,
|
||||
content=None,
|
||||
failures=["解析器1失败", "解析器2失败"]
|
||||
)
|
||||
readers = [reader]
|
||||
|
||||
content, failures = parse_input("test.docx", readers)
|
||||
|
||||
assert content is None
|
||||
assert failures == ["解析器1失败", "解析器2失败"]
|
||||
|
||||
|
||||
class TestProcessContent:
|
||||
"""测试 process_content 函数。"""
|
||||
|
||||
def test_process_content_removes_images(self):
|
||||
"""测试移除图片标记。"""
|
||||
content = "测试内容  更多内容"
|
||||
result = process_content(content)
|
||||
|
||||
assert "" not in result
|
||||
assert "测试内容" in result
|
||||
assert "更多内容" in result
|
||||
|
||||
def test_process_content_normalizes_whitespace(self):
|
||||
"""测试规范化空白字符。"""
|
||||
content = "line1\n\n\n\nline2\n\n\nline3"
|
||||
result = process_content(content)
|
||||
|
||||
assert "line1\n\nline2\n\nline3" in result
|
||||
|
||||
def test_process_content_both_operations(self):
|
||||
"""测试同时执行两个操作。"""
|
||||
content = "\n\n\n\n正文"
|
||||
result = process_content(content)
|
||||
|
||||
assert "" not in result
|
||||
assert "\n\n\n\n" not in result
|
||||
|
||||
|
||||
class TestOutputResult:
|
||||
"""测试 output_result 函数。"""
|
||||
|
||||
def test_output_default(self, capsys):
|
||||
"""测试默认输出内容。"""
|
||||
args = argparse.Namespace(
|
||||
count=False,
|
||||
lines=False,
|
||||
titles=False,
|
||||
title_content=None,
|
||||
search=None,
|
||||
context=2
|
||||
)
|
||||
|
||||
output_result("测试内容", args)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert "测试内容" in captured.out
|
||||
|
||||
def test_output_count(self, capsys):
|
||||
"""测试字数统计。"""
|
||||
args = argparse.Namespace(
|
||||
count=True,
|
||||
lines=False,
|
||||
titles=False,
|
||||
title_content=None,
|
||||
search=None,
|
||||
context=2
|
||||
)
|
||||
|
||||
output_result("测试内容", args)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert captured.out.strip() == "4"
|
||||
|
||||
def test_output_lines(self, capsys):
|
||||
"""测试行数统计。"""
|
||||
args = argparse.Namespace(
|
||||
count=False,
|
||||
lines=True,
|
||||
titles=False,
|
||||
title_content=None,
|
||||
search=None,
|
||||
context=2
|
||||
)
|
||||
|
||||
output_result("line1\nline2\nline3", args)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert captured.out.strip() == "3"
|
||||
|
||||
    def test_output_titles(self, capsys):
        """Test heading extraction."""
        args = argparse.Namespace(
            count=False,
            lines=False,
            titles=True,
            title_content=None,
            search=None,
            context=2
        )

        content = "# 标题1\n正文\n## 标题2\n正文"
        output_result(content, args)

        captured = capsys.readouterr()
        assert "# 标题1" in captured.out
        assert "## 标题2" in captured.out

    def test_output_title_content_found(self, capsys):
        """Test extracting the content under a heading (heading found)."""
        args = argparse.Namespace(
            count=False,
            lines=False,
            titles=False,
            title_content="目标标题",
            search=None,
            context=2
        )

        content = "# 目标标题\n标题下的内容"

        with patch("sys.exit") as mock_exit:
            output_result(content, args)
            mock_exit.assert_not_called()

        captured = capsys.readouterr()
        assert "目标标题" in captured.out
        assert "标题下的内容" in captured.out

    def test_output_title_content_not_found(self, capsys):
        """Test extracting the content under a heading (heading not found)."""
        args = argparse.Namespace(
            count=False,
            lines=False,
            titles=False,
            title_content="不存在的标题",
            search=None,
            context=2
        )

        content = "# 标题1\n内容"

        with patch("sys.exit") as mock_exit:
            output_result(content, args)
            mock_exit.assert_called_once_with(1)

        captured = capsys.readouterr()
        # The CLI reports "未找到" (not found) or "错误" (error) on stdout.
        assert "未找到" in captured.out or "错误" in captured.out

    def test_output_search_found(self, capsys):
        """Test the search feature (match found)."""
        args = argparse.Namespace(
            count=False,
            lines=False,
            titles=False,
            title_content=None,
            search="关键词",
            context=2
        )

        content = "行1\n行2\n包含关键词的行\n行4\n行5"

        with patch("sys.exit") as mock_exit:
            output_result(content, args)
            mock_exit.assert_not_called()

        captured = capsys.readouterr()
        assert "关键词" in captured.out

    def test_output_search_not_found(self, capsys):
        """Test the search feature (no match)."""
        args = argparse.Namespace(
            count=False,
            lines=False,
            titles=False,
            title_content=None,
            search="不存在的内容",
            context=2
        )

        content = "普通内容"

        with patch("sys.exit") as mock_exit:
            output_result(content, args)
            mock_exit.assert_called_once_with(1)

        captured = capsys.readouterr()
        # The CLI reports "未找到" (not found) or "错误" (error) on stdout.
        assert "未找到" in captured.out or "错误" in captured.out
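The search tests pass both `search` and `context=2`, which suggests a keyword search that prints neighboring lines around each hit. A minimal sketch of that idea follows, assuming `context` means the number of lines kept before and after each matching line. `search_with_context` is a hypothetical helper, not the project's function, and the real output format may differ.

```python
from typing import List


def search_with_context(content: str, keyword: str, context: int = 2) -> List[str]:
    """Return matching lines plus `context` lines on each side, in document order."""
    lines = content.split("\n")
    keep = set()
    for index, line in enumerate(lines):
        if keyword in line:
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            keep.update(range(start, end))
    # Overlapping windows are merged because indices are stored in a set.
    return [lines[i] for i in sorted(keep)]
```

An empty result would correspond to the "not found" case, where the tests expect an error message and `sys.exit(1)`.
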
43
tests/test_readers/test_html_downloader.py
Normal file
@@ -0,0 +1,43 @@
"""Tests for the HTML downloader module."""

import pytest
from unittest.mock import patch, MagicMock

from readers.html.downloader import download_html
from readers.html.downloader import pyppeteer, selenium, httpx, urllib


class TestDownloadHtml:
    """Tests for download_html, the unified entry point."""

    def test_download_html_module_importable(self):
        """Test that download_html can be imported and is callable."""
        # Passing means the import raised no exception.
        assert callable(download_html)

    def test_downloaders_available(self):
        """Test that each downloader module is available."""
        assert callable(pyppeteer.download)
        assert callable(selenium.download)
        assert callable(httpx.download)
        assert callable(urllib.download)


class TestIndividualDownloaders:
    """Tests for the individual downloader modules."""

    def test_pyppeteer_download_callable(self):
        """Test that pyppeteer.download is callable."""
        assert callable(pyppeteer.download)

    def test_selenium_download_callable(self):
        """Test that selenium.download is callable."""
        assert callable(selenium.download)

    def test_httpx_download_callable(self):
        """Test that httpx.download is callable."""
        assert callable(httpx.download)

    def test_urllib_download_callable(self):
        """Test that urllib.download is callable (standard library)."""
        assert callable(urllib.download)
46
tests/test_utils/test_encoding_detection.py
Normal file
@@ -0,0 +1,46 @@
"""Tests for the encoding_detection module."""

import pytest
from unittest.mock import patch, MagicMock

from utils.encoding_detection import detect_encoding, read_text_file


class TestDetectEncoding:
    """Tests for the detect_encoding function."""

    def test_detect_encoding_file_not_exists(self, tmp_path):
        """Test a file that does not exist."""
        non_existent = str(tmp_path / "non_existent.txt")

        encoding, error = detect_encoding(non_existent)

        assert encoding is None
        assert error is not None


class TestReadTextFile:
    """Tests for the read_text_file function."""

    def test_read_simple_file(self, tmp_path):
        """Test reading a simple file."""
        file_path = tmp_path / "test.txt"
        content = "test content"
        file_path.write_text(content, encoding="utf-8")

        result, error = read_text_file(str(file_path))

        # chardet may not be installed, in which case the fallback encoding is used.
        # Passing means the call raised no exception.
        assert True

    def test_read_actual_file(self, tmp_path):
        """Test reading a file with real content."""
        file_path = tmp_path / "test.txt"
        content = "简单测试内容"
        file_path.write_text(content, encoding="utf-8")

        result, error = read_text_file(str(file_path))

        # The read should at least succeed (using the fallback encoding).
        assert result is not None or error is not None
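The comments in these tests refer to a fallback encoding used when chardet is unavailable, and both functions return a `(value, error)` pair instead of raising. A minimal sketch of that pattern follows. `read_text_with_fallback` and the `FALLBACK_ENCODINGS` list are hypothetical and are not the project's `read_text_file`.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical fallback order; the project's actual list is not shown in this diff.
FALLBACK_ENCODINGS = ("utf-8", "gbk", "latin-1")


def read_text_with_fallback(
    path: str, encodings: Sequence[str] = FALLBACK_ENCODINGS
) -> Tuple[Optional[str], Optional[str]]:
    """Try each encoding in order; return (content, None) or (None, error message)."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), None
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
        except OSError as exc:
            # Missing file, permission problem, etc.: no point trying other encodings.
            return None, f"cannot read {path}: {exc}"
    return None, f"none of {list(encodings)} could decode {path}"
```

With a helper shaped like this, a test could assert `result == content and error is None` for the UTF-8 case, which is stricter than checking that either value is set.
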