
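lanyuanxiaoyao aaa1171e60 feat: add --advice parameter for quickly getting execution advice

- Add the advice generator module scripts/core/advice_generator.py
- Add the DEPENDENCIES dependency configuration to config.py
- Add the -a/--advice parameter to lyxy_document_reader.py
- Reuse the supports method of Reader instances to detect file types
- Support platform detection, returning a special command for PDF on macOS x86_64
- Add unit tests and integration tests
- Update SKILL.md to steer users toward the --advice parameter first
- Update README.md with a project structure description
- Add the specification document openspec/specs/cli-advice/spec.md

2026-03-09 18:13:00 +08:00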

.claude

chore: update Claude Code permission settings

2026-03-09 14:39:44 +08:00

.opencode

chore: initialize the lyxy-document project

2026-03-08 11:50:34 +08:00

openspec

feat: add --advice parameter for quickly getting execution advice

2026-03-09 18:13:00 +08:00

scripts

feat: add --advice parameter for quickly getting execution advice

2026-03-09 18:13:00 +08:00

tests

feat: add --advice parameter for quickly getting execution advice

2026-03-09 18:13:00 +08:00

.gitattributes

test: add a comprehensive test suite covering all Reader implementations

2026-03-08 22:20:21 +08:00

.gitignore

feat: add multi-platform dependency support

2026-03-09 10:49:53 +08:00

AGENTS.md

chore: initialize the lyxy-document project

2026-03-08 11:50:34 +08:00

build.py

feat: add PyArmor code obfuscation support

2026-03-09 14:36:52 +08:00

CLAUDE.md

chore: initialize the lyxy-document project

2026-03-08 11:50:34 +08:00

README.md

feat: add --advice parameter for quickly getting execution advice

2026-03-09 18:13:00 +08:00

SKILL.md

feat: add --advice parameter for quickly getting execution advice

2026-03-09 18:13:00 +08:00

README.md

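lyxy-document

Unified document parsing tool: converts DOCX, XLSX, PPTX, PDF, and HTML/URL to Markdown

Development Environment

Use uv to run scripts and tests; do not use the host Python
Dependency management: load dependencies on demand with uv run --with
Quick advice: use the -a/--advice parameter to see the command to run, with no need to look up dependencies manually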

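Project Structure

scripts/                    # Core code
├── core/                   # Core modules
│   ├── advice_generator.py # Execution advice generator (new)
│   ├── parser.py           # Parse dispatching
│   ├── exceptions.py       # Exception definitions
│   └── markdown.py         # Markdown utilities
├── readers/                # Format readers
├── utils/                  # Utility functions
└── config.py               # Configuration (includes the DEPENDENCIES dependency configuration)
tests/                      # Tests
openspec/                   # Specification documents
skill/                      # SKILL documentation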

Development Workflow

Run tests and development tools with uv run --with:

# Run tests (requires pytest)
uv run \
  --with pytest \
  --with pytest-cov \
  --with chardet \
  pytest

# Run tests and show coverage
uv run \
  --with pytest \
  --with pytest-cov \
  --with chardet \
  pytest --cov=scripts --cov-report=term-missing

# Run a specific test file or directory
uv run \
  --with pytest \
  --with chardet \
  pytest tests/test_readers/test_docx/

# Run a specific test class or method
uv run \
  --with pytest \
  --with chardet \
  pytest tests/test_cli/test_main.py::TestCLIDefaultOutput::test_default_output_docx

# Format code
uv run \
  --with black \
  --with isort \
  --with chardet \
  bash -c "black . && isort ."

# Type checking
uv run \
  --with mypy \
  --with chardet \
  mypy .

Testing the DOCX reader:

uv run \
  --with pytest \
  --with docling \
  --with "unstructured[docx]" \
  --with "markitdown[docx]" \
  --with pypandoc-binary \
  --with python-docx \
  --with markdownify \
  --with chardet \
  pytest tests/test_readers/test_docx/

Testing the PDF reader:

# Default command (macOS ARM, Linux, Windows)
uv run \
  --with pytest \
  --with docling \
  --with "unstructured[pdf]" \
  --with "markitdown[pdf]" \
  --with pypdf \
  --with markdownify \
  --with chardet \
  pytest tests/test_readers/test_pdf/

# Special command for macOS x86_64 (Intel)
uv run \
  --python 3.12 \
  --with pytest \
  --with "docling==2.40.0" \
  --with "docling-parse==4.0.0" \
  --with "numpy<2" \
  --with "markitdown[pdf]" \
  --with pypdf \
  --with markdownify \
  --with chardet \
  pytest tests/test_readers/test_pdf/
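
The choice between the two PDF commands can be sketched in Python. This is a minimal illustration only, not the project's advice_generator.py: the function name pdf_test_command and the two constants are hypothetical, and the package lists are copied from the commands shown for the PDF reader.

```python
import platform
from typing import List, Optional

# Package lists copied from the two PDF test commands in this README.
DEFAULT_PDF_PACKAGES = [
    "pytest", "docling", "unstructured[pdf]", "markitdown[pdf]",
    "pypdf", "markdownify", "chardet",
]
MACOS_INTEL_PDF_PACKAGES = [
    "pytest", "docling==2.40.0", "docling-parse==4.0.0", "numpy<2",
    "markitdown[pdf]", "pypdf", "markdownify", "chardet",
]


def pdf_test_command(system: Optional[str] = None,
                     machine: Optional[str] = None) -> List[str]:
    """Build the argument list for the PDF reader tests (hypothetical helper).

    system and machine default to the values reported for the current host.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    command = ["uv", "run"]
    if system == "Darwin" and machine == "x86_64":
        # Intel Macs get a pinned Python version and pinned package versions.
        command += ["--python", "3.12"]
        packages = MACOS_INTEL_PDF_PACKAGES
    else:
        packages = DEFAULT_PDF_PACKAGES

    for package in packages:
        command += ["--with", package]
    return command + ["pytest", "tests/test_readers/test_pdf/"]
```

Passing system and machine explicitly makes the branch testable on any host; calling pdf_test_command() with no arguments uses the current platform.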

Testing other formats:

# XLSX reader
uv run \
  --with pytest \
  --with docling \
  --with "unstructured[xlsx]" \
  --with "markitdown[xlsx]" \
  --with pandas \
  --with tabulate \
  --with chardet \
  pytest tests/test_readers/test_xlsx/

# PPTX reader
uv run \
  --with pytest \
  --with docling \
  --with "unstructured[pptx]" \
  --with "markitdown[pptx]" \
  --with python-pptx \
  --with markdownify \
  --with chardet \
  pytest tests/test_readers/test_pptx/

# HTML reader
uv run \
  --with pytest \
  --with trafilatura \
  --with domscribe \
  --with markitdown \
  --with html2text \
  --with beautifulsoup4 \
  --with httpx \
  --with chardet \
  pytest tests/test_readers/test_html/

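Tests

The project includes a test suite that covers the CLI and all Reader implementations:

Test coverage: 69%
Number of tests: 193
Test types:
- CLI functional tests (word count, line count, heading extraction, search, etc.)
- Reader parsing tests (DOCX, PDF, HTML, PPTX, XLSX)
- Multi-Reader implementation tests (several parsing libraries tested per format)
- Error scenario tests (missing file, empty file, corrupted file, special characters)
- Encoding tests (GBK, UTF-8 BOM, etc.)
- Consistency tests (verifying that different Readers produce consistent results)

Before running tests, use uv run --with to load the dependencies required for the type of test being run. See the "Development Workflow" section above.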
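
As an illustration of the encoding test category (GBK, UTF-8 BOM), a test might look roughly like the sketch below. It is not taken from the project's test suite: decode_text is a hypothetical helper that tries a fixed list of encodings using only the standard library, whereas the project's own commands load chardet.

```python
from typing import Sequence


def decode_text(data: bytes,
                encodings: Sequence[str] = ("utf-8-sig", "gbk")) -> str:
    """Decode bytes by trying each encoding in order (hypothetical helper).

    "utf-8-sig" decodes UTF-8 with or without a leading BOM and strips the BOM.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode data with: " + ", ".join(encodings))


def test_decodes_gbk():
    # GBK bytes are not valid UTF-8, so the helper falls through to "gbk".
    assert decode_text("标题".encode("gbk")) == "标题"


def test_strips_utf8_bom():
    data = b"\xef\xbb\xbf" + "标题".encode("utf-8")
    assert decode_text(data) == "标题"
```

Functions named test_* are collected by pytest automatically, so a file like this would run under the commands in the "Development Workflow" section.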

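Code Conventions

Language: Chinese only (communication, comments, documentation, code)
Module files: 150-300 lines
Error handling: custom exceptions + clear messages + location context
Git commits: type: short description (feat/fix/refactor/docs/style/test/chore)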
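
The error handling convention (custom exception, clear message, location context) could be sketched as follows. The class name DocumentParseError and its fields are hypothetical and are not taken from scripts/core/exceptions.py.

```python
from typing import Optional


class DocumentParseError(Exception):
    """Hypothetical custom exception carrying a message and location context."""

    def __init__(self, message: str, path: str, location: Optional[str] = None):
        self.message = message
        self.path = path          # file being parsed
        self.location = location  # e.g. a page, sheet, or table within the file
        detail = path if location is None else f"{path} ({location})"
        super().__init__(f"{message}: {detail}")


try:
    raise DocumentParseError("Failed to read table", "report.docx", "table 3")
except DocumentParseError as err:
    error_text = str(err)  # "Failed to read table: report.docx (table 3)"
```

Keeping path and location as attributes lets callers log or branch on them without parsing the message string.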

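Skill Documentation Conventions

skill/SKILL.md is written for AI users and must follow the best practices of the Claude Skill building guide:

YAML frontmatter

name: kebab-case format
description: covers what the skill does, trigger words, file types, and typical tasks
license: MIT
metadata: includes version and author
compatibility: states the Python version requirement and the dependency situation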
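
A check for these frontmatter rules could be sketched as below. This assumes the frontmatter has already been parsed into a dict (for example by a YAML library); validate_frontmatter is a hypothetical helper, and it checks only the rules listed here.

```python
import re
from typing import Dict, List

REQUIRED_KEYS = ("name", "description", "license", "metadata", "compatibility")
# Lowercase words or digits separated by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def validate_frontmatter(frontmatter: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in frontmatter:
            problems.append(f"missing key: {key}")

    name = frontmatter.get("name")
    if isinstance(name, str) and not KEBAB_CASE.match(name):
        problems.append(f"name is not kebab-case: {name}")

    metadata = frontmatter.get("metadata")
    if isinstance(metadata, dict):
        for key in ("version", "author"):
            if key not in metadata:
                problems.append(f"metadata is missing: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report everything that needs fixing.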

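Document Section Structure

Purpose: describes the unified entry point and the dual-path execution strategy
When to Use: typical scenarios and a list of trigger words (Chinese and English, file extensions)
Quick Reference: table of command parameters
Workflow: 4-step workflow (detect environment, identify type, run parsing, output result)
Usage Examples: basic and advanced usage for each document type
Error Handling: common errors and their solutions
References: links to the project documentation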

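Dependency Management

Load dependencies on demand with uv run --with
Always use concrete pip package names
Use the -a/--advice parameter to quickly get the command to run for a specific file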
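
To make the uv run --with pattern concrete, the sketch below assembles such a command from a list of pip package names. build_uv_command is a hypothetical helper, not the project's advice generator, and the script name in the usage line is a placeholder.

```python
import shlex
from typing import List


def build_uv_command(packages: List[str], run_args: List[str]) -> str:
    """Join packages and the program to run into one shell-safe command line."""
    parts = ["uv", "run"]
    for package in packages:
        parts += ["--with", package]
    parts += run_args
    # shlex.join quotes arguments such as "markitdown[docx]" for the shell.
    return shlex.join(parts)


command = build_uv_command(
    ["markitdown[docx]", "python-docx"],
    ["python", "reader.py", "report.docx"],  # placeholder script and input file
)
```

Quoting matters because extras such as [docx] contain characters that some shells treat as glob patterns.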

Parser Architecture

DOCX

docling, unstructured, pypandoc-binary, MarkItDown, python-docx, XML

XLSX

docling, unstructured, MarkItDown, pandas, XML

PPTX

docling, unstructured, MarkItDown, python-pptx, XML

PDF (OCR first)

docling OCR, unstructured OCR, docling, unstructured, MarkItDown, pypdf

HTML/URL

trafilatura, domscribe, MarkItDown, html2text
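
One common way to use several parsers per format is a fallback chain: try each in order and keep the first usable result. The README does not state how the project selects among its parsers, so treat the ordering and the logic below as assumptions; parse_with_fallback and the Parser type are hypothetical.

```python
from typing import Callable, List, Tuple

# A parser takes a file path and returns Markdown text.
Parser = Callable[[str], str]


def parse_with_fallback(path: str,
                        parsers: List[Tuple[str, Parser]]) -> Tuple[str, str]:
    """Try each (name, parser) pair in order; return (name, markdown)."""
    failures = []
    for name, parser in parsers:
        try:
            content = parser(path)
        except Exception as exc:
            # A missing library or a parse error moves on to the next parser.
            failures.append(f"{name}: {exc}")
            continue
        if content.strip():
            return name, content
        failures.append(f"{name}: empty result")
    raise RuntimeError(f"all parsers failed for {path}: " + "; ".join(failures))
```

Collecting the failure reasons keeps the final error message informative when every parser in the chain fails.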

License

MIT License
