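lanyuanxiaoyao 78063b9e07 fix: move the pyarmor_runtime directory inside scripts

- Change the post-obfuscation file-moving logic in build.py: move the scripts directory first, then move pyarmor_runtime inside scripts
- Update the description of the post-obfuscation file layout in spec.md
- Update the testing rules in config.yaml to stress that skipping tests is strictly forbidden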

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-15 12:42:08 +08:00

.claude

feat: add static test file support for legacy doc/xls/ppt documents

2026-03-11 00:30:47 +08:00

.opencode

chore: initialize the lyxy-document project

2026-03-08 11:50:34 +08:00

openspec

fix: move the pyarmor_runtime directory inside scripts

2026-03-15 12:42:08 +08:00

scripts

fix: support invoking lyxy_document_reader.py from any path

2026-03-15 12:06:44 +08:00

tests

fix: support invoking lyxy_document_reader.py from any path

2026-03-15 12:06:44 +08:00

.gitattributes

feat: add static test file support for legacy doc/xls/ppt documents

2026-03-11 00:30:47 +08:00

.gitignore

feat: add multi-platform dependency support

2026-03-09 10:49:53 +08:00

AGENTS.md

chore: initialize the lyxy-document project

2026-03-08 11:50:34 +08:00

build.py

fix: move the pyarmor_runtime directory inside scripts

2026-03-15 12:42:08 +08:00

CLAUDE.md

chore: initialize the lyxy-document project

2026-03-08 11:50:34 +08:00

publish.py

feat: add skill publishing and improve the obfuscated build

2026-03-11 12:22:46 +08:00

publish.sh

feat: add skill publishing and improve the obfuscated build

2026-03-11 12:22:46 +08:00

README.md

feat: add the self-bootstrapping mechanism and remove the --advice flag

2026-03-11 23:49:39 +08:00

SKILL.md

feat: add the self-bootstrapping mechanism and remove the --advice flag

2026-03-11 23:49:39 +08:00

README.md

lyxy-document

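Unified document parsing tool: converts DOCX, XLS, XLSX, PPTX, PDF, and HTML/URL sources to Markdown

Project overview

A unified document parsing tool aimed at AI Skills. It parses multiple document formats into Markdown and provides full-text output, word counting, heading extraction, and content search.

Development environment

Use uv to run scripts and tests; do not use the host Python
Dependency management: load dependencies on demand with uv run --with
Self-bootstrapping: the script detects the required dependencies and runs itself with the correct uv command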

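Project architecture

scripts/
├── lyxy_document_reader.py    # CLI entry point (self-bootstrapping)
├── bootstrap.py                # Actual execution module
├── config.py                   # Configuration (including the DEPENDENCIES settings)
├── core/                       # Core modules
│   ├── parser.py              # Parse dispatch
│   ├── advice_generator.py    # Dependency detection and configuration generation
│   ├── markdown.py            # Markdown utilities
│   └── exceptions.py          # Exception definitions
├── readers/                    # Format readers
│   ├── base.py                # Reader base class
│   ├── docx/                  # DOCX parsers
│   ├── xls/                   # XLS parsers (legacy format)
│   ├── xlsx/                  # XLSX parsers
│   ├── pptx/                  # PPTX parsers
│   ├── pdf/                   # PDF parsers
│   └── html/                  # HTML/URL parsers
└── utils/                      # Utility functions
    ├── file_detection.py      # File type detection
    └── encoding_detection.py  # Encoding detection

tests/                           # Test suite
├── test_readers/               # Reader tests
│   └── fixtures/               # Static test files (managed with Git LFS)
│       └── xls/                # Legacy XLS test files
openspec/                        # OpenSpec specification documents
build.py                         # Build script (obfuscation mode)
publish.py                       # Publish script
publish.sh                       # One-step build and publish
README.md                        # This document (developer documentation)
SKILL.md                         # AI Skill documentation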

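Test fixtures conventions

Static test file directory

The tests/test_readers/fixtures/ directory holds pre-prepared static test files, in particular legacy-format files (.xls) that are hard to generate automatically from Python.

Directory rules

Static files only: files in this directory must be prepared in advance; tests must not generate temporary files in it at run time.
Use tmp_path for temporary files: when a test needs a temporary file, create it elsewhere with pytest's tmp_path fixture.
Managed with Git LFS: every file in this directory is tracked by Git LFS; see the .gitattributes configuration.

Fixtures

tests/test_readers/conftest.py provides the following static-file fixtures:

Directory path: xls_fixture_path
Individual files: simple_xls_path, among others

When a file does not exist, the fixture calls pytest.skip() automatically, which keeps CI stable.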

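Core concepts

Reader mechanism

Each document format has its own Reader package containing several parser implementations. The Reader base class defines the supports() and parse() methods; parsers are tried in order, and the result of the first one that succeeds is returned.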
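
The fallback behaviour described above could look roughly like the sketch below. It is an illustration only: the BaseReader shown here is a stand-in for scripts/readers/base.py, and the method signatures, the two demo parsers, and the parse_with_fallback helper are assumptions rather than the project's actual code.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseReader(ABC):
    """Stand-in for the real base class; signatures are assumed."""

    @abstractmethod
    def supports(self, path: str) -> bool:
        """Return True if this parser handles the given file."""

    @abstractmethod
    def parse(self, path: str) -> Optional[str]:
        """Return Markdown text, or None if nothing could be extracted."""


class FailingParser(BaseReader):
    """Hypothetical parser whose backend is unavailable."""

    def supports(self, path: str) -> bool:
        return path.endswith(".docx")

    def parse(self, path: str) -> Optional[str]:
        raise RuntimeError("backend unavailable")


class EchoParser(BaseReader):
    """Hypothetical parser that always succeeds."""

    def supports(self, path: str) -> bool:
        return path.endswith(".docx")

    def parse(self, path: str) -> Optional[str]:
        return f"# Parsed {path}"


def parse_with_fallback(path: str, parsers: List[BaseReader]) -> str:
    """Try each parser in order and return the first successful result."""
    errors = []
    for parser in parsers:
        if not parser.supports(path):
            continue
        try:
            result = parser.parse(path)
        except Exception as exc:  # a failed parser must not stop the chain
            errors.append(f"{type(parser).__name__}: {exc}")
            continue
        if result:
            return result
    raise RuntimeError(f"no parser succeeded for {path}: {errors}")
```

Collecting the per-parser errors keeps the final failure message useful when every implementation in the chain fails.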

Dependency configuration (config.DEPENDENCIES)

Dependencies are organized by file type and platform:

DEPENDENCIES = {
    "pdf": {
        "default": {
            "python": None,
            "dependencies": ["docling", "unstructured[pdf]", ...]
        },
        "Darwin-x86_64": {
            "python": "3.12",
            "dependencies": ["docling==2.40.0", ...]
        }
    },
    ...
}
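
One way to look up an entry in a structure like this is sketched below. The platform key format (system and machine joined by a hyphen) is inferred from the "Darwin-x86_64" key, the SAMPLE_DEPENDENCIES dictionary is an abbreviated stand-in for the real configuration, and the helper names are hypothetical.

```python
import platform
from typing import Optional

# Abbreviated stand-in for config.DEPENDENCIES; the real lists are longer.
SAMPLE_DEPENDENCIES = {
    "pdf": {
        "default": {
            "python": None,
            "dependencies": ["docling", "unstructured[pdf]"],
        },
        "Darwin-x86_64": {
            "python": "3.12",
            "dependencies": ["docling==2.40.0"],
        },
    },
}


def current_platform_key() -> str:
    """Build a key such as 'Darwin-x86_64' (assumed format)."""
    return f"{platform.system()}-{platform.machine()}"


def resolve_dependencies(file_type: str, key: Optional[str] = None) -> dict:
    """Return the platform-specific entry, falling back to 'default'."""
    by_platform = SAMPLE_DEPENDENCIES[file_type]
    return by_platform.get(key or current_platform_key(), by_platform["default"])
```

With this layout, only platforms that need pinned versions require their own entry; every other platform falls through to "default".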

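Self-bootstrapping mechanism

The entry script identifies the file type from the file extension, detects the current platform, reads the matching entry from config.DEPENDENCIES, and then generates and runs the correct uv run --with command.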
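
As a rough illustration of the command-generation step, the sketch below turns one configuration entry into a uv argument list. The build_uv_command helper and the choice to launch bootstrap.py through python are assumptions; only the --python and --with flags are taken from the commands shown elsewhere in this document.

```python
from typing import List


def build_uv_command(script: str, args: List[str], config: dict) -> List[str]:
    """Assemble a 'uv run' argument list from one DEPENDENCIES entry."""
    cmd = ["uv", "run"]
    if config.get("python"):
        # Pin the interpreter only when the entry asks for one.
        cmd += ["--python", config["python"]]
    for dependency in config["dependencies"]:
        cmd += ["--with", dependency]
    # Assumed: the real work happens in a separate module run by python.
    cmd += ["python", script, *args]
    return cmd


if __name__ == "__main__":
    entry = {"python": "3.12", "dependencies": ["docling==2.40.0", "pypdf"]}
    print(" ".join(build_uv_command("scripts/bootstrap.py", ["report.pdf"], entry)))
```

Returning a list rather than a single string lets the caller pass it to subprocess.run without shell quoting concerns.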

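Quick start

Verify the environment

First check that the project runs correctly:

# Test the parsing feature (dependencies are detected and the command is run automatically)
python scripts/lyxy_document_reader.py "https://example.com"

Run the basic tests

# Run the CLI tests (checks the project's basic functionality)
uv run \
  --with pytest \
  pytest tests/test_cli/ -v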

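Development guide

Test prerequisites

Importing the HtmlReader module loads cleaner.py, but cleaner.py now imports its third-party libraries dynamically, so the import itself needs no extra dependencies.

beautifulsoup4 and chardet are required only when the HTML features are actually used, not at module import time.

How to add a new Reader

Create a new directory under scripts/readers/
Subclass BaseReader and implement supports() and parse()
Register it in scripts/readers/__init__.py
Add its dependency configuration to config.DEPENDENCIES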
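
The steps above might translate into something like the following sketch for a plain-text format. Everything named here is hypothetical: the BaseReader stand-in, the TxtReader class, the READERS list used for registration, and the shape of the new configuration entry are assumptions, since the real interfaces in scripts/readers/ are not shown in this document.

```python
from pathlib import Path


class BaseReader:
    """Stand-in for scripts/readers/base.py; the real interface may differ."""

    def supports(self, path: str) -> bool:
        raise NotImplementedError

    def parse(self, path: str) -> str:
        raise NotImplementedError


class TxtReader(BaseReader):
    """Hypothetical reader that wraps a plain-text file as Markdown."""

    def supports(self, path: str) -> bool:
        return Path(path).suffix.lower() == ".txt"

    def parse(self, path: str) -> str:
        text = Path(path).read_text(encoding="utf-8")
        # Use the file name as the top-level heading.
        return f"# {Path(path).stem}\n\n{text.strip()}\n"


# Assumed registration style for scripts/readers/__init__.py.
READERS = [TxtReader()]

# Assumed shape of the matching config.DEPENDENCIES entry.
TXT_DEPENDENCIES = {"txt": {"default": {"python": None, "dependencies": []}}}
```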

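How to test

The project includes a complete test suite covering the CLI and every Reader implementation. Use the uv run --with command that matches the type of test.

Run all tests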

uv run \
  --with pytest \
  --with pytest-cov \
  pytest

Test the DOCX reader

uv run \
  --with pytest \
  --with docling \
  --with "unstructured[docx]" \
  --with "markitdown[docx]" \
  --with pypandoc-binary \
  --with python-docx \
  --with markdownify \
  pytest tests/test_readers/test_docx/

Test the XLSX reader

uv run \
  --with pytest \
  --with docling \
  --with "unstructured[xlsx]" \
  --with "markitdown[xlsx]" \
  --with pandas \
  --with tabulate \
  pytest tests/test_readers/test_xlsx/

Test the PPTX reader

uv run \
  --with pytest \
  --with docling \
  --with "unstructured[pptx]" \
  --with "markitdown[pptx]" \
  --with python-pptx \
  --with markdownify \
  pytest tests/test_readers/test_pptx/

Test the PDF reader

# Default command (macOS ARM, Linux, Windows)
uv run \
  --with pytest \
  --with docling \
  --with "unstructured[pdf]" \
  --with "markitdown[pdf]" \
  --with pypdf \
  --with markdownify \
  --with reportlab \
  pytest tests/test_readers/test_pdf/

# Special command for macOS x86_64 (Intel)
uv run \
  --python 3.12 \
  --with pytest \
  --with "docling==2.40.0" \
  --with "docling-parse==4.0.0" \
  --with "numpy<2" \
  --with "markitdown[pdf]" \
  --with pypdf \
  --with markdownify \
  --with reportlab \
  pytest tests/test_readers/test_pdf/

Test the HTML reader

uv run \
  --with pytest \
  --with trafilatura \
  --with domscribe \
  --with markitdown \
  --with html2text \
  --with beautifulsoup4 \
  --with httpx \
  --with chardet \
  pytest tests/test_readers/test_html/

Test the XLS reader (legacy format, uses static files)

uv run \
  --with pytest \
  --with "unstructured[xlsx]" \
  --with "markitdown[xls]" \
  --with pandas \
  --with tabulate \
  --with xlrd \
  pytest tests/test_readers/test_xls/

Run a specific test file or method

# Run a specific test file (CLI tests need no extra dependencies)
uv run \
  --with pytest \
  pytest tests/test_cli/test_main.py

# Run a specific test class or method
uv run \
  --with pytest \
  --with docling \
  pytest tests/test_cli/test_main.py::TestCLIDefaultOutput::test_default_output_docx

View test coverage

uv run \
  --with pytest \
  --with pytest-cov \
  pytest --cov=scripts --cov-report=term-missing

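Code conventions

Language: Chinese only (communication, comments, documentation, code)
Module files: 150-300 lines each
Error handling: custom exceptions, clear messages, and location context
Git commits: type: short description (feat/fix/refactor/docs/style/test/chore)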
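
The commit format above can be checked mechanically. The sketch below is one possible validator, assuming a half-width colon followed by a single space as in the repository's existing commits; the helper name is hypothetical and not part of the project.

```python
import re

# Commit types listed in the convention above.
COMMIT_TYPES = ("feat", "fix", "refactor", "docs", "style", "test", "chore")

# "<type>: <description>", where the description starts with a non-space.
COMMIT_PATTERN = re.compile(rf"^({'|'.join(COMMIT_TYPES)}): \S.*$")


def is_valid_commit_subject(subject: str) -> bool:
    """Return True if the subject line follows 'type: short description'."""
    return COMMIT_PATTERN.match(subject) is not None
```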

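Build and publish

Build script

The project provides build.py to build the Skill package, using PyArmor to obfuscate the code:

uv run --with pyarmor python build.py

Build artifacts are written to the build/ directory and include:

SKILL.md (with version and author injected dynamically)
scripts/ (the obfuscated code)

Publish script

publish.py publishes automatically to the target repository:

uv run python publish.py

Publishing steps:

Clone https://github.com/lanyuanxiaoyao/skills.git into a temporary directory (--depth 1)
Empty the skills/lyxy-document-reader/ directory
Copy the contents of build/ to the target path
Commit with Git and push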
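
For orientation, the four steps could be sketched as below. This is not the contents of publish.py: the function names, the commit message argument, and the use of shutil and subprocess are assumptions, and only the repository URL, the --depth 1 option, and the target path come from the list above.

```python
import shutil
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/lanyuanxiaoyao/skills.git"
TARGET_SUBDIR = "skills/lyxy-document-reader"


def replace_directory(source: Path, target: Path) -> None:
    """Steps 2 and 3: drop the old target contents, then copy source in."""
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(source, target)  # also creates missing parent directories


def publish(build_dir: Path, workdir: Path, message: str) -> None:
    """Clone, replace the skill directory, commit, and push."""
    clone = workdir / "skills"
    subprocess.run(
        ["git", "clone", "--depth", "1", REPO_URL, str(clone)], check=True
    )
    replace_directory(build_dir, clone / TARGET_SUBDIR)
    subprocess.run(["git", "add", "-A"], cwd=clone, check=True)
    # Note: git commit exits non-zero when there is nothing to commit.
    subprocess.run(["git", "commit", "-m", message], cwd=clone, check=True)
    subprocess.run(["git", "push"], cwd=clone, check=True)
```

Removing the whole target directory before copying ensures that files deleted from build/ do not linger in the published skill.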

One-step publish

Use publish.sh to build and publish in one step:

./publish.sh

Documentation

README.md (this document): for project developers
SKILL.md: the Skill documentation intended for AI use
openspec/: OpenSpec specification documents

License

MIT License
