lyxy-document

Author	SHA1	Message	Date
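lanyuanxiaoyao	9daff73589	refactor: adjust module import paths and simplify the reference structure - Update the git task notes in openspec/config.yaml - Change scripts.core.* to core.* and scripts.readers.* to readers.* - Improve how sys.path is set in lyxy_document_reader.py - Update the import paths in all test files to match	2026-03-09 15:44:51 +08:00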
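lanyuanxiaoyao	b80c635f07	refactor: harden exception handling in the fallback chain. Add a try-except guard to the parser loop of every Reader so that the fallback chain keeps trying the next parser even if a parser raises an unexpected exception. Main changes: - Add a guard to the parse method of every Reader (DocxReader/PdfReader/XlsxReader/PptxReader/HtmlReader) that catches unexpected exceptions and tags them as [意外异常] (unexpected exception) - Add exception handling to cleaner.clean_html_content(), which now returns a (content, error) tuple - Update the cleaner call in HtmlReader.parse() to handle the new return format - Add detailed exception-handling guidelines to the BaseReader documentation. Design principle: two-layer exception protection - Parser layer: catches expected parsing failures (library not installed, format not supported) - Reader layer: catches unexpected programming errors (NoneType, index out of range, etc.)	2026-03-09 00:26:51 +08:00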
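lanyuanxiaoyao	09904aefdc	refactor: remove the unused supported_extensions attribute from BaseReader. Remove the supported_extensions attribute from the BaseReader abstract base class and all Reader subclasses; it was never actually called anywhere in the codebase and existed only as metadata.	2026-03-08 22:56:32 +08:00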
lanyuanxiaoyao	7eab1dcef1	test: add a comprehensive test suite covering all Reader implementations - Test count increased from 83 to 193 (+132%) - Code coverage raised from 48% to 69% (+44%) - Create separate tests for every Reader implementation of each document format - Add cross-Reader consistency tests - Add 4 new test specs (cli-testing, exception-testing, reader-testing, test-fixtures) - Update the test statistics in the README. Test coverage: - DOCX: python-docx, markitdown, docling, native-xml, pypandoc, unstructured - PDF: pypdf, markitdown, docling, docling-ocr, unstructured, unstructured-ocr - HTML: html2text, markitdown, trafilatura, domscribe - PPTX: python-pptx, markitdown, docling, native-xml, unstructured - XLSX: pandas, markitdown, docling, native-xml, unstructured - CLI: all command-line options and error handling. All 193 tests pass.	2026-03-08 22:20:21 +08:00
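lanyuanxiaoyao	15b63800a8	refactor: move core code into the scripts directory - Create scripts/ as the root directory for core code - Move core/, readers/, utils/ under scripts/ - Move config.py and lyxy_document_reader.py to scripts/ - Move encoding_detection.py to scripts/utils/ - Update the entry point path and pytest configuration in pyproject.toml - Update all internal import statements to use scripts.* modules - Update the directory structure description in README.md - Update openspec/config.yaml to add a directory structure description - Delete the unused main.py. This change makes the project structure clearer and makes it easier to separate core code from supporting files such as tests and documentation.	2026-03-08 17:41:03 +08:00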

5 Commits
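
The two-layer exception protection described in commit b80c635f07 could look roughly like the sketch below. This is not the repository's code: `parse_with_fallback`, the three demo parsers, and the English failure tag are hypothetical names chosen for illustration. The only things taken from the commit message are the (content, error) return convention and the split between expected failures (Parser layer) and unexpected ones (Reader layer).

```python
from typing import Callable, List, Optional, Tuple

# Assumed convention: a parser returns (content, error) and never raises
# for failures it expects (missing library, unsupported format).
ParserResult = Tuple[Optional[str], Optional[str]]
NamedParser = Tuple[str, Callable[[str], ParserResult]]


def parse_with_fallback(
    path: str, parsers: List[NamedParser]
) -> Tuple[Optional[str], List[str]]:
    """Try each parser in order; return the first content plus all failures."""
    failures: List[str] = []
    for name, parser in parsers:
        try:
            content, error = parser(path)
        except Exception as exc:
            # Reader layer: an unexpected programming error must not
            # break the chain, so record it and try the next parser.
            failures.append(
                f"{name}: [unexpected exception] {type(exc).__name__}: {exc}"
            )
            continue
        if content is not None:
            return content, failures
        failures.append(f"{name}: {error}")
    return None, failures


def missing_library_parser(path: str) -> ParserResult:
    # Parser layer: an expected failure is reported, not raised.
    try:
        import a_library_that_is_not_installed  # noqa: F401
    except ImportError:
        return None, "library not installed"
    return "unreachable", None


def buggy_parser(path: str) -> ParserResult:
    data = None
    return data["text"], None  # raises TypeError (a NoneType bug)


def working_parser(path: str) -> ParserResult:
    return f"# Parsed {path}", None


content, failures = parse_with_fallback(
    "demo.docx",
    [
        ("missing", missing_library_parser),
        ("buggy", buggy_parser),
        ("working", working_parser),
    ],
)
```

Here the first parser fails in the expected way, the second crashes, and the third still gets a chance to run; `failures` keeps both earlier errors so the caller can report why the preferred parsers were skipped.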