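# Parser Overview and Dependency Installation

## Multi-Strategy Parsing with Fallback

Each file format has several parsers, tried in priority order. If one parser fails, the next one in the list is tried automatically.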

For the detailed parser priorities and a comparison, see `scripts/README.md`.
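
The fallback behaviour described in this section can be sketched as follows. This is a minimal illustration, not the code in `scripts/parser.py`: the function name `parse_with_fallback`, the `(name, parser)` pair shape, and the choice to treat empty output as a failure are all assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple

# A parser takes a file path and returns Markdown text (hypothetical signature).
Parser = Callable[[str], Optional[str]]


def parse_with_fallback(
    path: str, parsers: Iterable[Tuple[str, Parser]]
) -> Tuple[str, str]:
    """Try each (name, parser) pair in order; return the first usable result."""
    errors = []
    for name, parser in parsers:
        try:
            text = parser(path)
        except Exception as exc:
            # A missing dependency (ImportError) or a parse error both
            # cause a fall through to the next parser.
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        # Assumption: empty output counts as a failure too.
        errors.append(f"{name}: empty output")
    raise RuntimeError("all parsers failed:\n" + "\n".join(errors))
```

Returning the parser name alongside the text makes it easy to log which strategy actually produced the output.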

## Dependency Installation

### Using uv (recommended)

```bash
# DOCX - all dependencies
uv run --with docling --with "unstructured[docx]" --with markdownify --with pypandoc-binary --with "markitdown[docx]" --with python-docx scripts/parser.py /path/to/file.docx

# PPTX - all dependencies
uv run --with docling --with "unstructured[pptx]" --with markdownify --with "markitdown[pptx]" --with python-pptx scripts/parser.py /path/to/file.pptx

# XLSX - all dependencies
uv run --with docling --with "unstructured[xlsx]" --with markdownify --with "markitdown[xlsx]" --with pandas --with tabulate scripts/parser.py /path/to/file.xlsx

# PDF - all dependencies (basic text extraction)
uv run --with docling --with "unstructured[pdf]" --with markdownify --with "markitdown[pdf]" --with pypdf scripts/parser.py /path/to/file.pdf

# PDF high-accuracy OCR mode (all dependencies)
uv run --with docling --with "unstructured[pdf]" --with unstructured-paddleocr --with "paddlepaddle==2.6.2" --with ml-dtypes --with markdownify --with "markitdown[pdf]" --with pypdf scripts/parser.py /path/to/file.pdf --high-res
```

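> **Note**: The commands above pull in the full dependency set, covering every parser for the best compatibility. For the detailed parser priorities and a comparison, see `scripts/README.md`.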

## Output Characteristics by Format

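- **DOCX**: standard Markdown document structure
- **PPTX**: each slide is headed `## Slide N`, and slides are separated by `---`
- **XLSX**: worksheets are distinguished by `## SheetName` headings, with data rendered as Markdown tables
- **PDF**: plain text stream; `--high-res` enables OCR layout analysis so that headings can be detected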
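
To make the PPTX and XLSX layouts above concrete, here is a small sketch that produces text in those shapes. The helper names `render_slides` and `render_sheet` are hypothetical, and the details (numbering slides from 1, blank lines around `---`, treating the first row as the table header) are assumptions rather than documented behaviour.

```python
from typing import List


def render_slides(slides: List[str]) -> str:
    """Join per-slide text under '## Slide N' headings, separated by '---'."""
    parts = [
        f"## Slide {i}\n\n{body.strip()}"
        for i, body in enumerate(slides, start=1)  # assumption: N starts at 1
    ]
    return "\n\n---\n\n".join(parts) + "\n"


def render_sheet(name: str, rows: List[List[str]]) -> str:
    """Render one worksheet as a '## SheetName' heading plus a Markdown table."""
    header, *body = rows  # assumption: the first row holds the column names
    lines = [
        f"## {name}",
        "",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    # Cells containing '|' would need escaping; omitted to keep the sketch short.
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines) + "\n"
```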

## Capabilities

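### 1. Convert the Full Document to Markdown
Parses the complete document into Markdown, removing images while preserving text formatting (headings, lists, tables, bold, italics, and so on).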
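
One way image removal could work on already-converted Markdown is sketched below. The function name `strip_images` is hypothetical, and the approach (regex removal of inline images and `<img>` tags) is an assumption; the real parsers may drop images earlier, during conversion.

```python
import re

# Inline Markdown images: ![alt](target "optional title")
_INLINE_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# HTML image tags that some converters emit
_HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown: str) -> str:
    """Remove images but leave headings, lists, tables, and emphasis intact."""
    text = _INLINE_IMAGE.sub("", markdown)
    text = _HTML_IMAGE.sub("", text)
    # Collapse the runs of blank lines left behind by removed images.
    return re.sub(r"\n{3,}", "\n\n", text)
```

This sketch does not handle reference-style images or image targets that contain parentheses.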

### 2. Get Document Metadata
- Word count (`-c` option)
- Line count (`-l` option)
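
As a rough illustration of these two statistics, the sketch below counts over the converted Markdown text. How `-c` actually counts (words versus characters, with or without Markdown markers) is not specified here, so the whitespace-token rule and the name `text_stats` are assumptions.

```python
from typing import Dict


def text_stats(markdown: str) -> Dict[str, int]:
    """Return simple word and line counts for converted Markdown text."""
    return {
        # Assumption: a "word" is any whitespace-separated token,
        # so Markdown markers such as '#' are counted as well.
        "words": len(markdown.split()),
        "lines": len(markdown.splitlines()),
    }
```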

### 3. Extract the Heading List
Extracts all level 1-6 headings in the document (`-t` option) and returns them in their original hierarchy.
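
Heading extraction over Markdown output might look like the following sketch. It assumes ATX-style headings (`#` through `######`) and skips fenced code blocks so that shell comments are not mistaken for headings; the name `extract_headings` and the `(level, title)` return shape are hypothetical.

```python
import re
from typing import List, Tuple

# One to six '#' characters, whitespace, then the heading text.
_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
_FENCE = re.compile(r"^\s*(```|~~~)")


def extract_headings(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs in document order."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if _FENCE.match(line):
            in_fence = not in_fence  # entering or leaving a code block
            continue
        if in_fence:
            continue
        match = _HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings
```

Keeping the level with each title lets a caller rebuild the hierarchy, for example by indenting each title by its level.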

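### 4. Extract a Specific Section
Extracts the full content of a section by heading name (`-tc` option), including the chain of parent headings and all nested subsections.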
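
The section lookup can be sketched as a single pass that tracks the currently open parent headings. This is an illustration under assumptions: exact title matching, first match wins, and the name `extract_section` are all hypothetical, and fenced code blocks are not special-cased here for brevity.

```python
import re
from typing import List, Optional, Tuple

_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def extract_section(markdown: str, title: str) -> Optional[str]:
    """Return the parent heading chain plus the section body, or None."""
    lines = markdown.splitlines()
    ancestors: List[Tuple[int, str]] = []  # stack of (level, heading line)
    for i, line in enumerate(lines):
        match = _HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Headings at the same or a deeper level are no longer parents.
        while ancestors and ancestors[-1][0] >= level:
            ancestors.pop()
        if match.group(2) == title:
            # The section ends at the next heading of the same or higher level.
            end = len(lines)
            for j in range(i + 1, len(lines)):
                nxt = _HEADING.match(lines[j])
                if nxt and len(nxt.group(1)) <= level:
                    end = j
                    break
            chain = [text for _, text in ancestors]
            return "\n".join(chain + lines[i:end]).rstrip() + "\n"
        ancestors.append((level, line))
    return None
```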

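### 5. Regular Expression Search
Searches the document for a keyword or pattern (`-s` option), with a configurable number of context lines (`-n` option, default 2).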
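
A line-based search with surrounding context could be sketched like this. The function name `search` and the result shape are hypothetical; only the default of 2 context lines comes from the description above.

```python
import re
from typing import Dict, List, Union


def search(
    markdown: str, pattern: str, context: int = 2
) -> List[Dict[str, Union[int, List[str]]]]:
    """Return each matching line with `context` lines before and after it."""
    lines = markdown.splitlines()
    regex = re.compile(pattern)
    hits = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)               # clamp at the first line
            end = min(len(lines), i + context + 1)    # clamp at the last line
            hits.append({"line": i + 1, "context": lines[start:end]})
    return hits
```

Overlapping context windows are reported separately per match in this sketch; a real tool might merge them.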

### 6. PDF High-Accuracy OCR Mode
Enables OCR layout analysis for PDF files (`--high-res` option). This suits scanned PDFs and cases where heading levels need to be recognized.