
# Document Preprocessing

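Knowledge base document preprocessing runs OCR and structure parsing on non-text content such as PDFs and images before vectorization, so that the knowledge base can retrieve these materials correctly.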

### Configure the OCR Provider

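Open `Settings → Document Processing` and configure the following in order:

* **System OCR**: works out of the box on macOS (no configuration needed); on Windows, you need to select an OCR engine manually
* **Document processing provider**: defaults to `MinerU`; you can enter an `API Key` and an `API Host` (default `https://mineru.net`). You can also switch to Tesseract / Paddle OCR / OpenVINO / a third-party provider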
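
The settings above can be pictured as a small configuration record. The following is a minimal sketch, not the application's actual data model: the class name `PreprocessProviderConfig`, its field names, and the `problems` check are hypothetical, and treating an empty API key as a problem is an assumption that applies only to hosted providers. The only values taken from this page are the default provider `MinerU` and the default host `https://mineru.net`.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Default API Host named on this page for the MinerU provider.
DEFAULT_MINERU_HOST = "https://mineru.net"


@dataclass
class PreprocessProviderConfig:
    """Hypothetical container for the fields shown on the settings page."""

    provider: str = "MinerU"
    api_key: str = ""
    api_host: str = DEFAULT_MINERU_HOST

    def problems(self) -> list[str]:
        """Return readable problems; an empty list means the config looks usable."""
        issues = []
        # Assumption: a hosted provider needs a non-empty key.
        if not self.api_key.strip():
            issues.append("API Key is empty")
        # The host must parse as an absolute http(s) URL.
        parsed = urlparse(self.api_host)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            issues.append("API Host must be an http(s) URL")
        return issues


if __name__ == "__main__":
    print(PreprocessProviderConfig().problems())
    print(PreprocessProviderConfig(api_key="example-key").problems())
```

A check like this is only a way to catch an empty key or a host typed without its `https://` prefix before uploading files; the application itself may validate these fields differently.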

<figure><img src="/files/EHGqnI2THehsu1rll74R" alt=""><figcaption></figcaption></figure>

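Clicking **Get API KEY** opens the application page in your browser. Click **Apply Now**, fill in the form to obtain an API key, and then paste it into the API Key field.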

<figure><img src="/files/KAJqeiPHp2BL3WEc4Kky" alt=""><figcaption></figcaption></figure>

### Enable Document Preprocessing in a Knowledge Base

<figure><img src="/files/Rnuwui62ZuxVdxyhm59l" alt=""><figcaption></figcaption></figure>

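In the settings of an existing knowledge base, turn on the **Document Preprocessing** switch. Files you add afterwards are automatically processed with the OCR provider configured in the previous step.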
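
Conceptually, the switch decides whether a file passes through OCR before it is vectorized. The sketch below illustrates that decision only; it is not the application's code. The function `prepare_text`, the `ocr` and `read_text` callables, and the `NEEDS_OCR` extension set are all hypothetical names introduced for illustration, and the real list of file types handled by preprocessing may differ.

```python
from pathlib import Path
from typing import Callable

# Assumed set of "non-text" extensions; the real list may differ.
NEEDS_OCR = {".pdf", ".png", ".jpg", ".jpeg"}


def prepare_text(
    path: str,
    preprocessing_enabled: bool,
    ocr: Callable[[str], str],
    read_text: Callable[[str], str],
) -> str:
    """Return the text that would be handed to the vectorization step."""
    suffix = Path(path).suffix.lower()
    # Only PDFs and images go through OCR, and only when the switch is on.
    if preprocessing_enabled and suffix in NEEDS_OCR:
        return ocr(path)
    # Everything else is read as plain text.
    return read_text(path)


if __name__ == "__main__":
    fake_ocr = lambda p: f"ocr:{p}"      # stand-in for the configured provider
    fake_read = lambda p: f"plain:{p}"   # stand-in for a plain text reader
    print(prepare_text("scan.PDF", True, fake_ocr, fake_read))
    print(prepare_text("notes.md", True, fake_ocr, fake_read))
```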

### Upload Documents

<figure><img src="/files/JxVWP6tuSlWe2sdunI7z" alt=""><figcaption></figcaption></figure>

> You can use the search in the upper-right corner to check the knowledge base's retrieval results.

### Use in Conversations

<figure><img src="/files/NTydLleHFIHMfdD2Jha0" alt=""><figcaption></figcaption></figure>

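> Knowledge base tip: when using a **more capable** model, you can switch the knowledge base search mode to intent recognition, which can describe your question more accurately and more broadly.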

### Enable Knowledge Base Intent Recognition

<figure><img src="/files/HWawnI8ODW7MJsLNEhXq" alt=""><figcaption></figcaption></figure>

***

### 💡 Getting Help and Submitting Feedback

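If you run into questions or bugs during configuration or use, or have suggestions for improving a feature, see the official channels listed in [Feedback and Suggestions](/question-contact/suggestions.md).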

