We propose Knowledge-Aware Preprocessing (KAP), a two-stage preprocessing framework tailored to Traditional Chinese non-narrative documents and designed to enhance retrieval accuracy in Hybrid Retrieval systems. Hybrid Retrieval, which integrates Sparse Retrieval (e.g., BM25) and Dense Retrieval (e.g., vector embeddings), has become a widely adopted approach for improving search effectiveness. However, its performance depends heavily on the quality of the input text, which is often degraded for non-narrative documents such as PDFs containing financial statements, contractual clauses, and tables. KAP addresses these challenges by combining Multimodal Large Language Models (MLLMs) with LLM-driven post-OCR processing, refining the extracted text to reduce OCR noise, restore table structures, and standardize text formatting. By ensuring better compatibility with Hybrid Retrieval, KAP improves the accuracy of both Sparse and Dense Retrieval without modifying the retrieval architecture itself.
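To make the two-stage design concrete, the following is a minimal sketch of how such a pipeline could be wired up, assuming generic model wrappers. The names call_mllm, call_llm, the prompts, and the Chunk structure are illustrative placeholders, not the framework's actual implementation.

    from dataclasses import dataclass
    from typing import List


    def call_mllm(image: bytes, prompt: str) -> str:
        # Placeholder: route to whichever multimodal LLM endpoint is available.
        raise NotImplementedError


    def call_llm(prompt: str) -> str:
        # Placeholder: route to whichever text-only LLM endpoint is available.
        raise NotImplementedError


    @dataclass
    class Chunk:
        doc_id: str
        text: str  # cleaned text, ready for both BM25 and embedding indexing


    def stage1_extract(page_image: bytes) -> str:
        """Stage 1: MLLM-based extraction from a scanned PDF page image."""
        prompt = (
            "Transcribe all text on this page, preserving table rows and columns "
            "explicitly and keeping the original Traditional Chinese characters."
        )
        return call_mllm(image=page_image, prompt=prompt)


    def stage2_refine(raw_text: str) -> str:
        """Stage 2: LLM-driven post-OCR processing to reduce noise and restore structure."""
        prompt = (
            "Clean up the following OCR output: correct character-level OCR errors, "
            "rebuild broken tables, and normalize whitespace and numbering so the "
            "text suits both keyword (BM25) and embedding-based retrieval.\n\n"
            + raw_text
        )
        return call_llm(prompt=prompt)


    def preprocess(doc_id: str, page_images: List[bytes]) -> List[Chunk]:
        """Run both stages page by page; the output feeds an unchanged hybrid retriever."""
        return [
            Chunk(doc_id, stage2_refine(stage1_extract(img)))
            for img in page_images
        ]

Because the cleanup happens entirely before indexing, the downstream BM25 index and dense embedding index consume the refined chunks as ordinary text, which is why no change to the retrieval architecture itself is needed.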