Retrieval-Augmented Generation (RAG) systems struggle to process multimodal documents of varying structural complexity. This paper introduces a novel multi-strategy parsing approach that uses LLM-powered OCR to extract content from diverse document types, including presentations and text-dense files, whether scanned or not. The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata. By implementing a Multimodal Assembler Agent and a flexible embedding strategy, the system enhances document comprehension and retrieval capabilities. Experimental evaluations across multiple knowledge bases demonstrate the approach's effectiveness, showing improvements in answer relevancy and information faithfulness.