AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis

Unstructured documents dominate enterprise and web data, but their lack of explicit organization hinders precise information retrieval. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, which incurs high computational cost and requires frequent LLM calls for post-processing. To address this issue, we propose AnnoRetrieve, a retrieval paradigm that shifts from embeddings to structured annotations and enables precise, annotation-driven semantic retrieval. The system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, substantially reducing LLM usage and overall cost. It integrates two complementary components. SchemaBoot automatically generates document annotation schemas via multi-granularity pattern discovery and constraint-based optimization, which eliminates manual schema design and lays the foundation for annotation-driven retrieval. Structured Semantic Retrieval (SSR), the core retrieval engine, unifies semantic understanding with structured query execution: by operating on the annotated structure rather than on vector embeddings, SSR achieves precise semantic matching and performs attribute-value extraction, table generation, and progressive SQL-based reasoning without LLM intervention. This annotation-driven paradigm addresses the coarse-grained matching and heavy LLM dependency of traditional vector-based methods, as well as the high computational overhead of graph-based methods. Experiments on three real-world datasets confirm that AnnoRetrieve significantly lowers LLM call frequency and retrieval cost while maintaining high accuracy, establishing a cost-effective, precise, and scalable approach to document analysis through structuring.
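
To make the idea of querying annotations instead of embeddings concrete, the following is a minimal sketch of how attribute-value annotations might be pivoted into a table and queried with SQL. It is not the AnnoRetrieve implementation: the `invoices` schema, the sample annotations, and the helper names `build_table` and `query_vendor_total` are all hypothetical, and the schema is written by hand here, whereas the abstract describes SchemaBoot inducing it automatically.

```python
import sqlite3

# Hypothetical annotations as (doc_id, attribute, value) triples. In the
# described system these would come from automatic annotation; here they
# are hand-written sample data.
ANNOTATIONS = [
    ("doc1", "vendor", "Acme"),
    ("doc1", "total", "1200"),
    ("doc2", "vendor", "Globex"),
    ("doc2", "total", "450"),
    ("doc3", "vendor", "Acme"),
    ("doc3", "total", "300"),
]


def build_table(rows):
    """Pivot attribute-value annotations into one relational row per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (doc_id TEXT PRIMARY KEY, vendor TEXT, total REAL)"
    )
    docs = {}
    for doc_id, attr, value in rows:
        docs.setdefault(doc_id, {})[attr] = value
    for doc_id, attrs in docs.items():
        total = float(attrs["total"]) if "total" in attrs else None
        conn.execute(
            "INSERT INTO invoices VALUES (?, ?, ?)",
            (doc_id, attrs.get("vendor"), total),
        )
    return conn


def query_vendor_total(conn, vendor):
    """Answer an aggregate question with plain SQL: no embeddings, no LLM call."""
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(total), 0) FROM invoices WHERE vendor = ?",
        (vendor,),
    ).fetchone()
    return row[0], row[1]


conn = build_table(ANNOTATIONS)
count, total = query_vendor_total(conn, "Acme")
print(count, total)  # 2 1500.0
```

Once documents are reduced to rows, a question such as "how much was invoiced by Acme?" becomes an exact filter and aggregate rather than a nearest-neighbor search followed by LLM post-processing, which is the cost argument the abstract makes.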
