In enterprise settings, efficiently retrieving relevant information from large, complex knowledge bases is essential for operational productivity and informed decision-making. This research presents a systematic framework that uses large language models (LLMs) to enrich document metadata and thereby improve retrieval in Retrieval-Augmented Generation (RAG) systems. Our approach employs a structured pipeline that dynamically generates metadata for document segments, improving their semantic representations and retrieval accuracy. We compare three chunking strategies (semantic, recursive, and naive) and evaluate their effectiveness when combined with advanced embedding techniques. Metadata-enriched approaches consistently outperform content-only baselines: recursive chunking paired with TF-IDF weighted embeddings yields 82.5% precision, compared to 73.3% for the semantic content-only approach, and naive chunking with prefix-fusion achieves the highest Hit Rate@10, 0.925. Our evaluation uses cross-encoder reranking to generate ground truth, enabling rigorous assessment via Hit Rate and Metadata Consistency metrics. These findings indicate that metadata enrichment improves vector clustering quality while reducing retrieval latency, making it a key optimization for RAG systems across knowledge domains. This work offers practical insights for deploying high-performance, scalable document retrieval solutions in enterprise settings.
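To make the headline metric concrete, the following is a minimal sketch of how Hit Rate@k might be computed. It assumes the common definition (the fraction of queries for which at least one ground-truth chunk appears among the top k retrieved results); the abstract does not spell out the exact formula, so treat this as an illustration rather than the authors' implementation. The function name `hit_rate_at_k`, the dictionary layout, and the toy IDs are all hypothetical.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k.

    ranked:   query id -> chunk ids ordered by retrieval score (best first)
    relevant: query id -> set of ground-truth chunk ids (assumed here to
              come from a reranking step, as the abstract describes)
    """
    if not ranked:
        return 0.0
    hits = 0
    for query_id, chunk_ids in ranked.items():
        truth = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k chunks is relevant.
        if any(chunk_id in truth for chunk_id in chunk_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data with hypothetical IDs; these are not results from the paper.
ranked = {
    "q1": ["c3", "c7", "c1"],
    "q2": ["c2", "c9", "c4"],
    "q3": ["c5", "c6", "c8"],
    "q4": ["c0", "c4", "c2"],
}
relevant = {"q1": {"c1"}, "q2": {"c4"}, "q3": {"c0"}, "q4": {"c0"}}

print(hit_rate_at_k(ranked, relevant, k=10))  # 0.75
print(hit_rate_at_k(ranked, relevant, k=2))   # 0.25
```

Under this definition, a Hit Rate@10 of 0.925 would mean that for 92.5% of evaluation queries at least one ground-truth chunk was retrieved within the first ten results.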