Retrieval-Augmented Generation (RAG) is widely used to ground responses to queries in domain-specific documents. But do RAG implementations leave out important information or include too much irrelevant information? To address these concerns, it is necessary to annotate domain-specific benchmarks for evaluating information retrieval (IR) performance, because relevance definitions vary across queries and domains. Furthermore, such benchmarks should be annotated cost-efficiently to avoid annotation selection bias. In this paper, we propose DIRAS (Domain-specific Information Retrieval Annotation with Scalability), a manual-annotation-free schema that fine-tunes open-source LLMs to annotate relevance labels with calibrated relevance probabilities. Extensive evaluation shows that DIRAS fine-tuned models achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and are helpful for real-world RAG development.