We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when the latter are unavailable. We assess the reliability of synthetic benchmarks in two experiments: one varying retriever parameters while keeping the generator fixed, and one varying the generator while keeping retriever parameters fixed. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank RAG systems that differ in retriever configuration, aligning well with rankings from human-labeled benchmarks. However, they do not consistently produce reliable rankings when the systems differ in generator architecture. This breakdown likely stems from a combination of task mismatch between the synthetic and human-labeled benchmarks and stylistic bias favoring certain generators.