Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which rely on predefined domains and human-labeled data, struggle to meet the evaluation needs of emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated for diverse tasks, domains, and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are continually augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline that automatically creates diverse, high-quality evaluation datasets from real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
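To make the idea of LLM-generated test data concrete, the sketch below shows one minimal way such a generation step could look: sample documents from a real-world corpus, prompt an LLM to write a query that each document answers, and keep the resulting (query, positive document) pairs as labeled test instances. This is an illustrative sketch, not the authors' actual pipeline; the prompt wording, the `gpt-4o-mini` model choice, and the toy `corpus` are assumptions, and a real pipeline would add quality filtering and hard-negative mining on top.

```python
# Minimal sketch of an LLM-based test-data generation step (not the
# authors' actual AIR-Bench pipeline). Assumes the openai>=1.0 Python SDK
# and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical corpus; in practice, documents would be sampled from a
# real-world collection for the target domain and language.
corpus = [
    "The FDA approved the first RSV vaccine for adults over 60 in May 2023.",
    "Rust's borrow checker enforces memory safety without a garbage collector.",
]

def generate_query(document: str) -> str:
    """Ask the LLM to write one search query that this document answers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You write realistic search-engine queries."},
            {"role": "user",
             "content": "Write one search query that the following passage "
                        f"answers. Return only the query.\n\n{document}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Each (query, document) pair becomes a labeled relevance judgment.
test_pairs = [(generate_query(doc), doc) for doc in corpus]
for query, doc in test_pairs:
    print(f"Q: {query}\n  -> positive: {doc[:60]}...")
```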