In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs. Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation. HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.