We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO). BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating large language model (LLM)-based information retrieval systems. We also introduce a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model that matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.