We present BIRCO, a Benchmark of Information Retrieval (IR) tasks with Complex Objectives. BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating large language model (LLM)-based information retrieval systems. We introduce a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model that matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.