We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval-augmented generation, a popular application of large language models. The benchmark comprises 666 tasks containing over 2,800 conversation turns across 6 domains, with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark