The rise of large language models (LLMs) has made it common to pose inherently debatable questions to LLM chatbots, creating a need for a reliable way to evaluate how well they answer such questions. However, traditional QA benchmarks, which assume a single fixed answer, are inadequate for this purpose. To address this, we introduce DebateQA, a dataset of 2,941 debatable questions, each accompanied by multiple human-annotated partial answers that capture a variety of perspectives. We develop two metrics: Perspective Diversity, which evaluates the comprehensiveness of the perspectives covered, and Dispute Awareness, which assesses whether the LLM acknowledges a question's debatable nature. Experiments demonstrate that both metrics align with human preferences and are stable across different underlying models. Using DebateQA with these two metrics, we assess 12 popular LLMs and retrieval-augmented generation methods. Our findings reveal that while LLMs generally excel at recognizing debatable issues, their ability to provide comprehensive answers encompassing diverse perspectives varies considerably.
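To make the evaluation setup concrete, below is a minimal sketch of how a DebateQA-style item and scoring loop might look. The item schema follows the abstract (a debatable question with multiple human-annotated partial answers), but the two scoring functions are keyword- and overlap-based placeholders: the abstract does not specify how Perspective Diversity and Dispute Awareness are actually computed, so everything inside them is an assumption, not the paper's method.

```python
# A sketch of evaluating a chatbot on DebateQA-style items.
# The scoring functions are illustrative stand-ins, NOT the paper's metrics.

from dataclasses import dataclass


@dataclass
class DebateQAItem:
    question: str               # an inherently debatable question
    partial_answers: list[str]  # human-annotated perspectives


def perspective_diversity(response: str, partial_answers: list[str]) -> float:
    """Placeholder proxy for Perspective Diversity: the fraction of annotated
    partial answers whose words substantially overlap the response.
    The paper's metric is model-based; this overlap heuristic is an assumption."""
    covered = 0
    resp_words = set(response.lower().split())
    for answer in partial_answers:
        words = set(answer.lower().split())
        # Arbitrary 30% overlap threshold, chosen only for illustration.
        if words and len(words & resp_words) / len(words) >= 0.3:
            covered += 1
    return covered / max(len(partial_answers), 1)


def dispute_awareness(response: str) -> bool:
    """Placeholder proxy for Dispute Awareness: a keyword check for whether
    the response flags the question as contested. Again an assumption; the
    paper evaluates this with a model, not string matching."""
    cues = ("debated", "controversial", "depends", "perspectives", "no consensus")
    return any(cue in response.lower() for cue in cues)


def evaluate(items: list[DebateQAItem], answer_fn) -> dict:
    """Score a chatbot (answer_fn: question -> response) on both metrics."""
    pd_scores, da_flags = [], []
    for item in items:
        response = answer_fn(item.question)
        pd_scores.append(perspective_diversity(response, item.partial_answers))
        da_flags.append(dispute_awareness(response))
    return {
        "perspective_diversity": sum(pd_scores) / len(pd_scores),
        "dispute_awareness": sum(da_flags) / len(da_flags),
    }


if __name__ == "__main__":
    demo = [DebateQAItem(
        question="Is nuclear power the best path to decarbonization?",
        partial_answers=[
            "Proponents argue nuclear offers reliable low-carbon baseload power.",
            "Critics point to cost overruns and unresolved waste storage.",
        ],
    )]
    # A fixed mock response standing in for a real chatbot call.
    mock_bot = lambda q: ("This is a debated question: proponents cite reliable "
                          "low-carbon power, while critics cite cost and waste concerns.")
    print(evaluate(demo, mock_bot))
```

The structure mirrors the abstract's two-axis design: one score rewards covering more of the annotated perspectives, while the other is a binary check that the response acknowledges the dispute at all, which is why a model can do well on the second while varying considerably on the first.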