Resolving knowledge conflicts is a crucial challenge in Question Answering (QA) tasks, as the internet contains numerous conflicting facts and opinions. While some research has made progress in tackling ambiguous settings where multiple valid answers exist, these approaches often neglect to provide source citations, leaving users to evaluate the factuality of each answer on their own. Conversely, existing work on citation generation has focused on unambiguous settings with single answers, failing to address the complexity of real-world scenarios. Despite the importance of both aspects, no prior research has combined them, leaving a significant gap in the development of QA systems. In this work, we bridge this gap by proposing the novel task of QA with source citation in ambiguous settings, where multiple valid answers exist. To facilitate research in this area, we create a comprehensive framework consisting of: (1) five novel datasets, obtained by augmenting three existing reading comprehension datasets with citation meta-data across various ambiguous settings, such as distractors and paraphrasing; (2) the first ambiguous multi-hop QA dataset featuring real-world, naturally occurring contexts; (3) two new metrics to evaluate model performance; and (4) several strong baselines using rule-based, prompting, and finetuning approaches over five large language models. We hope that this new task, along with its datasets, metrics, and baselines, will inspire the community to push the boundaries of QA research and develop more trustworthy and interpretable systems.