Sensitivity to false assumptions (or false premises) in information-seeking questions is critical for robust question-answering (QA) systems. Recent work has shown that false assumptions in naturally occurring questions pose challenges to current models, with low performance on both generative QA and simple detection tasks (Kim et al. 2023). However, the focus of existing work on naturally occurring questions leaves a gap in the analysis of model behavior on the long tail of the distribution of possible questions. To address this gap, we introduce Syn-(QA)$^2$, a set of two synthetically generated QA datasets: one generated using perturbed relations from Wikidata, and the other by perturbing HotpotQA (Yang et al. 2018). Our findings from evaluating a range of large language models are threefold: (1) false assumptions in QA are challenging, echoing the findings of prior work; (2) the binary detection task is challenging even compared to the difficulty of generative QA itself, possibly due to the linguistic structure of the problem; and (3) the detection task is more challenging with long-tail questions than with naturally occurring questions, highlighting the utility of our synthetic datasets and generation method.