The ability to derive useful information by asking clarifying questions (ACQ) is an important element of real-life collaboration on reasoning tasks, such as question answering (QA). Existing natural language ACQ challenges, however, evaluate generated questions based on word overlap rather than on the value of the information itself. Word overlap is often an inappropriate metric for question generation, since many different questions could be useful in a given situation and a single question can be phrased in many different ways. Instead, we propose evaluating questions pragmatically, based on the value of the information they retrieve. Here we present a definition and framework for natural language pragmatic asking of clarifying questions (PACQ), the problem of generating questions that result in answers useful for a reasoning task. We also present fact-level masking (FLM), a procedure for converting natural language datasets into self-supervised PACQ datasets by omitting particular critical facts. Finally, we generate a PACQ dataset from the HotpotQA dataset using FLM and evaluate several zero-shot language models on it. Our experiments show that current zero-shot models struggle to ask questions that retrieve useful information, compared with human annotators. These results demonstrate an opportunity to use FLM datasets and the PACQ framework to objectively evaluate and improve question generation models and other language models.
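The idea behind FLM and pragmatic evaluation can be illustrated with a minimal sketch. It assumes each QA example carries a list of context facts plus the indices of the facts needed to answer, and it scores a clarifying question by whether the information it retrieves changes downstream QA correctness. The abstract does not specify the exact metric, so this scoring rule, the data classes, and the toy oracle and QA model are all hypothetical stand-ins for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAExample:
    question: str
    facts: List[str]        # context facts
    supporting: List[int]   # indices of facts needed to answer
    answer: str


@dataclass
class PACQInstance:
    question: str
    context: List[str]      # facts with one supporting fact removed
    masked_fact: str        # the omitted critical fact
    answer: str


def fact_level_mask(example: QAExample, mask_position: int = 0) -> PACQInstance:
    """Omit one supporting fact to create an instance with missing information."""
    idx = example.supporting[mask_position]
    context = [f for i, f in enumerate(example.facts) if i != idx]
    return PACQInstance(example.question, context, example.facts[idx], example.answer)


def pragmatic_score(
    instance: PACQInstance,
    clarifying_question: str,
    oracle: Callable[[str, PACQInstance], str],
    qa_model: Callable[[str, List[str]], str],
) -> float:
    """Assumed metric: change in QA correctness after adding the oracle's reply."""
    reply = oracle(clarifying_question, instance)
    before = qa_model(instance.question, instance.context) == instance.answer
    enriched = instance.context + ([reply] if reply else [])
    after = qa_model(instance.question, enriched) == instance.answer
    return float(after) - float(before)


# --- Toy stand-ins (hypothetical) for a real oracle and a real QA model ---
STOP_WORDS = {"was", "the", "in", "a", "of", "is", "by"}


def _content_words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def toy_oracle(clarifying_question: str, instance: PACQInstance) -> str:
    """Reveal the masked fact only if the question shares two or more content words with it."""
    shared = _content_words(clarifying_question) & _content_words(instance.masked_fact)
    return instance.masked_fact if len(shared) >= 2 else ""


def toy_qa(question: str, context: List[str]) -> str:
    """Answer with the birthplace if some fact states one, else 'unknown'."""
    for fact in context:
        if " was born in " in fact:
            return fact.rstrip(".").split()[-1]
    return "unknown"


example = QAExample(
    question="In which country was the author of Book X born?",
    facts=["Book X was written by Ann Lee.", "Ann Lee was born in Canada."],
    supporting=[0, 1],
    answer="Canada",
)
instance = fact_level_mask(example, mask_position=1)

useful = pragmatic_score(instance, "Where was Ann Lee born?", toy_oracle, toy_qa)
useless = pragmatic_score(instance, "Who published Book X?", toy_oracle, toy_qa)
```

Under these assumptions, a question is rewarded for what its answer enables rather than for its wording: any phrasing that recovers the masked fact receives the same score, which is the contrast with word-overlap metrics that the abstract draws.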