Generative AI models are prone to hallucinations, which can undermine users' trust in such systems. We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are first identified and then summarized into a final system response. This separation lets us automatically assess whether the answer to the user's question is present in the corpus. Specifically, our proposed method employs a sentence-level classifier to detect whether the answer is present, then aggregates these predictions on the passage level, and finally across the top-ranked passages to arrive at a final answerability estimate. For training and evaluation, we develop a dataset based on the TREC CAsT benchmark that includes answerability labels on the sentence, passage, and ranking levels. We demonstrate that our proposed method represents a strong baseline and outperforms a state-of-the-art LLM on the answerability prediction task.
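The sentence-to-passage-to-ranking aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation functions or decision threshold are used, so the choices here (max over sentences, mean over passages, a 0.5 cutoff) and all function names are assumptions made for illustration only. The sentence-level probabilities are taken as given inputs standing in for the output of the classifier.

```python
from statistics import mean
from typing import Callable, Sequence

# An aggregator maps a sequence of scores to a single score (e.g. max or mean).
Aggregator = Callable[[Sequence[float]], float]


def passage_score(sentence_probs: Sequence[float], agg: Aggregator = max) -> float:
    """Aggregate sentence-level answerability probabilities into one passage score.

    Using max by default is an assumption: a passage is treated as containing
    the answer if at least one of its sentences does.
    """
    if not sentence_probs:
        return 0.0  # a passage with no sentences contributes no evidence
    return agg(sentence_probs)


def ranking_score(
    passages: Sequence[Sequence[float]],
    sentence_agg: Aggregator = max,
    passage_agg: Aggregator = mean,
) -> float:
    """Aggregate passage scores across the top-ranked passages.

    `passages` holds one list of sentence probabilities per retrieved passage.
    Using mean across passages is likewise an assumption.
    """
    if not passages:
        return 0.0
    return passage_agg([passage_score(p, sentence_agg) for p in passages])


def is_answerable(
    passages: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
    sentence_agg: Aggregator = max,
    passage_agg: Aggregator = mean,
) -> bool:
    """Turn the ranking-level score into a binary answerability prediction."""
    return ranking_score(passages, sentence_agg, passage_agg) >= threshold


if __name__ == "__main__":
    # Hypothetical classifier outputs for three top-ranked passages.
    top_passages = [[0.1, 0.9, 0.2], [0.05, 0.1], [0.3, 0.4]]
    print(ranking_score(top_passages))
    print(is_answerable(top_passages))
    print(is_answerable(top_passages, passage_agg=max))
```

Keeping the two aggregators as parameters makes it easy to compare variants. For instance, a max over passages predicts "answerable" as soon as a single passage looks promising, whereas a mean requires support spread across the ranking.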