As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that these models provide factually accurate information that is robust to any presuppositions a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presupposition. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, those from ChatGPT with 26%, and those from BingChat with 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.