Evaluating Large Language Models for Health-related Queries with Presuppositions

As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that these models provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT with 26%, and BingChat with 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
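
The agreement percentages reported above are rates computed per group of queries. As a rough illustration only, the sketch below shows how such rates could be tallied by presupposition level and claim veracity. The record layout, the integer level encoding, the stance labels, and the function name `agreement_rates` are all assumptions made for this example; they are not taken from UPHILL or from the authors' evaluation code, and the toy records are invented.

```python
from collections import defaultdict


def agreement_rates(records):
    """Return the fraction of responses that agree with the claim,
    grouped by (presupposition_level, claim_is_true).

    Each record is a hypothetical (level, claim_is_true, stance) tuple,
    where stance is one of "agree", "disagree", or "neutral".
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [agree_count, total_count]
    for level, claim_is_true, stance in records:
        key = (level, claim_is_true)
        counts[key][1] += 1
        if stance == "agree":
            counts[key][0] += 1
    return {key: agree / total for key, (agree, total) in counts.items()}


# Invented toy records, for illustration only (not data from the paper).
records = [
    (0, False, "disagree"),
    (0, False, "agree"),
    (0, True, "agree"),
    (2, False, "agree"),
    (2, False, "agree"),
    (2, True, "agree"),
]

rates = agreement_rates(records)
for (level, claim_is_true), rate in sorted(rates.items()):
    print(f"level={level} true_claim={claim_is_true} agree_rate={rate:.2f}")
```

Comparing the rate for false claims across levels is one simple way to see whether stronger presuppositions pull responses toward agreement.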

