Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open-ended ecosystem of APIs, open-source models, and plugins. Despite this widespread deployment, little research thoroughly discusses and analyzes the risks concealed in these systems. We therefore conduct a preliminary but pioneering study covering the robustness, consistency, and credibility of LLM systems. Because much of the related literature in the LLM era remains uncharted, we propose an automated workflow that copes with a large number of queries and responses. Overall, we issue over a million queries to mainstream LLMs, including ChatGPT, LLaMA, and OPT. The core of our workflow consists of a data primitive followed by an automated interpreter that evaluates these LLMs under different adversarial metrical systems. As a result, we draw several, perhaps unfortunate, conclusions that are quite uncommon in this trendy community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs show poor consistency when processing semantically similar query inputs. In addition, as a side finding, we find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level. While this phenomenon demonstrates the powerful memorization of LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic development. To deal with this, we propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation. Extensive empirical studies are included to support the aforementioned claims.
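The abstract does not spell out how queries are perturbed or how consistency is scored, so the following is only a minimal sketch of the general idea, not the authors' workflow. It assumes character-level substitution as the noise model and majority agreement as the consistency score; `perturb`, `consistency`, `robustness_probe`, and `toy_model` are all hypothetical names introduced for illustration.

```python
import random
from collections import Counter
from typing import Callable, List

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def perturb(text: str, rate: float, rng: random.Random) -> str:
    """Replace each alphabetic character with a random lowercase letter
    with probability `rate` (an assumed, simple typo-style noise model)."""
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(LETTERS))
        else:
            out.append(ch)
    return "".join(out)


def consistency(answers: List[str]) -> float:
    """Fraction of answers that match the most common answer,
    after trimming whitespace and lowercasing."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def robustness_probe(
    model: Callable[[str], str],
    query: str,
    rate: float,
    trials: int,
    seed: int = 0,
) -> float:
    """Send `trials` noisy copies of `query` to `model` and score
    how consistently it answers."""
    rng = random.Random(seed)
    noisy_queries = [perturb(query, rate, rng) for _ in range(trials)]
    return consistency([model(q) for q in noisy_queries])


def toy_model(query: str) -> str:
    """Hypothetical stand-in for an LLM call: answers by keyword match."""
    return "Paris" if "capital" in query else "unknown"


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.5):
        score = robustness_probe(
            toy_model, "What is the capital of France?", rate, trials=20
        )
        print(f"noise rate {rate:.1f}: consistency {score:.2f}")
```

In practice `toy_model` would be replaced by a call to a real model, and the noise rate swept from mild to extreme. A model that keeps answering correctly at very high noise rates is exhibiting the memorization behavior the abstract raises as a concern for evaluation.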

