Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open-ended ecosystem of APIs, open-source models, and plugins. Despite this widespread deployment, little research has thoroughly discussed and analyzed the risks these models may conceal. We therefore conduct a preliminary but pioneering study covering the robustness, consistency, and credibility of LLM systems. Because most of the related literature in the LLM era remains uncharted, we propose an automated workflow that can handle a large number of queries and responses. Overall, we issue over a million queries to mainstream LLMs, including ChatGPT, LLaMA, and OPT. At the core of our workflow is a data primitive, followed by an automated interpreter that evaluates these LLMs under different adversarial metrical systems. As a result, we draw several, perhaps unfortunate, conclusions that are rarely voiced in this fast-moving community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs show poor consistency when processing semantically similar query inputs. In addition, as a side finding, we observe that ChatGPT is still capable of yielding the correct answer even when the input is polluted to an extreme level. While this phenomenon demonstrates the powerful memorization of LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic development. To address this, we propose a novel index, associated with a dataset, that roughly indicates whether the dataset is feasible to use for LLM-involved evaluation. Extensive empirical studies support the aforementioned claims.
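To make findings (i) and (ii) concrete, the following is a minimal Python sketch of how robustness to typos and consistency across paraphrases might be measured. It is not the authors' workflow, which the abstract does not specify in detail. The names `inject_typos`, `robustness_rate`, `consistency_rate`, and `toy_model` are all hypothetical, and `toy_model` is a trivial stand-in for a real LLM call.

```python
import random
import string


def inject_typos(text, rate, rng):
    """Replace each alphabetic character, with probability `rate`, by a
    random lowercase letter (which may happen to equal the original)."""
    chars = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            chars.append(rng.choice(string.ascii_lowercase))
        else:
            chars.append(ch)
    return "".join(chars)


def robustness_rate(model, query, expected, rate, trials, seed=0):
    """Fraction of perturbed copies of `query` that still yield `expected`."""
    rng = random.Random(seed)
    hits = sum(
        model(inject_typos(query, rate, rng)) == expected
        for _ in range(trials)
    )
    return hits / trials


def consistency_rate(model, paraphrases):
    """Fraction of paraphrases whose answer matches the most common answer."""
    answers = [model(q) for q in paraphrases]
    most_common = max(set(answers), key=answers.count)
    return answers.count(most_common) / len(answers)


def toy_model(query):
    # Hypothetical stand-in for an LLM: answers by exact phrase matching.
    return "Paris" if "capital of france" in query.lower() else "unknown"


if __name__ == "__main__":
    question = "What is the capital of France?"
    for rate in (0.0, 0.05, 0.2):
        score = robustness_rate(toy_model, question, "Paris", rate, trials=200)
        print(f"typo rate {rate:.2f}: robustness {score:.2f}")

    paraphrases = [
        "What is the capital of France?",
        "Name the capital of France.",
        "France's capital city is?",
    ]
    print("consistency:", consistency_rate(toy_model, paraphrases))
```

In a real study, `toy_model` would be replaced by a call to the model under test, and exact string equality would likely give way to a more tolerant answer-matching rule.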

