The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open ecosystem of APIs, open-source models, and plugins. However, despite their widespread deployment, little research thoroughly discusses and analyzes the potential risks they conceal. We therefore conduct a preliminary but pioneering study covering the robustness, consistency, and credibility of LLM systems. Because most of the related literature in the LLM era remains uncharted, we propose an automated workflow that copes with a large number of queries and responses. Overall, we issue over a million queries to mainstream LLMs including ChatGPT, LLaMA, and OPT. The core of our workflow consists of a data primitive, followed by an automated interpreter that evaluates these LLMs under different adversarial metrical systems. As a result, we draw several, perhaps unfortunate, conclusions that are quite uncommon in this trendy community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs show poor consistency when processing semantically similar query input. In addition, as a side finding, we observe that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level. While this phenomenon demonstrates the powerful memorization of LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic development. To address this, we propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation. Extensive empirical studies are provided to support the aforementioned claims.
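To make findings (i) and (ii) concrete, the following is a minimal Python sketch of how character-level noise might be injected into a query and how answer consistency across query variants might be scored. It is an illustration under stated assumptions, not the workflow, data primitive, or metrics used in the study: the names `perturb`, `consistency`, and `toy_model` are hypothetical, and `toy_model` is a keyword-matching stand-in for a real LLM call.

```python
import random
from typing import Callable, List


def perturb(query: str, rate: float, rng: random.Random) -> str:
    """Replace roughly `rate` of the alphabetic characters with random lowercase letters.

    Hypothetical noise model: simulates typo-like errors in user input.
    """
    chars = list(query)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def consistency(model: Callable[[str], str], variants: List[str]) -> float:
    """Fraction of variants whose normalized answer equals the most common answer."""
    answers = [model(v).strip().lower() for v in variants]
    most_common = max(set(answers), key=answers.count)
    return answers.count(most_common) / len(answers)


def toy_model(query: str) -> str:
    """Hypothetical stand-in for an LLM: answers only if the keyword survives the noise."""
    return "paris" if "france" in query.lower() else "unknown"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    base = "What is the capital of France?"
    noisy = [perturb(base, 0.1, rng) for _ in range(5)]
    for q in noisy:
        print(q, "->", toy_model(q))
    print("consistency:", consistency(toy_model, noisy))
```

In a real evaluation, `toy_model` would be replaced by a call to the model under test, and the variants for the consistency check would be semantically similar rephrasings, not random character noise.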