The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open ecosystem of APIs, open-source models, and plugins. Despite this widespread deployment, little research thoroughly discusses and analyzes the potential risks these models conceal. To that end, we conduct a preliminary but pioneering study of the robustness, consistency, and credibility of LLM systems. With much of this territory still uncharted in the LLM era, we propose an automated workflow that copes with a large volume of queries and responses. Overall, we issue over a million queries to mainstream LLMs, including ChatGPT, LLaMA, and OPT. The core of our workflow consists of a data primitive, followed by an automated interpreter that evaluates these LLMs under different adversarial metric systems. As a result, we draw several, perhaps unfortunate, conclusions that are rarely voiced in this fast-growing community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs show poor consistency when processing semantically similar query inputs. In addition, as a side finding, we observe that ChatGPT is still capable of yielding the correct answer even when the input is polluted to an extreme degree. While this phenomenon demonstrates the powerful memorization of LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic work. To address this, we propose a novel index associated with a dataset that roughly indicates whether such data is feasible to use for LLM-involved evaluation. Extensive empirical studies support these claims.
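To make findings (i) and (ii) concrete, the following is a minimal Python sketch of the two kinds of checks the abstract describes: injecting small typos into a query, and measuring how often a model gives the same answer to semantically similar queries. It is an illustration under stated assumptions, not the authors' workflow or metrics. The names `perturb_query`, `consistency_rate`, and `evaluate` are hypothetical, the typo model (random letter substitution) is an assumed stand-in for "minor errors", and majority agreement is only one possible way to score consistency.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def perturb_query(query: str, error_rate: float, rng: random.Random) -> str:
    """Inject character-level typos (assumed noise model, not the paper's).

    Each alphabetic character is, with probability `error_rate`, replaced by
    a random lowercase letter. Non-letters and the length are left unchanged.
    """
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    chars = []
    for ch in query:
        if ch.isalpha() and rng.random() < error_rate:
            chars.append(rng.choice(alphabet))
        else:
            chars.append(ch)
    return "".join(chars)


def consistency_rate(answers: Sequence[str]) -> float:
    """Fraction of answers that agree with the most common answer.

    Answers are compared after stripping whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


def evaluate(model: Callable[[str], str], queries: Sequence[str]) -> float:
    """Query `model` (any callable from prompt to answer) and score agreement."""
    return consistency_rate([model(q) for q in queries])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbation is reproducible
    print(perturb_query("What is the capital of France?", 0.1, rng))

    # Hypothetical stand-in for an LLM call; a real study would query an API.
    def toy_model(prompt: str) -> str:
        return "4" if "2+2" in prompt else "unknown"

    paraphrases = ["What is 2+2?", "Compute 2+2", "two plus two?"]
    print(evaluate(toy_model, paraphrases))
```

In this sketch, `model` is any callable that maps a prompt to an answer string, so the same scoring code could wrap a hosted API or a local model. Exact-match comparison after normalization is a deliberately crude choice; free-form answers would need a more forgiving comparison.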