Auditing Large Language Models (LLMs) is a crucial yet challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, with access only to the provided service. We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent, uncovering potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names in the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
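To illustrate the intrinsically motivated reinforcement learning idea behind CALM, the sketch below shows a count-based curiosity bonus over black-box responses. This is a minimal, hypothetical example, not the paper's actual reward design: `target_llm` is a toy stand-in for the audited service, and the auditor's reward grows when it elicits responses it has rarely observed, encouraging exploration of the sparse feasible region.

```python
import hashlib
from collections import Counter

def target_llm(prompt: str) -> str:
    """Toy stand-in for the black-box target LLM: we only see input-output behavior."""
    return prompt[::-1]

class CuriosityBonus:
    """Count-based novelty bonus: the n-th observation of a response pattern
    earns 1/sqrt(n), so novel target-LLM behaviors yield higher intrinsic reward."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, response: str) -> float:
        # Hash the response so the visitation table stays compact.
        key = hashlib.sha256(response.encode()).hexdigest()[:16]
        self.counts[key] += 1
        return 1.0 / self.counts[key] ** 0.5

bonus = CuriosityBonus()
r1 = bonus(target_llm("hello"))  # novel response -> full bonus 1.0
r2 = bonus(target_llm("hello"))  # repeated response -> decayed bonus 1/sqrt(2)
r3 = bonus(target_llm("world"))  # different novel response -> full bonus again
```

In a full auditor, such an intrinsic bonus would be combined with an extrinsic signal (e.g., a toxicity or target-name detector on the response) to steer the RL-finetuned auditor LLM toward harmful input-output pairs while still exploring broadly.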