Auditing Large Language Models (LLMs) is a crucial yet challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, with access only to the provided service. We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent, uncovering potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names in the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
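To illustrate the intrinsically motivated reinforcement learning idea behind CALM, the sketch below shows a count-based curiosity bonus over black-box responses. This is a minimal, hypothetical example, not the paper's actual reward design: `target_llm` is a toy stand-in for the audited service, and the auditor's reward grows when it elicits responses it has rarely observed, encouraging exploration of the sparse feasible region.

```python
import hashlib
from collections import Counter

def target_llm(prompt: str) -> str:
    """Toy stand-in for the black-box target LLM: we only see input-output behavior."""
    return prompt[::-1]

class CuriosityBonus:
    """Count-based novelty bonus: the n-th observation of a response pattern
    earns 1/sqrt(n), so novel target-LLM behaviors yield higher intrinsic reward."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, response: str) -> float:
        # Hash the response so the visitation table stays compact.
        key = hashlib.sha256(response.encode()).hexdigest()[:16]
        self.counts[key] += 1
        return 1.0 / self.counts[key] ** 0.5

bonus = CuriosityBonus()
r1 = bonus(target_llm("hello"))  # novel response -> full bonus 1.0
r2 = bonus(target_llm("hello"))  # repeated response -> decayed bonus 1/sqrt(2)
r3 = bonus(target_llm("world"))  # different novel response -> full bonus again
```

In a full auditor, such an intrinsic bonus would be combined with an extrinsic signal (e.g., a toxicity or target-name detector on the response) to steer the RL-finetuned auditor LLM toward harmful input-output pairs while still exploring broadly.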