Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) rely primarily on black-box evaluations that assess model behavior only through input-output testing. These methods are limited to tests constructed in the input space, often generated heuristically. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text-based inputs alone. To address these limitations, we propose a white-box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method performs internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high-stakes LLM decision tasks. Our method consistently reveals substantial dependence on protected attributes in model predictions, even in settings where standard black-box evaluations suggest little or no bias. Our code is openly available at https://github.com/hannahxchen/llm-steering-audit
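The core mechanism the abstract refers to, activation steering, can be illustrated with a minimal toy sketch. This is not the paper's implementation: it assumes a common variant in which a concept direction is estimated as the difference of mean hidden activations between prompts that do and do not express the concept (all arrays here are synthetic stand-ins for one layer's activations), and that direction is then added to a hidden state at inference time to manipulate the concept.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden-state dimension

# Synthetic "activations" at one layer for prompts with / without a concept
# (e.g., a protected attribute). In a real audit these come from the model.
acts_with = rng.normal(0.5, 1.0, size=(16, d))
acts_without = rng.normal(-0.5, 1.0, size=(16, d))

# Difference-in-means steering vector, normalized to unit length.
steer = acts_with.mean(axis=0) - acts_without.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction by strength alpha."""
    return hidden + alpha * steer

h = rng.normal(size=d)          # a hidden state during some decision task
h_steered = apply_steering(h, alpha=3.0)

# Because steer is unit-norm, the projection onto the concept direction
# moves by exactly alpha; everything orthogonal to it is unchanged.
print(float((h_steered - h) @ steer))
```

A sensitivity audit in this spirit would then compare the model's task predictions before and after such an intervention: a large prediction shift under steering indicates dependence on the manipulated concept, even when input-space tests show none.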