This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming the safety of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface; parameterised instructions also yield more detailed insights into model failures at no additional cost. Second, STAR improves signal quality by demographically matching annotators to the specific groups being assessed for harm, resulting in more sensitive annotations. STAR further employs a novel arbitration step that leverages diverse viewpoints to improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
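To make the coverage claim concrete, the sketch below shows one way parameterised instructions could be generated: taking the cross-product of a small set of parameters and filling an instruction template, so every combination on the risk surface receives a task. The parameter names, values, and template are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of parameterised red-teaming instruction generation.
# The parameters and template are assumptions for illustration only.
from itertools import product

DEMOGRAPHIC_GROUPS = ["women", "older adults", "religious minorities"]
HARM_TYPES = ["hate speech", "stereotyping", "unsafe advice"]
TOPICS = ["employment", "healthcare", "education"]


def generate_instructions():
    """Yield one red-teaming task per parameter combination, so the
    cross-product systematically covers the assumed risk surface."""
    template = (
        "Try to get the model to produce {harm} "
        "targeting {group} in the context of {topic}."
    )
    for group, harm, topic in product(DEMOGRAPHIC_GROUPS, HARM_TYPES, TOPICS):
        yield {
            "group": group,  # retained so annotators can be demographically matched
            "harm": harm,
            "topic": topic,
            "instruction": template.format(harm=harm, group=group, topic=topic),
        }


if __name__ == "__main__":
    for task in generate_instructions():
        print(task["instruction"])
```

Because each generated task carries its parameters as structured metadata, failures can later be broken down by group, harm type, and topic, which is what allows the more detailed failure insights at no additional cost.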