In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing Anthropic's widely used Helpful and Harmless (HH) dataset. Our work includes: (1) a thorough investigation of the dataset's content through both manual and automated evaluation; (2) experiments demonstrating the dataset's impact on model safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we show how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate safety behaviors across demographic groups. Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in LLMs.