In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing Anthropic's widely used Helpful and Harmless (HH) dataset. Our work includes: (1) a thorough investigation of the dataset's content through both manual and automated evaluation; (2) experiments demonstrating the dataset's impact on model safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we show how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate safety behaviors across demographic groups. Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in LLMs.