In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails.