In this paper, we introduce the \textsc{BeaverTails} dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answer (QA) pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 333,963 QA pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of \textsc{BeaverTails} in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at https://sites.google.com/view/pku-beavertails. Warning: this paper contains example data that may be offensive or harmful.