Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for analyzing ordinal safety ratings in pluralistic settings. Specifically, we address the challenge of interpreting subtle differences in safety feedback that a diverse population expresses via ordinal scales (e.g., a Likert scale). We define non-parametric responsiveness metrics that quantify how raters convey both broad distinctions and granular variations in the severity of safety violations. Using publicly available datasets of pluralistic safety feedback as case studies, we investigate how raters from different demographic groups use an ordinal scale to express their perceptions of violation severity. We apply our metrics across violation types, demonstrating their utility in extracting fine-grained insights that are crucial for reliably aligning AI systems in multi-cultural contexts. We show that our approach can inform rater selection and feedback interpretation by capturing distinct viewpoints across demographic groups, thereby improving the quality of pluralistic data collection and, in turn, contributing to more robust AI alignment.
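The abstract does not define the responsiveness metrics themselves, but the underlying idea of quantifying how well a rater's ordinal scores track violation severity can be illustrated with a toy non-parametric quantity. The sketch below is purely hypothetical (it is not the paper's metric): it computes Kendall's tau-a between a reference severity ordering and a rater's Likert scores, so a rater whose scores rise monotonically with severity registers as highly responsive, while a rater who uses a single scale point registers as unresponsive.

```python
# Hypothetical illustration only: the paper's actual responsiveness metrics
# are not specified in this abstract. Kendall's tau-a is one simple
# non-parametric rank correlation between two ordinal sequences.
from itertools import combinations

def kendall_tau_a(severity, ratings):
    """Kendall's tau-a between two equal-length ordinal sequences."""
    n = len(severity)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (severity[i] - severity[j]) * (ratings[i] - ratings[j])
        if s > 0:
            concordant += 1   # pair ordered the same way in both sequences
        elif s < 0:
            discordant += 1   # pair ordered oppositely
        # ties (s == 0) count toward neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two hypothetical raters scoring the same five items (true severity 1..5)
severity   = [1, 2, 3, 4, 5]
responsive = [1, 2, 2, 4, 5]   # Likert scores track severity closely
flat       = [3, 3, 3, 3, 3]   # one scale point for everything

print(kendall_tau_a(severity, responsive))  # → 0.9
print(kendall_tau_a(severity, flat))        # → 0.0
```

A measure of this kind can be computed per rater and aggregated per demographic group, which is one plausible way to compare how different groups use an ordinal scale; the paper's metrics should be consulted for the actual definitions.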