Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems with human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text preferences, with inter-rater agreement reaching good levels (ICC(2,k) $\approx$ .80) at $\sim$9 raters, the first ICC-based reliability characterization of preference annotations in either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, and agreement across modalities is near chance. Synthetic ratings, moreover, align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and for fully replacing human annotations.
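For context, ICC(2,k) denotes the two-way random-effects, average-measures intraclass correlation in the notation of Shrout and Fleiss (1979); a worked form may help readers parse the reliability claim above. Assuming the standard estimator (the abstract does not spell it out), with $n$ rated items and $k$ raters it is computed from the two-way ANOVA mean squares as
\[
\mathrm{ICC}(2,k) = \frac{MS_{\mathrm{items}} - MS_{\mathrm{error}}}{MS_{\mathrm{items}} + \left(MS_{\mathrm{raters}} - MS_{\mathrm{error}}\right)/n},
\]
where $MS_{\mathrm{items}}$, $MS_{\mathrm{raters}}$, and $MS_{\mathrm{error}}$ are the between-item, between-rater, and residual mean squares. Under conventional guidelines, values between roughly .75 and .90 indicate good reliability, which is the sense in which $\approx$ .80 at $\sim$9 raters is characterized as "good" above.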