Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Permitting neutrality and abstention during stated-preference elicitation lets us exclude weak signals, which substantially improves Spearman's rank correlation ($\rho$) between volunteered stated preferences and forced-choice revealed preferences. However, also permitting abstention in revealed-preference elicitation drives $\rho$ to near zero or negative values because of high neutrality rates. Finally, we find that steering models through the system prompt with their own stated preferences during revealed-preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
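To make the filtering step concrete, the following is a minimal sketch of how Spearman's $\rho$ might be computed once neutral or abstained responses are excluded. It is not the authors' pipeline: the function names (`average_ranks`, `spearman_rho`), the use of `None` to mark a neutral or abstained response, and the per-value scores in the usage example are all illustrative assumptions.

```python
from math import sqrt
from typing import Optional, Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(
    stated: Sequence[Optional[float]],
    revealed: Sequence[Optional[float]],
) -> Optional[float]:
    """Spearman's rho over items where both scores are present.

    Assumption: None marks a neutral or abstained response (hypothetical
    encoding). Returns None when fewer than two items remain or when one
    side has no rank variation, since rho is undefined in those cases.
    """
    pairs = [(s, r) for s, r in zip(stated, revealed) if s is not None and r is not None]
    if len(pairs) < 2:
        return None
    xs = average_ranks([s for s, _ in pairs])
    ys = average_ranks([r for _, r in pairs])
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    # Pearson correlation of the ranks equals Spearman's rho.
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-value scores; the second stated entry is an abstention.
    stated_scores = [0.9, None, 0.4, 0.1]
    revealed_scores = [0.8, 0.2, 0.5, 0.3]
    print(spearman_rho(stated_scores, revealed_scores))
```

Returning `None` rather than `0.0` when too few items survive filtering keeps an undefined correlation distinct from a genuinely absent one, which matters when abstention rates are high.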