Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman's rank correlation ($\rho$) between volunteered stated preferences and forced-choice revealed preferences. However, additionally allowing abstention during revealed preference elicitation drives $\rho$ to near-zero or negative values because of high neutrality rates. Finally, we find that system-prompt steering with stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
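To make the filtering-then-correlation step concrete, here is a minimal illustrative sketch (not the paper's code, and with entirely hypothetical scores): items where the model volunteered neutrality or abstention in the stated phase are excluded, and Spearman's $\rho$ is computed over the remaining stated-revealed pairs using the rank-difference formula for the no-ties case.

```python
def rank(xs):
    # Assign ranks 1..n in ascending order; assumes no ties for simplicity.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(a, b):
    # Spearman's rho via the rank-difference formula (valid without ties):
    #   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-item preference scores; None marks a volunteered
# neutral/abstain stated response, which the filtering step excludes.
stated = [0.9, None, 0.4, 0.7, None, 0.2]
revealed = [0.8, 0.5, 0.3, 0.9, 0.1, 0.25]

kept = [(s, r) for s, r in zip(stated, revealed) if s is not None]
s_vals, r_vals = zip(*kept)
print(round(spearman_rho(s_vals, r_vals), 3))  # rho over non-neutral items
```

The key design point mirrored here is that the stated-phase neutral responses determine which items enter the correlation at all; allowing abstention on the revealed side as well would shrink (or empty) the paired sample, which is one way high neutrality rates can destabilize $\rho$.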