Variation in human annotation (i.e., annotator disagreement) is common in NLP and often carries important information, such as task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, yielding 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, whereas naive Chain-of-Thought (CoT) reasoning improves the performance of LLMs trained with RLHF (Reinforcement Learning from Human Feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreement is important.