Current large language models (LLMs) have demonstrated emergent capabilities in social intelligence tasks, including implicature resolution and theory-of-mind reasoning, both of which require substantial pragmatic understanding. However, how LLMs acquire this pragmatic competence over the course of training remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two equally plausible yet pragmatically divergent continuations and requires the model to (i) infer the speaker's intended meaning and (ii) explain when and why a speaker would choose one utterance over its alternative, thereby probing pragmatic competence directly through contrastive reasoning. To examine how pragmatic competence develops, we systematically evaluate 22 LLMs at three key training stages: after pre-training, after supervised fine-tuning (SFT), and after preference optimization. Our results show that even base models exhibit notable sensitivity to pragmatic cues, and that this sensitivity improves consistently as model and data scale increase. SFT and RLHF contribute further gains, particularly in cognitive-pragmatic scenarios. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.
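To make the contrastive setup concrete, the sketch below shows how one ALTPRAG-style item and its evaluation prompt might be represented. This is a minimal illustration under stated assumptions: the class name `AltPragItem`, its field names, and the prompt wording are hypothetical and do not reflect the released dataset's actual schema.

```python
# Hypothetical sketch of one ALTPRAG-style contrastive item; field names and
# prompt wording are assumptions, not the dataset's actual format.
from dataclasses import dataclass


@dataclass
class AltPragItem:
    context: str          # shared dialogue or narrative context
    continuation_a: str   # one plausible continuation
    continuation_b: str   # a pragmatically divergent alternative


def build_contrastive_prompt(item: AltPragItem) -> str:
    """Ask a model to (i) infer speaker intent and (ii) contrast the alternatives."""
    return (
        f"Context: {item.context}\n"
        f"Continuation A: {item.continuation_a}\n"
        f"Continuation B: {item.continuation_b}\n"
        "Question: What does the speaker of Continuation A intend to convey, "
        "and when and why would a speaker choose A over B?"
    )


if __name__ == "__main__":
    item = AltPragItem(
        context="A colleague asks how the product demo went.",
        continuation_a="Well, nobody walked out.",
        continuation_b="It went fine.",
    )
    print(build_contrastive_prompt(item))
```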