Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

We address two persistent gaps in Emotion Recognition in Conversation (ERC): which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10 to 30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (weighted F1 of 82.69% on the 4-way and 67.07% on the 6-way label set), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < .0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting that sadness may be more context-dependent than emotions with stronger local pragmatic cues.

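The context-saturation claim (roughly 90% of the gain within the most recent 10 to 30 turns) can be made concrete with a small calculation. The sketch below is one plausible way to compute such a saturation point and is not taken from the paper: the function name `saturation_window`, the choice to measure gain from the utterance-only baseline to the best observed score, and the toy scores are all assumptions made for illustration.

```python
from typing import Mapping


def saturation_window(f1_by_window: Mapping[int, float], fraction: float = 0.9) -> int:
    """Return the smallest context window that captures `fraction` of the total gain.

    `f1_by_window` maps a context size (number of preceding turns, 0 = utterance
    only) to a score. Gain is measured from the window-0 baseline to the best
    score observed across all windows (an assumption of this sketch).
    """
    if 0 not in f1_by_window:
        raise ValueError("need a window-0 (utterance-only) baseline")
    baseline = f1_by_window[0]
    best = max(f1_by_window.values())
    total_gain = best - baseline
    if total_gain <= 0:
        return 0  # context never helps, so the baseline is already saturated
    target = baseline + fraction * total_gain
    # Scan windows from smallest to largest and stop at the first that meets the target.
    for window in sorted(f1_by_window):
        if f1_by_window[window] >= target:
            return window
    raise AssertionError("unreachable: the best window always meets the target")


# Made-up scores for illustration only; these are NOT results from the paper.
toy_scores = {0: 60.0, 5: 72.0, 10: 78.5, 20: 79.5, 40: 80.0}
print(saturation_window(toy_scores))  # prints 10
```

With these toy numbers the total gain is 20 points, so the 90% target is 78.0, and the first window to reach it is 10 turns. In a multi-seed setup like the one described, the same function could be applied to seed-averaged scores per window.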