We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Although recent advances have demonstrated that LLMs can engage in natural conversations with users, we show that even leading models struggle, surprisingly, to predict a human speaker's next utterance. In contrast, humans readily anticipate forthcoming utterances from multimodal cues in the context, such as gestures, gaze, and emotional tone. To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues across a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset of dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, which incorporates a cognitively inspired design to emulate predictive processing in conversation. Experimental results show that our model outperforms state-of-the-art MLLMs in lexical overlap, semantic similarity, and emotion consistency. Our results demonstrate the feasibility of next-utterance prediction with LLMs from multimodal cues and highlight (i) the indispensable role of multimodal cues and (ii) active predictive processing as a foundation of natural human interaction, both of which are missing in current MLLMs. We hope this exploration offers a new entry point toward more human-like, context-sensitive interaction for human-centered AI. Our benchmark and model are available at https://saynext.github.io/.