Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We show that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we find that speech-to-text (S2T) performance matches or exceeds text-to-text (T2T) performance on spatial, syntactic, and factual tasks. Yet on logical tasks that require entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention that forces SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap even when spoken names are misrecognized, yielding accuracy gains of up to 24.4 percentage points. Ablations confirm that the gains stem from explicit semantic binding, which reframes the gap as an elicitation failure rather than a missing capability.
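The abstract describes EA-CoT only at a high level. The following Python sketch shows one way an entity-aware instruction and a matching binding parser might look, assuming the SLLM accepts a text instruction alongside the spoken input. The prompt wording, the `build_ea_cot_prompt` and `parse_bindings` helpers, and the "ENTITY: claim" output format are hypothetical illustrations, not the authors' prompt or code.

```python
from collections import defaultdict

# Hypothetical EA-CoT-style instruction: enumerate entities, bind claims, then reason.
# The paper's actual prompt wording is not given in the abstract.
EA_COT_INSTRUCTIONS = (
    "Before answering, work in three steps.\n"
    "Step 1 - Entities: list every person or object mentioned in the audio, one per line.\n"
    "Step 2 - Bindings: for each entity, write each claim or property stated about it "
    "as 'ENTITY: claim'. If a spoken name is unclear, keep one consistent spelling.\n"
    "Step 3 - Reasoning: answer the question using only the bindings from Step 2.\n"
)


def build_ea_cot_prompt(question: str) -> str:
    """Return the text instruction sent alongside the audio (hypothetical helper)."""
    return f"{EA_COT_INSTRUCTIONS}\nQuestion: {question}\nStep 1 - Entities:"


def parse_bindings(response: str) -> dict[str, list[str]]:
    """Collect 'ENTITY: claim' lines found between the Step 2 and Step 3 headers."""
    bindings: dict[str, list[str]] = defaultdict(list)
    in_step2 = False
    for line in response.splitlines():
        stripped = line.strip()
        if stripped.startswith("Step 2"):
            in_step2 = True  # binding lines start after this header
            continue
        if stripped.startswith("Step 3"):
            break  # reasoning section: stop collecting bindings
        if in_step2 and ":" in stripped:
            entity, claim = stripped.split(":", 1)
            bindings[entity.strip()].append(claim.strip())
    return dict(bindings)
```

Parsing the bindings separately from the final answer makes it possible to inspect whether each entity kept a consistent label and the right claims, which is the kind of explicit binding the abstract credits for the gains.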

