Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.
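The enumerate-then-bind procedure behind EA-CoT can be pictured as a prompt wrapper applied at inference time. The sketch below is only an illustration under stated assumptions: the abstract does not give the actual prompt wording, so the step text, the `Answer:` convention, and the helper names `build_ea_cot_prompt` and `extract_answer` are all hypothetical.

```python
from typing import Optional

# Hypothetical instruction steps; the wording used in the paper is not
# given in the abstract, so these are placeholders for illustration.
EA_COT_STEPS = [
    "List every entity (person, object, or place) mentioned in the audio.",
    "For each entity, write down the claims or properties attached to it.",
    "Reason step by step using only the bindings listed above.",
    "State the final answer on its own line, prefixed with 'Answer:'.",
]


def build_ea_cot_prompt(question: str) -> str:
    """Wrap a task question with explicit entity-enumeration and binding steps."""
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(EA_COT_STEPS, start=1)
    )
    return f"{question}\n\nBefore answering, follow these steps:\n{numbered}"


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(response.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    print(build_ea_cot_prompt("Who is telling the truth?"))
```

In this sketch the text prompt would accompany the spoken input to the SLLM. The intent is that forcing the model to write out entities and their bindings first makes the associations explicit in text before any reasoning step depends on them.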