Although speech large language models (Speech LLMs) have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap may stem from representational drift across Transformer layers and behavioral deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense, complementary signals: a representation alignment reward, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and a behavior alignment reward, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach substantially narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
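To make the two alignment signals concrete, the following Python sketch computes a layer-wise cosine-similarity reward over hidden states and an embedding-based semantic consistency reward, then combines them with a task reward. The function names, pooling choice, and weights (w_repr, w_behav) are illustrative assumptions, not the exact formulation used in TARS.

```python
import numpy as np

def _cosine(a, b, eps=1e-8):
    # Cosine similarity between two vectors, with a small epsilon for stability.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def representation_alignment_reward(h_speech, h_text):
    # h_speech, h_text: arrays of shape (num_layers, dim), e.g. mean-pooled
    # hidden states per Transformer layer for the speech- and text-conditioned runs.
    # Returns the average layer-wise cosine similarity.
    return float(np.mean([_cosine(hs, ht) for hs, ht in zip(h_speech, h_text)]))

def behavior_alignment_reward(gen_embedding, ref_embedding):
    # Semantic consistency between the generated output and the reference text
    # completion, approximated here by sentence-embedding cosine similarity.
    return _cosine(gen_embedding, ref_embedding)

def total_reward(task_reward, h_speech, h_text, gen_emb, ref_emb,
                 w_repr=0.5, w_behav=0.5):
    # Combine the sparse task reward with the two dense alignment signals.
    # The weights here are placeholders, not values reported in the paper.
    return (task_reward
            + w_repr * representation_alignment_reward(h_speech, h_text)
            + w_behav * behavior_alignment_reward(gen_emb, ref_emb))
```

In this sketch the asymmetry lies in the direction of supervision: the text-conditioned trajectory and reference completion act as fixed targets that the speech-conditioned trajectory is rewarded for matching, rather than both modalities being pulled toward each other.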