ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose \textbf{ParaBridge}, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from $14.6\%$ to $40.3\%$ and improves EchoMind average rating from $3.27$ to $3.92$. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within $0.4$ points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.
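
To make the training signal concrete, the following is a minimal sketch of the per-token objective described above, not the authors' implementation. It assumes a forward KL divergence from the scaffolded view to the scaffold-free view (the abstract does not state the divergence), and it treats the scaffolded distribution as a fixed target. The names `toy_logits`, `self_distillation_loss`, `SCAFFOLD_TOKEN`, and the five-token vocabulary are hypothetical stand-ins for a real SLM and its tokenizer. The sketch only computes the loss; no gradient step is taken.

```python
import math
import random

VOCAB_SIZE = 5        # hypothetical toy vocabulary
SCAFFOLD_TOKEN = 4    # hypothetical marker standing in for the paralinguistic scaffold


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q), summed over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_logits(tokens):
    """Toy stand-in for a model: next-token logits given a token list.

    A leading scaffold token shifts probability mass toward token 0,
    mimicking a scaffold that changes how the model replies.
    """
    logits = [0.1 * ((sum(tokens) + i) % VOCAB_SIZE) for i in range(VOCAB_SIZE)]
    if tokens[:1] == [SCAFFOLD_TOKEN]:
        logits[0] += 2.0
    return logits


def self_distillation_loss(next_token_logits, prompt, scaffold, max_new_tokens, rng):
    """Roll out without the scaffold; score each step against the scaffolded view."""
    response = []
    per_token_kl = []
    for _ in range(max_new_tokens):
        # Scaffold-free view: the distribution the deployed model would use.
        student = softmax(next_token_logits(prompt + response))
        # Scaffolded view: same model, same response prefix, scaffold prepended.
        # Assumption: used as a fixed target (no gradient through it).
        teacher = softmax(next_token_logits(scaffold + prompt + response))
        # Dense, full-vocabulary target at this position.
        per_token_kl.append(kl_divergence(teacher, student))
        # On-policy: the next token is sampled from the scaffold-free view.
        token = rng.choices(range(len(student)), weights=student, k=1)[0]
        response.append(token)
    return sum(per_token_kl) / len(per_token_kl), response


loss, response = self_distillation_loss(
    toy_logits, prompt=[1, 2], scaffold=[SCAFFOLD_TOKEN],
    max_new_tokens=6, rng=random.Random(0),
)
print(f"mean per-token KL: {loss:.4f}, rollout: {response}")
```

Because the rollout is sampled from the scaffold-free view, the targets are computed on prefixes the model itself would produce at inference time, which is what makes the distillation on-policy. With an empty scaffold the two views coincide and the loss is zero.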
