Speech carries more information than words alone: a child's voice, a fearful tone, or a noisy background should each lead a sufficiently competent spoken-dialogue assistant to a different reply. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold applied at inference time narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. We therefore propose \textbf{ParaBridge}, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view: the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along that trajectory. This supervision teaches the model when non-lexical cues should affect the reply, without requiring curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from $14.6\%$ to $40.3\%$ and improves the EchoMind average rating from $3.27$ to $3.92$. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within $0.4$ points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.
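The training signal described above can be illustrated with a minimal sketch. It assumes that the "dense, full-vocabulary next-token targets" are used through a per-token KL divergence from the scaffolded view to the scaffold-free view, averaged over a trajectory sampled by the scaffold-free model. The KL direction, the averaging, and all function names (`log_softmax`, `token_kl`, `self_distillation_loss`) are illustrative assumptions, not details confirmed by the abstract; the logits are toy values over a three-token vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the full vocabulary at one position.

    Assumption: the scaffolded view acts as the teacher and the
    scaffold-free view as the student; the direction is hypothetical.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def self_distillation_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token KL along a trajectory rolled out by the student.

    Both sequences are scored on the same student-sampled tokens; only
    the conditioning differs (without vs. with the scaffold).
    """
    assert len(student_logits_seq) == len(teacher_logits_seq)
    per_token = [
        token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Toy logits for a two-token trajectory over a three-token vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]  # scaffold-free view
teacher = [[0.5, 2.0, 0.1], [0.3, 0.2, 1.5]]  # scaffolded view
loss = self_distillation_loss(student, teacher)
```

In this toy case the two views disagree only at the first position, so that position alone contributes to the loss. In an actual training loop the teacher logits would be treated as fixed targets and gradients would flow only through the scaffold-free forward pass.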