Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
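To make the steering idea concrete, here is a minimal sketch of activation steering with a perception vector. It is not the authors' implementation. The abstract does not say how the perception vector is obtained, so this sketch assumes a common difference-of-means construction: the mean hidden state collected while the model is in the perceptive state, minus the mean hidden state collected in the generative state. The function names, the scaling factor `alpha`, and the toy two-dimensional hidden states are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(states: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length hidden states."""
    n = len(states)
    dim = len(states[0])
    return [sum(s[i] for s in states) / n for i in range(dim)]


def perception_vector(
    perceptive_states: Sequence[Sequence[float]],
    generative_states: Sequence[Sequence[float]],
) -> Vector:
    """ASSUMPTION: difference of means between the two states.

    The source names a "perception vector" but does not define its construction;
    this is one common way to build a steering direction.
    """
    mu_p = mean_vector(perceptive_states)
    mu_g = mean_vector(generative_states)
    return [p - g for p, g in zip(mu_p, mu_g)]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift one hidden state along the steering direction: h' = h + alpha * v.

    `alpha` is a hypothetical strength parameter. No weights are updated,
    which is what makes the intervention training-free.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy hidden states (hypothetical values, 2 dimensions for readability).
perceptive = [[1.0, 0.0], [3.0, 2.0]]   # collected while the model listens
generative = [[0.0, 1.0], [2.0, 3.0]]   # collected while the model speaks

v = perception_vector(perceptive, generative)
steered = steer([0.5, 0.5], v, alpha=2.0)
```

In a real FD-SLM the same addition would be applied to a layer's activations at inference time, for example through a forward hook. The layer to steer and the value of `alpha` would have to be chosen empirically.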
