Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, which requires a streaming policy that decides when to read more input and when to write output. State-of-the-art approaches rely on attention-based encoder-decoder models, where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures that rely solely on self-attention. This raises a central question: does decoder self-attention contain sufficiently stable alignment signals to guide the streaming policy? Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding, without retraining.
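To make the read/write terminology concrete, the following is a minimal sketch of the two kinds of streaming policy the abstract contrasts. It is illustrative only and is not the DOA algorithm, whose details are not given in this abstract. The function names, the `guard_frames` and `threshold` parameters, and the rule "write only if little attention falls on the most recent audio" are all assumptions made for illustration.

```python
from typing import List, Sequence


def wait_k_actions(num_source_chunks: int, num_target_tokens: int, k: int) -> List[str]:
    """Heuristic wait-k schedule: stay k source chunks ahead of the output."""
    actions: List[str] = []
    read, written = 0, 0
    while written < num_target_tokens:
        # READ until we are k chunks ahead of the output or the source is exhausted.
        if read < min(written + k, num_source_chunks):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


def should_write(attn_over_audio: Sequence[float], guard_frames: int, threshold: float) -> bool:
    """Hypothetical attention-based check (an assumption, not the paper's rule).

    attn_over_audio holds the attention weights that a candidate target token
    places on the audio positions received so far. The token is emitted only if
    the share of attention on the last guard_frames positions is below threshold,
    i.e. the token does not seem to depend on audio that may still be incomplete.
    """
    total = sum(attn_over_audio)
    if total == 0 or guard_frames <= 0:
        # No usable evidence: be conservative and keep reading.
        return False
    recent = sum(attn_over_audio[-guard_frames:])
    return recent / total < threshold


schedule = wait_k_actions(num_source_chunks=3, num_target_tokens=3, k=2)
print(schedule)
```

The wait-k schedule depends only on counts, whereas the second function makes a per-token decision from the model's own attention, which is the kind of signal a training-free policy can read from an off-the-shelf model.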