Speech-to-text (S2T) systems for automatic speech recognition (ASR) and speech-to-text translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling generates in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T encodes speech with a frozen Whisper encoder and a single linear projector, then prepends the resulting audio condition to the noisy text latent for in-context flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause: confusion between representations that lie close together in the continuous latent space. This finding aligns naturally with the continuous-representation generation paradigm and points to a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.
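As a rough illustration of how classifier-free guidance can amplify an audio condition during flow-matching sampling, the sketch below uses the standard guidance rule: the guided velocity extrapolates from the unconditional prediction toward the audio-conditioned one. It is not the ELF-S2T implementation. The names `guided_velocity`, `euler_sample`, and `toy_velocity`, the guidance `scale`, the Euler integrator, and the convention that t runs from 0 (noise) to 1 (text latent) are all assumptions made for illustration, and the toy velocity function merely stands in for the trained network.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: velocity(x_t, t, audio) -> predicted velocity.
# Passing audio=None stands for dropping the audio condition.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def guided_velocity(velocity: VelocityFn, x_t: Vector, t: float,
                    audio: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance on the predicted velocity.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further toward the audio condition.
    """
    v_cond = velocity(x_t, t, audio)
    v_uncond = velocity(x_t, t, None)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity: VelocityFn, x_0: Vector, audio: Vector,
                 scale: float, steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x_0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = guided_velocity(velocity, x, t, audio, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x_t: Vector, t: float, audio: Optional[Vector]) -> Vector:
    """Toy stand-in for the network (assumption, not a trained model).

    Pulls toward the audio vector when it is given, and toward the origin
    (a placeholder for a text-only prior) when it is dropped.
    """
    target = audio if audio is not None else [0.0] * len(x_t)
    return [g - xi for g, xi in zip(target, x_t)]


if __name__ == "__main__":
    noise = [1.0, -1.0]
    audio = [2.0, 0.0]
    print(euler_sample(toy_velocity, noise, audio, scale=1.5, steps=10))
```

In this toy setting, raising `scale` above 1 moves the sample further toward the audio-driven target and away from the unconditional one, which is the qualitative effect guidance is meant to have. The appropriate scale for a real model is an empirical choice.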