End-to-end automatic speech recognition (ASR) systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse a shared Transformer block recurrently, but we find that naive looping does not fully exploit the additional recurrent compute. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback. These components structure the loop into recognition checkpoints separated by latent refinement phases and allow shared weights to specialize across recurrent steps. On LibriSpeech, LARM's word error rate (WER) decreases as the number of inference loops increases, and its performance is competitive with deeper unshared-parameter baselines. Our results show that test-time compute scaling can extend beyond autoregressive language-model reasoning to continuous non-autoregressive speech recognition.
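To make the looping idea concrete, the following is a minimal NumPy sketch of a shared block applied recurrently, with per-step FiLM modulation and sparse CTC-style output heads. It is an illustration under stated assumptions, not the LARM implementation: the class name `LoopedFiLMEncoder`, the residual tanh layer standing in for a Transformer block, the per-step `gamma`/`beta` tables, and the fixed checkpoint interval are all hypothetical choices. Supervision-clock embeddings and delayed soft-posterior feedback are omitted.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each frame (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class LoopedFiLMEncoder:
    """Hypothetical sketch: one shared block reused for every loop step.

    Only the FiLM scale/shift vectors differ between steps, which is one
    simple way to let shared weights behave differently at each depth.
    """

    def __init__(self, d_model, vocab_size, max_loops, checkpoint_every, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # Shared weights, reused at every loop step (stand-in for a Transformer block).
        self.w_in = rng.normal(0.0, scale, (d_model, d_model))
        self.w_out = rng.normal(0.0, scale, (d_model, d_model))
        # Per-step FiLM parameters (assumption: a lookup table indexed by step).
        self.gamma = 1.0 + 0.1 * rng.normal(size=(max_loops, d_model))
        self.beta = 0.1 * rng.normal(size=(max_loops, d_model))
        # Shared output projection used at the sparse checkpoints.
        self.w_ctc = rng.normal(0.0, scale, (d_model, vocab_size))
        self.max_loops = max_loops
        self.checkpoint_every = checkpoint_every

    def forward(self, x, num_loops):
        """Run `num_loops` steps on frames x of shape (time, d_model).

        Returns the final hidden states and a dict mapping loop count to
        per-frame logits of shape (time, vocab_size).
        """
        if not 1 <= num_loops <= self.max_loops:
            raise ValueError("num_loops must be in [1, max_loops]")
        h = x
        checkpoints = {}
        for step in range(num_loops):
            # FiLM: feature-wise scale and shift that depends on the loop step.
            modulated = self.gamma[step] * layer_norm(h) + self.beta[step]
            # Shared residual update.
            h = h + np.tanh(modulated @ self.w_in) @ self.w_out
            # Sparse checkpoint: emit logits only every `checkpoint_every` steps.
            if (step + 1) % self.checkpoint_every == 0:
                checkpoints[step + 1] = layer_norm(h) @ self.w_ctc
        return h, checkpoints
```

In this sketch the test-time compute knob is simply `num_loops`: calling `forward(x, 4)` and `forward(x, 6)` on the same encoder reuses the same shared weights and differs only in how many refinement steps are run and which checkpoints are emitted.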