We propose the Chunk-wise Attention Transducer (CHAT), a novel extension of RNN-T models that processes audio in fixed-size chunks and applies cross-attention within each chunk. This hybrid approach preserves RNN-T's streaming capability while adding controlled flexibility for local alignment modeling. By shrinking the temporal dimension that RNN-T must handle, CHAT yields substantial efficiency gains: up to 46.2% lower peak training memory, up to 1.36x faster training, and up to 1.69x faster inference. Alongside these efficiency gains, CHAT consistently improves accuracy over RNN-T across multiple languages and tasks, with up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method is particularly effective for speech translation, where RNN-T's strict monotonic alignment hurts performance. Our results show that CHAT offers a practical way to deploy more capable streaming speech models without violating real-time constraints.
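To make the chunking idea concrete, the following is a minimal NumPy sketch of cross-attention restricted to fixed-size chunks. It is an illustration under stated assumptions, not the CHAT implementation: the abstract does not specify where the queries come from, how scores are computed, or how the final partial chunk is handled. Here the queries are assumed to be per-label vectors (for example, prediction-network states), the scores are assumed to be scaled dot products, and the last chunk is assumed to be zero-padded and masked. The function names `chunk_frames` and `chunkwise_cross_attention` are hypothetical.

```python
import numpy as np


def chunk_frames(frames, chunk_size):
    """Split (T, D) frames into (N, C, D) chunks and an (N, C) validity mask.

    Assumption: the final partial chunk is zero-padded and masked out.
    """
    T, D = frames.shape
    n_chunks = -(-T // chunk_size)  # ceiling division
    pad = n_chunks * chunk_size - T
    padded = np.pad(frames, ((0, pad), (0, 0)))
    mask = np.arange(n_chunks * chunk_size) < T
    return (padded.reshape(n_chunks, chunk_size, D),
            mask.reshape(n_chunks, chunk_size))


def chunkwise_cross_attention(frames, queries, chunk_size):
    """Return (N, U, D) contexts; each query attends only within one chunk.

    Assumption: scaled dot-product attention with queries of shape (U, D).
    """
    chunks, mask = chunk_frames(frames, chunk_size)
    D = frames.shape[1]
    # scores[n, u, c] = <queries[u], chunks[n, c]> / sqrt(D)
    scores = np.einsum("ud,ncd->nuc", queries, chunks) / np.sqrt(D)
    # Padded frames get -inf so they receive zero attention weight.
    scores = np.where(mask[:, None, :], scores, -np.inf)
    # Numerically stable softmax over the frames of each chunk.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("nuc,ncd->nud", weights, chunks)


# Usage: 50 encoder frames, 3 label positions, chunks of 8 frames.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))
queries = rng.standard_normal((3, 8))
context = chunkwise_cross_attention(frames, queries, chunk_size=8)
print(context.shape)  # (7, 3, 8): the time axis shrinks from 50 frames to 7 chunks
```

In this sketch, anything consuming `context` sees one step per chunk rather than one per frame, which is the kind of temporal reduction the abstract describes. Attention never crosses a chunk boundary, so each chunk can be processed as soon as its frames arrive.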