Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Because speech signals are longer than their textual counterparts and Transformer self-attention has quadratic complexity in the sequence length, a down-sampling step is essential for its adoption in Speech Translation. In this research, we instead propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, Dynamic Latent Access (DLA), which unlocks larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference and can be flexibly deployed under various computational budgets without significant drops in translation quality.
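To illustrate how a Perceiver-style encoder yields a fixed-length representation, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It uses a single cross-attention step without learned projections, in which a small set of latent vectors act as queries over the speech frames. The function `select_latents` is a hypothetical stand-in for DLA: it assumes that only a subset of a larger latent pool is accessed at each step, which is one plausible reading of the description above. All names and sizes (`D_MODEL`, `N_POOL`, `K_ACTIVE`) are illustrative.

```python
import math
import random


def cross_attend(latents, inputs):
    """Single-head cross-attention without learned projections.

    latents: k query vectors; inputs: T key/value vectors (all of size d).
    Returns k output vectors, so the output length never depends on T.
    """
    d = len(latents[0])
    outputs = []
    for q in latents:
        # Scaled dot-product score of this latent against every input frame.
        scores = [sum(a * b for a, b in zip(q, x)) / math.sqrt(d) for x in inputs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the input frames.
        outputs.append([
            sum(w * x[j] for w, x in zip(weights, inputs)) / total
            for j in range(d)
        ])
    return outputs


def select_latents(pool, k, rng):
    """Hypothetical DLA-style step (assumption): use only k of the n pool latents."""
    return rng.sample(pool, k)


def random_vectors(count, dim, rng):
    """Random Gaussian vectors standing in for learned latents or speech features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
D_MODEL, N_POOL, K_ACTIVE = 16, 64, 8  # illustrative sizes only
latent_pool = random_vectors(N_POOL, D_MODEL, rng)

shapes = {}
for num_frames in (50, 400):
    speech = random_vectors(num_frames, D_MODEL, rng)
    active = select_latents(latent_pool, K_ACTIVE, rng)
    encoded = cross_attend(active, speech)
    shapes[num_frames] = (len(encoded), len(encoded[0]))

print(shapes)  # {50: (8, 16), 400: (8, 16)}
```

In this sketch the number of attention scores is `K_ACTIVE * num_frames`, which grows linearly with the input length, whereas self-attention over the frames would need `num_frames ** 2` scores. Because only `K_ACTIVE` latents are accessed per step, the cost is also independent of `N_POOL`, so the pool can grow without extra attention computation.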