LegoNN: Building Modular Encoder-Decoder Models

State-of-the-art encoder-decoder models (e.g., for machine translation (MT) or automatic speech recognition (ASR)) are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others, making it impossible to share parts, e.g., a high-resource decoder, across tasks. We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without any fine-tuning. To achieve this reusability, the interface between encoder and decoder modules is grounded to a sequence of marginal distributions over a pre-defined discrete vocabulary. We present two approaches for ingesting these marginals: one is differentiable, allowing gradients to flow across the entire network, and the other is gradient-isolating. To make decoder modules portable between MT tasks with different source languages and across other tasks such as ASR, we introduce a modality-agnostic encoder that includes a length control mechanism, which dynamically adapts the encoder's output length to match the expected input length range of pre-trained decoders. We present several experiments to demonstrate the effectiveness of LegoNN models: a language generation LegoNN decoder module trained on the German-English (De-En) MT task can be reused without any fine-tuning for the Europarl English ASR and the Romanian-English (Ro-En) MT tasks, matching or beating the performance of the baseline. After fine-tuning, LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve a 12.5% relative WER reduction on the Europarl ASR task. To show how the approach generalizes, we compose a LegoNN ASR model from three modules, each learned within a different end-to-end trained model on a different dataset, achieving an overall WER reduction of 19.5%.
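To make the interface idea concrete, the following is a minimal sketch of a decoder-side module consuming a sequence of marginal distributions over a shared vocabulary. It is an illustration under stated assumptions, not the method from the paper: the function names (`weighted_embedding`, `top_k_ingest`), the toy vocabulary size, and the embedding table are all hypothetical. The first function mixes embeddings by their probabilities, which stays differentiable in an autograd framework. The second keeps only the top-k token ids, a discrete selection that one could combine with detached probabilities to isolate gradients.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_embedding(marginals, embedding_table):
    """Hypothetical differentiable-style ingestion.

    For each position, return the expected embedding under that position's
    marginal distribution (a probability-weighted sum of embedding rows).
    """
    dim = len(embedding_table[0])
    return [
        [sum(p * embedding_table[v][d] for v, p in enumerate(dist))
         for d in range(dim)]
        for dist in marginals
    ]


def top_k_ingest(marginals, embedding_table, k=2):
    """Hypothetical gradient-isolating-style ingestion.

    Keep only the k most probable token ids per position and average their
    embeddings with renormalized probabilities. Assumption: in an autograd
    framework these probabilities would be treated as constants (detached),
    so no gradient reaches the module that produced the marginals.
    """
    dim = len(embedding_table[0])
    out = []
    for dist in marginals:
        top = sorted(range(len(dist)), key=lambda v: dist[v], reverse=True)[:k]
        mass = sum(dist[v] for v in top)
        out.append([
            sum(dist[v] / mass * embedding_table[v][d] for v in top)
            for d in range(dim)
        ])
    return out


# Toy setup (assumed values): vocabulary of 3 tokens, 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Encoder output: one marginal distribution per position.
marginals = [softmax([2.0, 0.0, 0.0]), [0.0, 1.0, 0.0]]

soft_inputs = weighted_embedding(marginals, embedding_table)
hard_inputs = top_k_ingest(marginals, embedding_table, k=1)
```

Because both functions depend only on the vocabulary and the embedding table, any encoder that emits distributions over the same vocabulary could in principle feed them, which is the kind of decoupling the abstract describes.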
