Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model, such as a Conformer transducer for speech recognition, on a mixture of data from all domains. However, changing the data in one domain or adding a new domain requires the multidomain model to be retrained. To address this, we propose a framework called modular domain adaptation (MDA) that enables a single model to process multidomain data while keeping all parameters domain-specific, i.e., each parameter is trained only on data from one domain. Starting from a streaming Conformer transducer trained only on video caption data, experimental results show that an MDA-based model can reach performance similar to that of the multidomain model on other domains, such as voice search and dictation, by adding per-domain adapters and per-domain feed-forward networks to the Conformer encoder.
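The routing idea behind MDA can be illustrated with a minimal sketch. The abstract does not specify the adapter architecture, so the residual bottleneck adapter with a zero-initialised up-projection below is an assumption, as are all names (`ModularLayer`, `add_domain`, `trainable_params`) and the single linear map standing in for a Conformer block. The sketch only shows how one model can serve several domains while each parameter group belongs to exactly one domain.

```python
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

def make_adapter(dim, bottleneck, rng):
    """Hypothetical residual bottleneck adapter (an assumption, not the paper's design).

    The up-projection starts at zero, so a new adapter is initially an identity map.
    """
    down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
    up = [[0.0] * bottleneck for _ in range(dim)]
    return {"down": down, "up": up}

def apply_adapter(adapter, x):
    """Return x + up(relu(down(x)))."""
    hidden = relu(matvec(adapter["down"], x))
    delta = matvec(adapter["up"], hidden)
    return [a + b for a, b in zip(x, delta)]

class ModularLayer:
    """Toy stand-in for one encoder block with per-domain modules."""

    def __init__(self, dim, base_domain, rng):
        self.dim = dim
        self.base_domain = base_domain
        self.rng = rng
        # Backbone weights: owned by the base domain only.
        self.backbone = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        # One adapter per added domain, keyed by domain name.
        self.adapters = {}

    def add_domain(self, domain, bottleneck=2):
        """Attach a fresh adapter for a new domain without touching other parameters."""
        self.adapters[domain] = make_adapter(self.dim, bottleneck, self.rng)

    def forward(self, x, domain):
        hidden = matvec(self.backbone, x)
        if domain == self.base_domain:
            return hidden
        return apply_adapter(self.adapters[domain], hidden)

    def trainable_params(self, domain):
        """Names of the parameter groups that data from `domain` may update."""
        if domain == self.base_domain:
            return ["backbone"]
        return ["adapter/" + domain]

# Usage: a base model for video captions, extended to two further domains.
layer = ModularLayer(dim=4, base_domain="video_caption", rng=random.Random(0))
layer.add_domain("voice_search")
layer.add_domain("dictation")
features = [1.0, -2.0, 0.5, 3.0]
outputs = {d: layer.forward(features, d) for d in ("video_caption", "voice_search", "dictation")}
```

Because each domain owns its parameters, updating the `voice_search` adapter in this sketch leaves the `video_caption` and `dictation` outputs unchanged, which is the property that removes the need to retrain the whole model when one domain changes.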