The number of end-to-end speech recognition models grows every year. These models are often adapted to new domains or languages, resulting in a proliferation of expert systems that achieve strong results on their target data but generally perform worse outside their domain of expertise. We explore the combination of such experts via confidence-based ensembles: ensembles of models in which only the output of the most confident model is used. We assume that the models' target data is unavailable except for a small validation set. We demonstrate the effectiveness of our approach with two applications. First, we show that a confidence-based ensemble of 5 monolingual models outperforms a system in which model selection is performed by a dedicated language identification block. Second, we demonstrate that base and adapted models can be combined to achieve strong results on both the original and the target data. We validate all our results on multiple datasets and model architectures.
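The selection rule described in the abstract (run every expert, keep only the output of the most confident one) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Hypothesis` container, the `mean_confidence` measure (average per-token probability), the `ensemble_transcribe` helper, and the two stub experts. The abstract does not say how confidence is computed or calibrated, so the measure below is a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Hypothesis:
    """Output of one expert model (hypothetical container)."""
    text: str
    token_probs: Sequence[float]  # per-token probabilities, assumed to be available


def mean_confidence(hyp: Hypothesis) -> float:
    """Assumed confidence measure: average per-token probability."""
    if not hyp.token_probs:
        return 0.0
    return sum(hyp.token_probs) / len(hyp.token_probs)


Expert = Callable[[Any], Hypothesis]


def ensemble_transcribe(audio: Any, experts: List[Tuple[str, Expert]]) -> Tuple[str, str]:
    """Run every expert and keep only the most confident hypothesis."""
    scored = [(name, model(audio)) for name, model in experts]
    best_name, best_hyp = max(scored, key=lambda item: mean_confidence(item[1]))
    return best_name, best_hyp.text


# Stub experts standing in for real monolingual models (illustration only).
def english_stub(audio: Any) -> Hypothesis:
    return Hypothesis("hello world", [0.9, 0.8])


def spanish_stub(audio: Any) -> Hypothesis:
    return Hypothesis("jelou guorld", [0.4, 0.3])


selected = ensemble_transcribe(b"", [("en", english_stub), ("es", spanish_stub)])
print(selected)
```

In this sketch the English stub is selected because its average token probability is higher. In a real setting, the small validation set mentioned in the abstract could be used to choose or calibrate the confidence measure, though how that is done is not specified here.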