Code-switching is a prevalent linguistic phenomenon in multilingual societies like India. Building speech-to-text models for code-switched speech is challenging because few datasets are available. In this work, we focus on spoken translation (ST) of code-switched speech in Indian languages to English text. We present CoSTA, a new end-to-end model architecture that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules, which are more widely available for many languages. Speech and ASR text representations are fused using an aligned interleaving scheme and fed as input to a pretrained MT module; the whole pipeline is then trained end-to-end for spoken translation using synthetically created ST data. We also release a new evaluation benchmark for code-switched Bengali-English, Hindi-English, Marathi-English and Telugu-English speech to English text. CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
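As an illustration of what an aligned interleaving of speech and ASR text representations could look like, the following is a minimal sketch and not the model's actual implementation. The abstract does not specify how alignments are obtained, the order of speech and text within each group, or the embedding dimensions, so the function name `interleave_aligned`, the half-open frame spans, and the "frames first, then token" ordering are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def interleave_aligned(
    speech_frames: Sequence[Vector],
    token_embeddings: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Fuse speech and text embeddings into one interleaved sequence.

    spans[i] = (start, end) is the half-open range of speech frames assumed
    to be aligned with token i. This alignment format is hypothetical.
    """
    if len(token_embeddings) != len(spans):
        raise ValueError("need exactly one frame span per token")

    fused: List[Vector] = []
    for token, (start, end) in zip(token_embeddings, spans):
        if not 0 <= start <= end <= len(speech_frames):
            raise ValueError(f"span ({start}, {end}) is out of range")
        # Assumed ordering: the aligned speech frames, then the text token.
        fused.extend(speech_frames[start:end])
        fused.append(token)
    return fused


# Toy example: five 1-dimensional speech frames and two ASR tokens.
frames = [[0.1], [0.2], [0.3], [0.4], [0.5]]
tokens = [[1.0], [2.0]]
fused = interleave_aligned(frames, tokens, [(0, 2), (2, 5)])
print(len(fused))  # 7: five speech frames plus two token embeddings
```

In a full system, the fused sequence would be passed on as input to the pretrained MT module; here it is only a list of vectors, so the sketch shows the ordering logic and nothing about training.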