Spoken language identification is the task of automatically predicting the language spoken in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, video data comes with a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which explores the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description, and geographic location provide substantial information for identifying the spoken language of a multimedia recording. We conduct experiments on two diverse public datasets of YouTube videos and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality to language recognition.