We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart of a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tuning). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE