Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tuning). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE
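To make the two inherited annotation types concrete, the following is a minimal sketch of how an intent label and slot annotations might be read and scored. It assumes the bracketed `[slot : value]` convention used for annotated utterances in the MASSIVE text corpus; the example record, the field names `intent` and `annot_utt` as used here, and the helper functions `parse_annotated_utterance` and `intent_accuracy` are illustrative assumptions, not code from the Speech-MASSIVE repository.

```python
import re

# Assumed MASSIVE-style slot markup: "[slot_name : slot value]".
SLOT_PATTERN = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]+?)\s*\]")


def parse_annotated_utterance(annot_utt):
    """Split an annotated utterance into plain text and (slot, value) pairs."""
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annot_utt)]
    # Replace each bracketed span with its bare value to recover the plain text.
    plain = SLOT_PATTERN.sub(lambda m: m.group(2), annot_utt)
    return plain, slots


def intent_accuracy(gold, predicted):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical record, for illustration only.
example = {
    "intent": "alarm_set",
    "annot_utt": "wake me up at [time : nine am] on [date : friday]",
}

plain_text, slot_pairs = parse_annotated_utterance(example["annot_utt"])
print(plain_text)
print(slot_pairs)
print(intent_accuracy(["alarm_set", "play_music"], ["alarm_set", "alarm_set"]))
```

In a cascaded setup, a function like `parse_annotated_utterance` would be applied to text derived from an ASR transcript, whereas an end-to-end model would predict the intent and slots directly from audio; the scoring step is the same in both cases.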
