SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are far fewer SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. To facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will publish for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparison. We present the details of data collection and annotation and the performance of the baseline models. We also perform a sensitivity analysis of the performance of pipeline models (speech recognizer + text model) with respect to speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
