In this paper, we perform an exhaustive evaluation of different representations for the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems on the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., with automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., those that learn from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.