Large pretrained language models have recently demonstrated strong language understanding capabilities, most visibly in their zero-shot and in-context learning abilities on downstream tasks through prompting. To assess their impact on spoken language understanding (SLU), we evaluate several such models, including ChatGPT and OPT at different sizes, on multiple benchmarks. We confirm an emergent ability unique to the largest models: given oracle transcripts, they reach intent classification accuracy close to that of supervised models with zero or few shots across several languages. By contrast, smaller models that fit on a single GPU fall far behind. We note that the error cases often arise from the annotation scheme of the dataset, and that the responses from ChatGPT are still reasonable. We show, however, that the model is worse at slot filling and that its performance is sensitive to ASR errors, suggesting serious challenges for the application of these textual models to SLU.