Large pretrained language models have recently demonstrated strong language understanding capabilities, reflected in particular in their zero-shot and in-context learning abilities on downstream tasks through prompting. To assess their impact on spoken language understanding (SLU), we evaluate several such models of different sizes, including ChatGPT and OPT, on multiple benchmarks. We verify an emergent ability unique to the largest models: given oracle transcripts, they reach intent classification accuracy close to that of supervised models with zero or few shots across several languages. By contrast, smaller models that fit on a single GPU fall far behind. We note that the errors often stem from the annotation scheme of the dataset, and that ChatGPT's responses in those cases are still reasonable. We show, however, that the model is worse at slot filling and that its performance is sensitive to automatic speech recognition (ASR) errors, suggesting serious challenges for applying these text-based models to SLU.
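As a rough illustration of the zero-shot intent classification setting described above, the following sketch builds a prompt from an oracle transcript, maps a model's free-text response back to a label, and computes accuracy. It is a minimal sketch under stated assumptions, not the prompts or evaluation code used in this work: the prompt wording, the function names (`build_zero_shot_prompt`, `parse_intent`, `intent_accuracy`), and the `generate` callable standing in for a language model are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def build_zero_shot_prompt(utterance: str, intents: Sequence[str]) -> str:
    """Format a transcript and the candidate labels as a zero-shot prompt.

    The wording is an assumption made for illustration only.
    """
    options = ", ".join(intents)
    return (
        "Classify the intent of the utterance into one of the following "
        f"labels: {options}.\n"
        f"Utterance: {utterance}\n"
        "Intent:"
    )


def parse_intent(response: str, intents: Sequence[str]) -> Optional[str]:
    """Return the first candidate label mentioned in the response, if any."""
    text = response.strip().lower()
    # Check longer labels first so a short label that is a substring of a
    # longer one does not shadow it.
    for label in sorted(intents, key=len, reverse=True):
        if label.lower() in text:
            return label
    return None


def intent_accuracy(
    examples: Sequence[Tuple[str, str]],
    intents: Sequence[str],
    generate: Callable[[str], str],
) -> float:
    """Fraction of (utterance, gold_intent) pairs the model labels correctly.

    `generate` is a hypothetical stand-in for any text-in, text-out model.
    """
    if not examples:
        return 0.0
    correct = 0
    for utterance, gold in examples:
        response = generate(build_zero_shot_prompt(utterance, intents))
        if parse_intent(response, intents) == gold:
            correct += 1
    return correct / len(examples)
```

A few-shot variant would prepend labeled utterance and intent pairs to the same prompt. Replacing the oracle transcripts in `examples` with ASR hypotheses would give a simple way to probe the sensitivity to recognition errors noted above.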