Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework that requires only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier on top of a self-supervised pre-trained model. We then use a neural audio synthesizer to create audio embeddings for the sample text utterances and perform generalized zero-shot classification on unseen intents using cosine similarity. We also propose a multimodal training strategy that incorporates lexical information into the audio representation to improve zero-shot performance. Compared to audio-only training, our multimodal training approach improves zero-shot intent classification accuracy on unseen intents by 2.75% and 18.2% on the SLURP and internal goal-oriented dialog datasets, respectively.
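The cosine-similarity classification step can be illustrated with a minimal sketch. It assumes that each intent, seen or unseen, is represented by the embeddings of its few synthesized sample utterances, and that a query is assigned to the intent whose closest sample has the highest cosine similarity. The function names, the max-over-samples rule, and the toy 3-dimensional vectors are illustrative assumptions, not details taken from the framework itself.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def classify_intent(
    query: Sequence[float],
    intent_embeddings: Dict[str, List[Sequence[float]]],
) -> str:
    """Return the intent whose best-matching sample embedding is closest to the query.

    Taking the maximum over an intent's samples is an assumption of this
    sketch; averaging the samples into one prototype is another option.
    """
    best_intent, best_score = "", -math.inf
    for intent, samples in intent_embeddings.items():
        score = max(cosine_similarity(query, s) for s in samples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


# Hypothetical toy embeddings standing in for audio-encoder outputs of
# synthesized sample utterances. The candidate set mixes seen and unseen
# intents, which is what makes the setting "generalized" zero-shot.
intent_embeddings = {
    "play_music": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # seen during training
    "set_alarm": [[0.0, 1.0, 0.0]],                    # seen during training
    "order_taxi": [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],  # unseen intent
}
query = [0.2, 0.1, 0.95]  # hypothetical embedding of a spoken test utterance
predicted = classify_intent(query, intent_embeddings)
print(predicted)
```

In practice the vectors would come from the trained audio encoder applied to synthesized audio, and adding a new intent would only require embedding a few sample sentences for it, with no retraining of the classifier.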