Most end-to-end spoken language understanding (SLU) methods depend on pretrained ASR or language model features for intent prediction. However, other essential information in speech, such as prosody, is often ignored. Recent research has shown that incorporating prosodic information improves dialogue act classification. However, the gains from these methods are marginal because the neural models ignore the prosodic features. In this work, we propose prosody-attention, which instead uses the prosodic features to generate attention maps across the time frames of an utterance. We then propose prosody-distillation, which explicitly learns the prosodic information in the acoustic encoder rather than concatenating the implicit prosodic features. Both proposed methods improve on the baseline results, and the prosody-distillation method improves intent classification accuracy by 8% and 2% on the SLURP and STOP datasets, respectively, over the prosody baseline.
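The two ideas named above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names, the single linear projection `weights` used to score frames, and the mean-squared-error objective standing in for the distillation loss. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prosody_attention_pool(acoustic, prosody, weights):
    """Pool acoustic frames with attention weights derived from prosody.

    acoustic: T frames, each a list of D acoustic features.
    prosody:  T frames, each a list of P prosodic features (e.g. pitch, energy).
    weights:  P values; a hypothetical learned projection that scores each frame.
    Returns (attention over the T frames, pooled D-dimensional vector).
    """
    if len(acoustic) != len(prosody):
        raise ValueError("acoustic and prosody must have the same number of frames")
    # One scalar score per time frame, computed from prosodic features only.
    scores = [sum(w * p for w, p in zip(weights, frame)) for frame in prosody]
    attention = softmax(scores)
    dim = len(acoustic[0])
    # Attention-weighted sum of the acoustic frames.
    pooled = [
        sum(a * frame[d] for a, frame in zip(attention, acoustic))
        for d in range(dim)
    ]
    return attention, pooled


def distillation_loss(predicted, target):
    """Mean squared error between predicted and target prosodic features.

    Assumed stand-in for a distillation objective: the acoustic encoder
    predicts frame-level prosody and is penalised for deviating from it.
    """
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(predicted, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    prosody = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1]]
    attention, pooled = prosody_attention_pool(acoustic, prosody, [1.0, 1.0])
    print(attention, pooled)
    print(distillation_loss(prosody, prosody))
```

In this sketch the attention weights depend only on the prosodic stream, so frames with a higher prosodic score contribute more to the pooled utterance representation. The distillation term, by contrast, would be added to the training loss so that the encoder itself carries prosodic information.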