Speech and text representations generated by pre-trained models contain modality-specific information that can be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on continuous integrate-and-fire (CIF), a simple but effective frame-to-token alignment mechanism, to bridge the speech and text representations. During pre-training (PT), it jointly performs speech-to-text training and language model distillation through CIF. Evaluated on the SLURP SLU benchmark, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy on intent classification and by 2.71% in SLU-F1 on slot filling. We also observe that the cross-modal representation extracted by CIF-PT performs better than other neural interfaces for SLU tasks, including the dominant speech representation learned through self-supervised pre-training.
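The abstract names CIF as the frame-to-token alignment but does not spell out its mechanics. As a rough illustration only, the sketch below follows the commonly described form of the technique: each speech frame carries a scalar weight, weights are accumulated left to right, and a token-level vector is emitted ("fired") whenever the accumulated weight reaches a threshold, with the boundary frame's weight split between the current and the next token. The function name `cif`, the default threshold of 1.0, and the choice to drop any leftover weight at the end are assumptions made for this example, not details taken from CIF-PT itself.

```python
from typing import List


def cif(frames: List[List[float]],
        weights: List[float],
        threshold: float = 1.0) -> List[List[float]]:
    """Integrate frame vectors into token vectors (illustrative sketch).

    Assumes each weight lies in [0, threshold], so a single frame can
    trigger at most one firing.
    """
    if len(frames) != len(weights):
        raise ValueError("frames and weights must have the same length")

    dim = len(frames[0]) if frames else 0
    tokens: List[List[float]] = []
    acc_weight = 0.0           # weight accumulated for the current token
    acc_state = [0.0] * dim    # weighted sum of frames for the current token

    for frame, w in zip(frames, weights):
        if acc_weight + w < threshold:
            # Integrate: not enough weight yet, keep accumulating.
            acc_weight += w
            acc_state = [s + w * x for s, x in zip(acc_state, frame)]
        else:
            # Fire: use just enough of this frame's weight to reach the
            # threshold, then carry the remainder into the next token.
            used = threshold - acc_weight
            tokens.append([s + used * x for s, x in zip(acc_state, frame)])
            rest = w - used
            acc_weight = rest
            acc_state = [rest * x for x in frame]

    # Simplification (assumption): leftover weight below the threshold
    # at the end of the sequence is discarded.
    return tokens


if __name__ == "__main__":
    # Three 1-dimensional frames; the second frame straddles a token boundary.
    print(cif([[4.0], [8.0], [4.0]], [0.75, 0.5, 0.75]))
```

In this toy input the second frame contributes 0.25 of its weight to the first token and the remaining 0.25 to the second, which is how the mechanism produces one vector per token from a longer frame sequence. In a trained model the weights would be predicted from the speech encoder output rather than supplied by hand.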