End-to-end (E2E) spoken language understanding (SLU) systems, which generate a semantic parse directly from speech, have recently become more promising. This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models, and it outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still degrade when the quality of the text representation is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that improves robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate these quality encodings into E2E SLU models. We show accuracy improvements on the STOP dataset and present an analysis that demonstrates the effectiveness of our approach.
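To make the idea of confidence-based fusion concrete, the following is a minimal sketch and not the system proposed here. It assumes the ASR model exposes per-token log-probabilities, that a hypothesis-level confidence can be approximated by the geometric mean of the token probabilities, and that fusion is a simple convex combination of same-sized audio and text vectors. The function names `hypothesis_confidence` and `fuse` are hypothetical.

```python
import math
from typing import List, Sequence


def hypothesis_confidence(token_log_probs: Sequence[float]) -> float:
    """Collapse per-token ASR log-probabilities into one score in [0, 1].

    Assumption: the geometric mean of token probabilities is used as a
    stand-in for a learned modality-confidence estimate.
    """
    if not token_log_probs:
        return 0.0  # no text evidence: rely entirely on audio
    return math.exp(sum(token_log_probs) / len(token_log_probs))


def fuse(audio_vec: Sequence[float],
         text_vec: Sequence[float],
         confidence: float) -> List[float]:
    """Blend audio and text representations by text confidence.

    Assumption: both vectors share the same dimension and are mixed by a
    convex combination; a real model would likely learn this gate.
    """
    if len(audio_vec) != len(text_vec):
        raise ValueError("audio and text vectors must have the same length")
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return [c * t + (1.0 - c) * a for a, t in zip(audio_vec, text_vec)]


# Usage: a low-confidence hypothesis pushes the fused vector toward audio.
conf = hypothesis_confidence([math.log(0.25), math.log(1.0)])
fused = fuse([1.0, 2.0], [3.0, 4.0], conf)
```

With this weighting, a hypothesis the recognizer is unsure about contributes less to the fused representation, so downstream parsing leans on the audio features instead.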