Automatic Speech Understanding (ASU) aims at human-like interpretation of speech, inferring intent, emotion, sentiment, and content from both the acoustic signal and the language (text) it conveys. Training a robust ASU model typically requires large-scale, high-quality speech paired with transcriptions. However, collecting or using speech data for ASU training is often difficult because of concerns such as privacy. To enable ASU when the speech (audio) modality is missing, we propose TI-ASU, which uses a pre-trained text-to-speech model to impute the missing speech. We report extensive experiments evaluating TI-ASU across different proportions of missing speech, in both multimodal and single-modality settings, and with the use of LLMs. Our findings show that TI-ASU substantially improves ASU even when up to 95% of the training speech is missing. Moreover, we show that TI-ASU adapts well to dropout training, which improves model robustness to missing speech at inference time.
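The imputation idea can be illustrated with a minimal sketch. This is not the authors' implementation: `Example`, `fake_tts`, `impute_speech`, and `drop_speech` are hypothetical names, `fake_tts` is a toy stand-in for a real pre-trained text-to-speech model, and reading "dropout training" as randomly hiding the speech input during training is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

Waveform = List[float]  # stand-in for an audio array


@dataclass
class Example:
    text: str
    speech: Optional[Waveform]  # None when the recording is missing
    label: str
    synthetic: bool = False     # True once speech has been imputed


def fake_tts(text: str) -> Waveform:
    """Hypothetical stand-in for a pre-trained TTS model: one sample per character."""
    return [ord(ch) / 255.0 for ch in text]


def impute_speech(data: List[Example], tts: Callable[[str], Waveform]) -> List[Example]:
    """Fill every missing speech entry with audio synthesized from its transcript."""
    out = []
    for ex in data:
        if ex.speech is None:
            out.append(Example(ex.text, tts(ex.text), ex.label, synthetic=True))
        else:
            out.append(ex)
    return out


def drop_speech(ex: Example, p: float, rng: random.Random) -> Example:
    """Assumed form of dropout training: with probability p, hide the speech input."""
    if rng.random() < p:
        return Example(ex.text, None, ex.label, ex.synthetic)
    return ex
```

In use, a training set would first be passed through `impute_speech` so that every example has real or synthetic audio, and `drop_speech` would then be applied per training step so the model also sees text-only inputs.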