Recognizing a speaker's level of commitment to a belief is a difficult task; humans not only interpret the meaning of the words in context, but also draw on cues from intonation and other aspects of the audio signal. Many papers and corpora in the NLP community have approached the belief prediction task using text-only approaches. We are the first to frame and present results on the multimodal belief prediction task. We use the CB-Prosody corpus (CBP), which contains aligned text and audio with speaker belief annotations. We first report baselines and identify significant features using acoustic-prosodic features and traditional machine learning methods. We then present text and audio baselines for the CBP corpus, obtained by fine-tuning BERT and Whisper, respectively. Finally, we present our multimodal architecture, which fine-tunes BERT and Whisper and uses multiple fusion methods, improving on either modality alone.
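The abstract does not say which fusion methods are used, so the following is only a minimal sketch of two common options, early fusion (concatenating the two embeddings before a single prediction head) and late fusion (combining per-modality predictions). The function names, the toy vectors standing in for BERT and Whisper embeddings, the linear heads, and the `text_share` mixing weight are all illustrative assumptions, not the architecture described here.

```python
from typing import Sequence

Vector = Sequence[float]


def linear_head(features: Vector, weights: Vector, bias: float = 0.0) -> float:
    """Map a feature vector to a scalar belief score with a linear layer."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


def early_fusion(text_emb: Vector, audio_emb: Vector,
                 weights: Vector, bias: float = 0.0) -> float:
    """Early fusion: concatenate both embeddings, then apply one shared head."""
    fused = list(text_emb) + list(audio_emb)
    return linear_head(fused, weights, bias)


def late_fusion(text_emb: Vector, audio_emb: Vector,
                text_weights: Vector, audio_weights: Vector,
                text_share: float = 0.5) -> float:
    """Late fusion: score each modality separately, then mix the two scores."""
    text_score = linear_head(text_emb, text_weights)
    audio_score = linear_head(audio_emb, audio_weights)
    return text_share * text_score + (1.0 - text_share) * audio_score


# Toy vectors standing in for pooled text and audio embeddings (hypothetical).
text_emb = [1.0, 2.0]
audio_emb = [0.5, -1.0]

early = early_fusion(text_emb, audio_emb, weights=[0.5, 0.25, 1.0, 0.5])
late = late_fusion(text_emb, audio_emb,
                   text_weights=[0.5, 0.25], audio_weights=[1.0, 0.5])
print(early, late)
```

In a real system the heads would be trained jointly with the fine-tuned encoders; the point of the sketch is only where the two modalities are combined, before the head (early) or after it (late).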