We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our approach is multimodal: we transcribe the speech using automatic speech recognition (ASR) models and translate the transcripts into different languages. Subsequently, we combine language-specific BERT-based models with Wav2Vec2.0 audio features in a cascaded cross-attention transformer model. We apply our system to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.
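For readers unfamiliar with the reported metric, UAR is the mean of the per-class recalls, so every class contributes equally regardless of how many samples it has. The following is a minimal sketch of that definition in plain Python. The function name `unweighted_average_recall` and the toy labels are illustrative assumptions, not the challenge's data or its official scoring script.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (UAR) over the classes in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = defaultdict(int)  # number of true samples per class
    hits = defaultdict(int)     # correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        if truth == pred:
            hits[truth] += 1

    # Recall is computed per class, then averaged without class-size weights.
    recalls = [hits[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels for one binary task (not taken from the challenge).
y_true = ["yes", "yes", "yes", "yes", "no", "no"]
y_pred = ["yes", "yes", "yes", "no", "no", "yes"]
uar = unweighted_average_recall(y_true, y_pred)
```

In this toy case the recall is 3/4 for `yes` and 1/2 for `no`, so the UAR is 0.625, whereas plain accuracy would be 4/6. The gap illustrates why UAR is a common choice when classes are imbalanced.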