Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this gap, we propose CUCI-Net. CUCI-Net preserves the structural distinction between context and utterance during encoding, abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Experiments on mainstream benchmark datasets demonstrate the effectiveness of the proposed method.
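
To make the three-stage data flow concrete, a minimal toy sketch follows. It is not the CUCI-Net implementation: the abstract does not specify the operators, so every function name and every design choice here (dot-product attention for local evidence, mean pooling for global evidence, averaging to form the cue, a sigmoid gate for the final interaction) is an assumption made only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vectors(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_evidence(context, utterance):
    """Assumed local evidence: attention-pool one modality's context turns,
    using that modality's utterance vector as the query."""
    weights = softmax([dot(turn, utterance) for turn in context])
    return [sum(w * turn[i] for w, turn in zip(weights, context))
            for i in range(len(utterance))]

def interpretation_cue(contexts, utterances):
    """Assumed cue: average of local (per-modality) and global context evidence."""
    local = mean_vectors(
        [local_evidence(contexts[m], utterances[m]) for m in utterances]
    )
    global_ = mean_vectors([mean_vectors(contexts[m]) for m in utterances])
    return [(l + g) / 2 for l, g in zip(local, global_)]

def cue_conditioned_fusion(utterances, cue):
    """Assumed final interaction: a sigmoid gate derived from the cue
    rescales the mean-fused utterance representation."""
    fused = mean_vectors(list(utterances.values()))
    gate = [1 / (1 + math.exp(-c)) for c in cue]
    return [g * f for g, f in zip(gate, fused)]

# Context and utterance are kept as separate inputs throughout (stage 1).
# Each modality has two context turns and one utterance vector (toy values).
contexts = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[0.5, 0.5], [0.2, 0.8]],
    "visual": [[0.0, 0.0], [1.0, 1.0]],
}
utterances = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [0.5, 0.5]}

cue = interpretation_cue(contexts, utterances)      # stage 2: explicit cue
output = cue_conditioned_fusion(utterances, cue)    # stage 3: cue-conditioned fusion
```

In this sketch the cue is computed once and passed explicitly to the fusion step, which mirrors the idea of making the context-utterance dependency an explicit input to later reasoning rather than leaving it implicit in the encoder. A real model would replace each hand-written operator with learned layers.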