Multimodal sentiment analysis (MSA) aims to predict human sentiment from the textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interactions and assigning different weights to modalities. However, they usually compress diverse sentiment cues into a single compact representation before sentiment reasoning. This early aggregation makes it difficult to preserve the internal structure of sentiment evidence, in which cues may complement or contradict one another, or differ in reliability. In addition, modality importance is often determined only once, during fusion, so later reasoning cannot further adjust modality contributions. To address these issues, we propose PRISM, a framework that unifies structured affective extraction and adaptive modality evaluation. PRISM organizes multimodal evidence in a shared prototype space, which supports structured cross-modal comparison and adaptive fusion. It further applies dynamic modality reweighting during reasoning, allowing modality contributions to be refined continuously as semantic interactions deepen. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
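The two ideas named in the abstract, a shared prototype space and modality weights that are recomputed during reasoning, can be illustrated with a small NumPy sketch. This is not PRISM's actual architecture, which the abstract does not specify. The function names (`to_prototype_space`, `reason_with_reweighting`), the dot-product attention, the agreement score, and the random inputs are all assumptions chosen for illustration, and nothing here is learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def to_prototype_space(tokens, prototypes):
    """Summarize one modality's tokens (T, d) as K prototype slots (K, d).

    Hypothetical helper: each shared prototype attends over the tokens,
    so every modality ends up with the same K-slot layout and can be
    compared slot by slot.
    """
    d = prototypes.shape[1]
    attn = softmax(prototypes @ tokens.T / np.sqrt(d), axis=-1)  # (K, T)
    return attn @ tokens                                          # (K, d)

def reason_with_reweighting(evidence, num_steps=3):
    """Fuse per-modality evidence, recomputing modality weights each step.

    Hypothetical helper: the weight of a modality at each step is a softmax
    over its average agreement (scaled dot product) with the current fused
    state, so the weights are revised repeatedly rather than fixed once.
    """
    names = list(evidence)
    stacked = np.stack([evidence[n] for n in names])      # (M, K, d)
    d = stacked.shape[-1]
    weights = np.full(len(names), 1.0 / len(names))       # start uniform
    state = np.tensordot(weights, stacked, axes=1)        # (K, d)
    history = []
    for _ in range(num_steps):
        # Agreement of each modality with the fused state, averaged over slots.
        agreement = (stacked * state).sum(axis=-1).mean(axis=-1) / np.sqrt(d)
        weights = softmax(agreement)                      # (M,)
        state = np.tensordot(weights, stacked, axes=1)    # refreshed fusion
        history.append(dict(zip(names, weights)))
    return state, history

# Toy usage with random features (assumed shapes, not real data).
rng = np.random.default_rng(0)
d, K = 16, 4
prototypes = rng.normal(size=(K, d))
inputs = {
    "text": rng.normal(size=(10, d)),
    "audio": rng.normal(size=(25, d)),
    "vision": rng.normal(size=(30, d)),
}
evidence = {name: to_prototype_space(tok, prototypes) for name, tok in inputs.items()}
state, history = reason_with_reweighting(evidence, num_steps=3)
```

In this sketch `history` holds one weight dictionary per reasoning step, which makes it easy to inspect how the contribution of each modality shifts as the fused state is updated. A trained model would presumably replace the fixed dot-product scores with learned projections.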