Multimodal sentiment analysis (MSA) infers human affect from language, acoustic, and visual signals. Recent methods increasingly adapt large multimodal models (LMMs) via generative readout: prompting the model to emit a sentiment score as a text string. While convenient, this ties continuous regression to discrete autoregressive decoding, incurring unmeasured costs. We revisit this readout mechanism and propose a discriminative formulation built on the Thinker module of a native omni-modal LLM (Qwen2.5-Omni-7B). Instead of text decoding, we map the final-layer hidden state of the last non-padding token to a continuous score via a lightweight regression head in a single forward pass. Using 4-bit quantization and low-rank adaptation (QLoRA), the entire 7B pipeline -- including video and audio processing -- trains on a single consumer GPU (RTX 5090, 32 GB) with 10-21 GB peak memory and 1.14% trainable parameters. Through a controlled comparison fixing the backbone, data, and LoRA configuration, we isolate the impact of the readout. On CMU-MOSI and CMU-MOSEI, our discriminative readout reaches state-of-the-art accuracy without task-specific feature engineering (MOSI: MAE 0.551, Corr 0.888; MOSEI: MAE 0.506, Corr 0.790) and exhibits strong multi-seed stability. In contrast, the generative readout -- even after equivalent supervised training -- more than doubles the mean absolute error, yields unparsable or out-of-range outputs (2.8% zero-shot), and suffers from higher latency. Modality ablations reveal a text-dominant regime on CMU-MOSI. Our findings indicate that how an LMM is read out is as consequential as how it is trained, demonstrating that a discriminative readout offers a more accurate, efficient, and reliable alternative for continuous MSA.
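To make the two readouts concrete, the following is a minimal sketch rather than the authors' implementation. It uses plain Python lists in place of framework tensors, and it assumes the regression head is a single linear layer, which the abstract does not specify. The function names, the toy weights, and the parsing rule are hypothetical; the [-3, 3] range is the standard label range of CMU-MOSI and CMU-MOSEI.

```python
import re

# Sentiment labels in CMU-MOSI and CMU-MOSEI lie in [-3, 3].
SCORE_MIN, SCORE_MAX = -3.0, 3.0


def last_non_padding_index(attention_mask):
    """Return the index of the last position whose mask value is 1.

    Scanning from the right handles both left- and right-padded sequences.
    """
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("sequence contains only padding")


def discriminative_readout(hidden_states, attention_mask, weight, bias):
    """Map the final-layer hidden state of the last non-padding token to a score.

    Assumption: the head is one linear layer (dot product plus bias).
    hidden_states is a list of per-token vectors from the final layer.
    """
    h = hidden_states[last_non_padding_index(attention_mask)]
    return sum(w * x for w, x in zip(weight, h)) + bias


_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_generated_score(text):
    """Parse a generated string into a score (hypothetical parsing rule).

    Returns None when the text is not a single number or the number falls
    outside the label range, the two failure modes of a generative readout.
    """
    match = _NUMBER.fullmatch(text.strip())
    if match is None:
        return None
    value = float(match.group())
    return value if SCORE_MIN <= value <= SCORE_MAX else None


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # toy final-layer states
    mask = [1, 1, 0]                               # last token is padding
    print(discriminative_readout(hidden, mask, [0.5, -0.25], 0.1))
    for text in ["1.5", "positive", "4.2"]:
        print(repr(text), "->", parse_generated_score(text))
```

The discriminative path always returns a real number from one forward pass, whereas the generative path needs a parser that can reject its output. That difference is the reliability gap the abstract reports.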