Beyond Generative Decoding: Discriminative Hidden-State Readout from a Native Omni-Modal LLM for Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) infers human affect from language, acoustic, and visual signals. Recent methods increasingly adapt large multimodal models (LMMs) via generative readout, prompting the model to emit a sentiment score as a text string. While convenient, this ties continuous regression to discrete autoregressive decoding, at a cost that has so far gone unmeasured. We revisit this readout mechanism and propose a discriminative formulation built on the Thinker module of a native omni-modal LLM (Qwen2.5-Omni-7B). Instead of decoding text, we map the final-layer hidden state of the last non-padding token to a continuous score with a lightweight regression head, in a single forward pass. With 4-bit quantization and low-rank adaptation (QLoRA), the entire 7B pipeline, including video and audio processing, trains on a single consumer GPU (RTX 5090, 32 GB) with 10 to 21 GB of peak memory and 1.14% trainable parameters. A controlled comparison that fixes the backbone, data, and LoRA configuration isolates the effect of the readout. On CMU-MOSI and CMU-MOSEI, the discriminative readout reaches state-of-the-art accuracy without task-specific feature engineering (MOSI: MAE 0.551, Corr 0.888; MOSEI: MAE 0.506, Corr 0.790) and is stable across random seeds. In contrast, the generative readout, even after equivalent supervised training, more than doubles the mean absolute error, produces unparsable or out-of-range outputs (2.8% in the zero-shot setting), and incurs higher latency. Modality ablations reveal a text-dominant regime on CMU-MOSI. Our findings indicate that how an LMM is read out is as consequential as how it is trained, and that a discriminative readout is a more accurate, efficient, and reliable alternative for continuous MSA.
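
To make the contrast between the two readouts concrete, the following is a minimal, dependency-free sketch and not the paper's implementation. It assumes a single linear layer as the regression head (the abstract says only "lightweight"), a label range of [-3, 3] following the usual CMU-MOSI/CMU-MOSEI convention, and hypothetical function names; hidden states are plain Python lists standing in for model tensors.

```python
import re
from typing import Optional, Sequence

# Assumed label range (usual CMU-MOSI / CMU-MOSEI convention).
SCORE_MIN, SCORE_MAX = -3.0, 3.0


def last_non_padding_index(attention_mask: Sequence[int]) -> int:
    """Return the index of the last position whose mask value is 1."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("sequence contains only padding")


def discriminative_readout(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    weight: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Apply a linear head (an assumption) to the last non-padding token's state."""
    h = hidden_states[last_non_padding_index(attention_mask)]
    return sum(w * x for w, x in zip(weight, h)) + bias


_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_generative_output(text: str) -> Optional[float]:
    """Parse a decoded string into a score; None if unparsable or out of range."""
    match = _NUMBER.search(text)
    if match is None:
        return None  # unparsable: no number in the generated text
    score = float(match.group())
    return score if SCORE_MIN <= score <= SCORE_MAX else None


if __name__ == "__main__":
    # Toy final-layer states for three positions; the last one is padding.
    hidden = [[0.1, 0.2], [0.5, -1.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    print(discriminative_readout(hidden, mask, weight=[2.0, 1.0], bias=0.5))
    print(parse_generative_output("The sentiment score is 1.8"))
    print(parse_generative_output("very positive"))
```

The discriminative path always returns a real number from one forward pass, whereas the generative path depends on the decoded string containing a valid, in-range number, which is the failure mode the abstract quantifies.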
