MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding because visual patch representations are semantically entangled: a single patch blends multiple distinct visual elements, which makes it difficult for the model to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level, applying multiplicative modulation to already-aligned features and thereby enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features to generate dynamic modulation masks, without architectural modifications or additional supervision. We evaluate MoDA on 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), across three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families: +12.0 points on MMVP for LLaVA-1.5; +4.8 points on ScienceQA for LLaVA-MoRE; and +4.9 on ScienceQA, +4.1 on RealWorldQA, and +3.8 on GQA for Qwen3-VL. These results indicate that the gains generalize beyond CLIP-based encoders, with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.
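
To make the idea of instruction-guided channel-wise modulation concrete, the following is a minimal sketch in plain Python. It is not the released MoDA implementation. Several details are assumptions made only for illustration, because the abstract does not specify them: the visual patches act as attention queries and the instruction tokens as keys and values, learned projection matrices are omitted, and the mask is obtained with a sigmoid so that each channel is scaled by a value between 0 and 1. The names `modulate`, `visual`, and `text` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def modulate(visual, text):
    """Sketch of channel-wise multiplicative modulation (assumed design).

    visual: N patches x D channels (already aligned to the LLM space)
    text:   M instruction tokens x D channels
    Returns the modulated visual features and the mask, both N x D.
    """
    d = len(visual[0])
    # Cross-attention: each visual patch attends over the instruction tokens.
    scores = matmul(visual, transpose(text))                      # N x M
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(attn, text)                                  # N x D
    # Assumption: a sigmoid turns the attended context into a per-channel mask.
    mask = [[sigmoid(c) for c in row] for row in context]
    # Multiplicative (not additive) modulation of every channel of every patch.
    out = [[v * m for v, m in zip(vrow, mrow)]
           for vrow, mrow in zip(visual, mask)]
    return out, mask


random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 patches
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]    # 3 tokens
modulated, mask = modulate(visual, text)
```

In this sketch the number of visual tokens is unchanged; only the per-channel magnitudes are rescaled. This illustrates the contrast the abstract draws between channel-level modulation and token-level selection. A practical version would presumably add learned projections and run on batched tensors.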
