Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

While multi-modal large language models (MLLMs) have made significant progress on complex reasoning tasks via reinforcement learning, it is commonly believed that improving multi-modal reasoning ability requires extensive training data, which inevitably leads to data redundancy and substantial computational cost. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute only marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning using two complementary estimators: 1) the Causal Discrepancy Estimator (CDE), which is based on the potential outcome model principle and eliminates samples that rely too heavily on language priors by comparing outputs between multi-modal and text-only inputs; 2) the Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) that substitutes trivial instances with cognitively challenging ones, ensuring sufficient complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
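As a rough illustration of how the three components described above could fit together, the following Python sketch filters a toy dataset in three passes. It is a sketch under stated assumptions, not the method itself: the abstract gives no formulas, so the per-sample scores (p_multimodal, p_text_only, attn_confidence, difficulty), the thresholds, the function name select_cognitive_samples, and the rule of drawing replacements from previously rejected samples are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str
    p_multimodal: float     # hypothetical: confidence in the correct answer given image + text
    p_text_only: float      # hypothetical: the same confidence with the image removed
    attn_confidence: float  # hypothetical: share of attention on relevant tokens, in [0, 1]
    difficulty: float       # hypothetical: 0 = trivial, 1 = very hard


def select_cognitive_samples(samples, min_gap=0.2, min_attn=0.5, min_difficulty=0.3):
    """Toy three-stage selection; all thresholds are illustrative assumptions."""
    # CDE-style step: drop samples the model answers about as well without the
    # image, i.e. samples that appear to rely on language priors.
    kept = [s for s in samples if s.p_multimodal - s.p_text_only >= min_gap]

    # ACE-style step: drop samples whose attention is dominated by irrelevant tokens.
    kept = [s for s in kept if s.attn_confidence >= min_attn]

    # DRM-style step: remove trivial survivors and fill their slots with the
    # hardest samples that were not selected (an assumed replacement rule).
    trivial = [s for s in kept if s.difficulty < min_difficulty]
    hard_enough = [s for s in kept if s.difficulty >= min_difficulty]
    kept_ids = {s.sample_id for s in kept}
    candidates = sorted(
        (s for s in samples
         if s.sample_id not in kept_ids and s.difficulty >= min_difficulty),
        key=lambda s: s.difficulty,
        reverse=True,
    )
    return hard_enough + candidates[: len(trivial)]


toy_data = [
    Sample("a", 0.90, 0.30, 0.8, 0.6),  # passes both filters, non-trivial
    Sample("b", 0.90, 0.85, 0.9, 0.7),  # small multi-modal gap: language prior
    Sample("c", 0.80, 0.20, 0.2, 0.9),  # low attention confidence
    Sample("d", 0.95, 0.40, 0.7, 0.1),  # passes both filters but is trivial
]

selected = select_cognitive_samples(toy_data)
print([s.sample_id for s in selected])  # ['a', 'c']
```

In this toy run, sample d survives both filters but is trivial, so it is swapped for c, the hardest unselected sample. Whether replacements should themselves pass the CDE and ACE checks is a design choice the abstract does not settle.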
