Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training, leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which incurs laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. We therefore propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR comprises two key components: (1) the Soft Discrimination Score, a novel metric that robustly tracks ability development without fine-tuning by quantifying nuanced gradations of the model's preference for the correct answer over distractors; and (2) the Multi-Modal Mixture Benchmark, a new 15K+-sample benchmark for comprehensively evaluating pre-trained MLLMs' perception and reasoning abilities in a zero-shot manner, in which we unify authoritative benchmark datasets and carefully collect new ones, extending the evaluation scope and addressing critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perception and reasoning capabilities in pre-trained MLLMs across diverse factors, including data volume, model size, and pre-training strategy. RADAR underscores the need for a decomposed view of pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.
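To make the idea behind the Soft Discrimination Score concrete, the sketch below shows one plausible realization of "quantifying nuanced gradations of the model's preference for the correct answer over distractors": scoring each candidate answer by its (length-normalized) log-likelihood under the model and taking a softmax over the options, so the score is continuous rather than a hard argmax accuracy. This is an illustrative assumption, not the paper's exact formula; the function name `soft_discrimination_score` and the softmax formulation are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_discrimination_score(option_logprobs: torch.Tensor, correct_idx: int) -> float:
    """Illustrative sketch (assumed, not the paper's exact definition).

    Args:
        option_logprobs: shape (num_options,), e.g. length-normalized
            sequence log-likelihoods the pre-trained MLLM assigns to each
            candidate answer, no fine-tuning or decoding required.
        correct_idx: index of the ground-truth option.

    Returns:
        The preference mass on the correct option, in [0, 1].
    """
    # A softmax over option likelihoods turns raw log-probabilities into a
    # preference distribution; the mass on the correct option captures
    # graded preference over distractors instead of a binary argmax hit.
    preference = F.softmax(option_logprobs, dim=-1)
    return preference[correct_idx].item()

# Hypothetical usage: four candidate answers, option 1 is correct.
logprobs = torch.tensor([-2.3, -1.1, -2.0, -3.5])
print(soft_discrimination_score(logprobs, correct_idx=1))  # ~0.56
```

Because the score only requires likelihoods of fixed answer strings, it sidesteps both supervised fine-tuning and autoregressive decoding, which is the efficiency argument the abstract makes.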