MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing

Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads range from lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial because of modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model references, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these signals improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter-Wrynn/MMR-Bench.
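
To make the routing setup concrete, the following is a minimal sketch of query-level routing under a cost model, together with an oracle upper bound. It is not the MMR-Bench code or API. The names `Candidate`, `route`, `oracle_route`, and `evaluate`, the trade-off weight `lam`, the utility form (predicted correctness minus `lam` times cost), and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float  # hypothetical per-query cost in arbitrary units


def route(scores, candidates, lam):
    """Pick the candidate maximizing predicted correctness minus lam * cost.

    `scores` maps model name -> predicted probability of a correct answer
    (assumed to come from some router model, not shown here).
    """
    return max(candidates, key=lambda c: scores[c.name] - lam * c.cost)


def oracle_route(correct, candidates):
    """Oracle: cheapest candidate that is actually correct on this query.

    Falls back to the cheapest candidate overall when none is correct.
    """
    winners = [c for c in candidates if correct[c.name]]
    return min(winners or candidates, key=lambda c: c.cost)


def evaluate(choices, correctness):
    """Return (accuracy, mean cost) for one routing decision per query."""
    n = len(choices)
    acc = sum(correctness[i][c.name] for i, c in enumerate(choices)) / n
    mean_cost = sum(c.cost for c in choices) / n
    return acc, mean_cost


# Toy candidate set and per-query correctness labels (made up).
small = Candidate("small", cost=1.0)
large = Candidate("large", cost=10.0)
candidates = [small, large]
correctness = [
    {"small": True, "large": True},
    {"small": False, "large": True},
    {"small": True, "large": False},
    {"small": False, "large": False},
]

oracle_choices = [oracle_route(q, candidates) for q in correctness]
oracle_acc, oracle_cost = evaluate(oracle_choices, correctness)
large_acc, large_cost = evaluate([large] * len(correctness), correctness)
```

On this toy data the oracle reaches higher accuracy than always using `large` at a lower mean cost, which is the kind of gap a learned routing policy tries to close. Sweeping `lam` in `route` would trace out one cost-accuracy curve for a given set of predicted scores.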
