Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, either requiring multi-pass prefill stages or relying on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns more precise localization. This RPN is also highly efficient: it predicts the RoI in a single forward pass using features from the MLLM's middle layers, which decouples RoI identification from auto-regressive generation and avoids costly multi-pass operations. To validate our approach, we integrate the framework into multiple MLLM families. Despite being trained on only a few (e.g., 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving an absolute accuracy improvement of over 10% on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
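The abstract does not spell out how attention maps become pseudo-RoI labels, so the following is only a minimal sketch of one plausible reading, not the authors' pipeline. It assumes a 2D grid of attention scores over visual tokens and uses two hypothetical thresholds, `hi` and `lo`: tokens above `hi` are labeled RoI, tokens below `lo` are labeled background, and the ambiguous tokens in between are marked `IGNORE` so that a downstream predictor would not be trained on them. The function names and threshold values are illustrative assumptions.

```python
IGNORE = -1  # assumed label for ambiguous tokens excluded from training


def pseudo_roi_labels(attn, hi=0.6, lo=0.2):
    """Turn a 2D attention grid into per-token labels.

    Returns a grid of 1 (RoI), 0 (background) or IGNORE (ambiguous).
    The thresholds `hi` and `lo` are hypothetical, not from the paper.
    """
    flat = [v for row in attn for v in row]
    a_min, a_max = min(flat), max(flat)
    span = a_max - a_min
    if span == 0:
        # A flat map carries no localization signal: ignore every token.
        return [[IGNORE for _ in row] for row in attn]

    labels = []
    for row in attn:
        out = []
        for v in row:
            norm = (v - a_min) / span  # min-max normalize to [0, 1]
            if norm >= hi:
                out.append(1)
            elif norm <= lo:
                out.append(0)
            else:
                out.append(IGNORE)
        labels.append(out)
    return labels


def roi_box(labels):
    """Bounding box (row_min, col_min, row_max, col_max) of RoI tokens, or None."""
    cells = [(r, c) for r, row in enumerate(labels)
             for c, v in enumerate(row) if v == 1]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 4x4 attention map: a strong 2x2 blob plus one mid-strength outlier.
attn = [
    [0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
labels = pseudo_roi_labels(attn)
box = roi_box(labels)
```

In this toy case the mid-strength token at the top-left is neither confidently foreground nor background, so it is left out of the labels rather than being forced into a class; the resulting box covers only the strong blob.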

