While Multimodal Large Language Models (MLLMs) excel at cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To address this, we propose LazyMCoT, a dynamic, training-free framework that adaptively allocates visual grounding effort based on sample difficulty. The framework features an Adaptive Routing mechanism that estimates predictive uncertainty from first-token statistics obtained in a single forward pass. This lets confident cases bypass grounding efficiently, while conformal calibration ensures the recall of difficult samples. For these difficult cases, a Collaborative Grounding module integrates the model's inherent cross-modal attention with an external visual expert through a two-stage refinement process, which generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches while simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at https://github.com/TencentBAC/LazyMCoT.
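The routing idea described above can be illustrated with a minimal sketch. It is not the released implementation: the abstract does not say which first-token statistics are used, so Shannon entropy of the first-token distribution is assumed here, and the function names `first_token_entropy`, `calibrate_threshold`, and `route` are hypothetical. The threshold is chosen with a standard split-conformal quantile over the scores of calibration samples known to be difficult, so that, under exchangeability, a new difficult sample is routed to grounding with probability at least 1 - alpha.

```python
import math


def first_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax over first-token logits.

    Assumption: entropy stands in for the paper's "first-token statistics".
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def calibrate_threshold(hard_scores, alpha):
    """Conformal-style threshold from uncertainty scores of difficult samples.

    Returns the k-th smallest score with k = floor(alpha * (n + 1)), so a new
    exchangeable difficult sample scores >= threshold with prob >= 1 - alpha.
    """
    n = len(hard_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration samples for this alpha: route everything.
        return float("-inf")
    return sorted(hard_scores)[k - 1]


def route(logits, tau):
    """Send uncertain samples to grounding; answer confident ones directly."""
    return "ground" if first_token_entropy(logits) >= tau else "direct"


# Hypothetical calibration scores for samples the base model got wrong.
hard_scores = [0.9, 0.4, 1.2, 0.7, 1.0, 0.6, 1.1]
tau = calibrate_threshold(hard_scores, alpha=0.25)

decision_uncertain = route([0.0, 0.0, 0.0, 0.0], tau)   # flat distribution
decision_confident = route([10.0, 0.0, 0.0, 0.0], tau)  # peaked distribution
```

In this sketch, lowering `alpha` lowers the threshold, which sends more samples through the costlier grounding path in exchange for a higher recall of difficult cases.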