Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. However, existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs show substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, indicating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold .