Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically requires incorporating complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity, deviates from the generalist design of MLLMs, and ultimately limits their practicality. In this work, we challenge this paradigm by adapting standard MLLMs to perform dense prediction without additional task-specific decoders. The proposed model, DenseMLLM, retains the standard MLLM architecture and introduces a novel vision-token supervision strategy that accommodates multiple labels and tasks. Despite its minimalist design, DenseMLLM achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.
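The abstract names a vision-token supervision strategy but does not specify its mechanism. As a purely illustrative sketch, not the paper's actual method, one decoder-free way to realize such supervision is to attach only linear per-task readouts to the LLM's hidden states at the image-token positions and train them with standard dense losses; all names here (`VisionTokenSupervision`, `seg_head`, `depth_head`, `grid_hw`) and the specific loss choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionTokenSupervision(nn.Module):
    """Hypothetical sketch: supervise an MLLM's vision tokens directly.

    Each image-patch token from the LLM's last hidden layer is mapped by a
    single linear layer to a per-patch prediction (class logits for
    segmentation, a scalar for depth). Predictions are upsampled to label
    resolution, so no task-specific decoder is involved.
    """

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.seg_head = nn.Linear(hidden_dim, num_classes)  # per-patch class logits
        self.depth_head = nn.Linear(hidden_dim, 1)          # per-patch depth value

    def forward(self, vision_tokens, grid_hw, out_hw):
        # vision_tokens: (B, H*W, D) hidden states at image-token positions
        b, n, _ = vision_tokens.shape
        h, w = grid_hw
        assert n == h * w, "token count must match the patch grid"
        seg = self.seg_head(vision_tokens).transpose(1, 2).reshape(b, -1, h, w)
        depth = self.depth_head(vision_tokens).transpose(1, 2).reshape(b, 1, h, w)
        # Bilinear upsampling to pixel resolution (a common, simple choice).
        seg = F.interpolate(seg, size=out_hw, mode="bilinear", align_corners=False)
        depth = F.interpolate(depth, size=out_hw, mode="bilinear", align_corners=False)
        return seg, depth


def dense_losses(seg_logits, depth_pred, seg_gt, depth_gt):
    # Cross-entropy for segmentation, L1 for depth (illustrative choices only).
    loss_seg = F.cross_entropy(seg_logits, seg_gt, ignore_index=255)
    loss_depth = F.l1_loss(depth_pred.squeeze(1), depth_gt)
    return loss_seg + loss_depth
```

Under these assumptions, the two heads add only two weight matrices on top of the unchanged MLLM backbone, which is consistent with the abstract's claim of avoiding architectural specialization.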