Despite the significant advances achieved by Vision-Language Models (VLMs), current architectures often fail to retain fine-grained visual information, resulting in coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by treating visual signals merely as passive conditional inputs rather than as supervisory targets. To mitigate this, we introduce Youtu-VL, a framework built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from ``vision-as-input'' to ``vision-as-target.'' By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
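To make the ``vision-as-target'' shift concrete, the following is a minimal formal sketch; the notation ($v$, $w$, $x$, $\theta$) is ours for illustration and assumes the image is discretized into tokens that can appear on the output side. Conventional VLM training conditions on the visual tokens $v_{1:m}$ while supervising only the text tokens $w_{1:n}$:
\[
\mathcal{L}_{\text{text}} = -\sum_{t=1}^{n} \log p_\theta\!\left(w_t \mid v_{1:m},\, w_{<t}\right),
\]
whereas unified autoregressive supervision in the spirit of VLUAS extends the next-token loss over the full interleaved sequence $x_{1:T}$ of visual and text tokens:
\[
\mathcal{L}_{\text{unified}} = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right).
\]
Under this objective, visual positions contribute to the gradient as well, so the model is penalized for failing to predict fine-grained visual content rather than only for errors on the text.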