Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as "Depth Order", address these weaknesses? To investigate, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input. It uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks (+7% on MMVP and +10% on CV-Bench-3D) while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to the visual perception bottleneck in VLMs and that synthetic supervision is a promising path toward more systematic training for VLMs.
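The control flow of the pipeline described above (task name in, verified image-question-answer triples out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Spec`, `Triple`, `build_dataset`, `propose`, `render`, and `verify` are hypothetical, the three model calls are passed in as plain callables so that no particular LLM, T2I, or VLM API is implied, and discarding samples that fail verification within a fixed attempt budget is an assumed policy that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Spec:
    """Text-only output of the (hypothetical) LLM stage."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass
class Triple:
    """One image-question-answer training example."""
    image: bytes
    question: str
    answer: str


def build_dataset(
    task: str,
    n_target: int,
    propose: Callable[[str, int], Spec],          # stand-in for the LLM
    render: Callable[[str], bytes],               # stand-in for the T2I model
    verify: Callable[[bytes, str, str], bool],    # stand-in for the VLM check
    max_attempts: Optional[int] = None,
) -> List[Triple]:
    """Collect up to n_target verified triples for a single task keyword."""
    if max_attempts is None:
        # Assumed budget: allow a few failed generations per kept sample.
        max_attempts = 5 * n_target

    kept: List[Triple] = []
    for attempt in range(max_attempts):
        if len(kept) >= n_target:
            break
        # 1. Generate a question, an answer, and a T2I prompt from the task name.
        spec = propose(task, attempt)
        # 2. Synthesize an image from the prompt.
        image = render(spec.t2i_prompt)
        # 3. Keep the sample only if the verifier accepts it (assumed policy).
        if verify(image, spec.question, spec.answer):
            kept.append(Triple(image, spec.question, spec.answer))
    return kept
```

Passing the model stages as callables keeps the sketch runnable with toy stand-ins and leaves the choice of actual models open; calling `build_dataset("Depth Order", 1000, ...)` once per task would yield a dataset of the shape described above.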