In the language domain, large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which a pretrained LLM and a pretrained VDM are each equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.