From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning: introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and then letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts ranging from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
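The anchor-swap idea described in the abstract can be illustrated with a toy rollout. This is only a minimal sketch under stated assumptions, not the authors' implementation: latents are plain lists of floats, the context memory is modeled as a pinned first slot plus a sliding window of recent chunks, and `rollout_with_spawn`, `mean_model`, `inject_start`, `inject_len`, and `memory_size` are all hypothetical names introduced here for illustration.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

Latent = List[float]  # stand-in for a real latent tensor (assumption)


def rollout_with_spawn(
    reference: Latent,
    concept: Latent,
    generate_chunk: Callable[[Sequence[Latent]], Latent],
    num_chunks: int,
    inject_start: int,
    inject_len: int,
    memory_size: int = 4,
) -> Tuple[List[Latent], List[str]]:
    """Toy autoregressive rollout that swaps the pinned anchor for a short window."""
    # Slot 0 of the context is the pinned anchor; the rest is a sliding window.
    recent: deque = deque(maxlen=memory_size - 1)
    chunks: List[Latent] = []
    anchors_used: List[str] = []
    for step in range(num_chunks):
        in_window = inject_start <= step < inject_start + inject_len
        # Inside the injection window the concept latent replaces the anchor;
        # outside it, the original reference frame returns to slot 0.
        anchor = concept if in_window else reference
        context = [anchor, *recent]
        chunk = generate_chunk(context)
        recent.append(chunk)  # generated chunks feed back into memory
        chunks.append(chunk)
        anchors_used.append("concept" if in_window else "reference")
    return chunks, anchors_used


def mean_model(context: Sequence[Latent]) -> Latent:
    """Hypothetical stand-in for the backbone: element-wise mean of the context."""
    n = len(context)
    return [sum(vals) / n for vals in zip(*context)]


if __name__ == "__main__":
    chunks, anchors = rollout_with_spawn(
        reference=[0.0], concept=[1.0], generate_chunk=mean_model,
        num_chunks=6, inject_start=2, inject_len=2,
    )
    for step, (a, c) in enumerate(zip(anchors, chunks)):
        print(step, a, c)
```

In this toy setting the concept's contribution remains nonzero in chunks generated after the reference anchor has returned, because earlier chunks stay in the sliding memory. That loosely mirrors the propagation "via the model's own memory" described above; a real backbone would of course behave very differently from an averaging function.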

