Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning: introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and then restoring the original anchor, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts ranging from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
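The anchor swap described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the function names, the list-of-floats latents, the sliding-window memory, and the averaging "generator" are all hypothetical stand-ins chosen so the example runs without a video backbone. It only shows the control flow suggested by the abstract, in which slot 0 of the context holds the reference latent except during a short injection window, when it holds the concept latent.

```python
from typing import Callable, List, Sequence

Latent = List[float]  # hypothetical stand-in for a latent tensor


def rollout_with_spawn(
    reference: Latent,
    concept: Latent,
    generate_chunk: Callable[[Sequence[Latent]], Latent],
    num_chunks: int,
    inject_start: int,
    inject_len: int,
    memory_size: int = 3,
) -> List[Latent]:
    """Autoregressive rollout whose pinned anchor (slot 0) is swapped
    for the concept latent during [inject_start, inject_start + inject_len)."""
    recent: List[Latent] = []   # previously generated chunks kept in memory
    outputs: List[Latent] = []
    keep = memory_size - 1      # slots left after the pinned anchor
    for step in range(num_chunks):
        in_window = inject_start <= step < inject_start + inject_len
        anchor = concept if in_window else reference
        context = [anchor] + recent          # anchor always occupies slot 0
        chunk = generate_chunk(context)
        outputs.append(chunk)
        recent = (recent + [chunk])[-keep:] if keep > 0 else []
    return outputs


def toy_generate(context: Sequence[Latent]) -> Latent:
    """Toy generator (assumption): element-wise mean of the context latents."""
    n = len(context)
    return [sum(values) / n for values in zip(*context)]


if __name__ == "__main__":
    chunks = rollout_with_spawn(
        reference=[0.0], concept=[1.0], generate_chunk=toy_generate,
        num_chunks=6, inject_start=2, inject_len=2,
    )
    for step, chunk in enumerate(chunks):
        print(step, round(chunk[0], 3))
```

In this toy setting the concept's contribution stays nonzero in the chunks generated after the window closes, because those chunks still attend to memory entries produced while the concept occupied the anchor slot. That mirrors, in a very simplified form, the claim that the concept propagates through the model's own memory once the original anchor returns.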