Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
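To make the retrieval idea concrete, the following is a minimal sketch of relevance-driven retrieval over a bank of memory tokens. It is an illustration under stated assumptions, not HyDRA's actual formulation: the `MemoryToken` fields, the additive score in `relevance`, the decay constants `tau` and `sigma`, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MemoryToken:
    """One compressed memory entry (all fields are illustrative assumptions)."""
    label: str
    feature: Sequence[float]        # content descriptor of the stored observation
    time: float                     # frame index at which the token was written
    position: Tuple[float, float]   # rough scene location of the observation


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(query: MemoryToken, token: MemoryToken,
              tau: float = 10.0, sigma: float = 2.0) -> float:
    """Hypothetical spatiotemporal relevance: content + temporal + spatial terms."""
    content = cosine(query.feature, token.feature)
    temporal = math.exp(-abs(query.time - token.time) / tau)
    spatial = math.exp(-math.dist(query.position, token.position) / sigma)
    return content + temporal + spatial


def retrieve(query: MemoryToken, memory: List[MemoryToken], k: int = 2) -> List[MemoryToken]:
    """Return the k memory tokens most relevant to the query, best first."""
    ranked = sorted(memory, key=lambda tok: relevance(query, tok), reverse=True)
    return ranked[:k]


# Toy memory bank: A is an old view of the subject, B is an unrelated distant
# region, C is a recent nearby view of the subject.
memory = [
    MemoryToken("A", (1.0, 0.0), time=0.0, position=(0.0, 0.0)),
    MemoryToken("B", (0.0, 1.0), time=8.0, position=(5.0, 5.0)),
    MemoryToken("C", (1.0, 0.1), time=9.0, position=(0.5, 0.0)),
]
query = MemoryToken("q", (1.0, 0.0), time=10.0, position=(0.0, 0.0))

selected = retrieve(query, memory, k=2)
print([tok.label for tok in selected])
```

In this toy setup the recent nearby token ranks ahead of the older one, and the unrelated distant token is dropped. A learned model would presumably score relevance with attention rather than a hand-written formula; the additive score here is only a stand-in.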