The ability to imagine future trajectories is key for robots to plan soundly and reach their goals. Text-conditioned video prediction (TVP) is therefore an essential task for general robot policy learning. To tackle this task and give robots the ability to foresee the future, we propose a sample- and computation-efficient model, named \textbf{Seer}, built by inflating pretrained text-to-image (T2I) Stable Diffusion models along the temporal axis. We enhance the U-Net and the language conditioning model with computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that decomposes a sentence's global instruction into temporally aligned sub-instructions, so that each generated frame is conditioned on the sub-instruction that matches it. This framework lets us effectively leverage the extensive prior knowledge embedded in pretrained T2I models across frames. With this adaptable architecture, Seer can generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning only a few layers on a small amount of data. Experiments on the Something Something V2 (SSv2), Bridgedata, and EpicKitchens-100 datasets demonstrate superior video prediction performance at around 480 GPU hours of training, versus over 12,480 GPU hours for CogVideo: Seer achieves a 31% FVD improvement over the current SOTA model on SSv2 and an 83.7% average preference in human evaluation.
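To give a rough sense of why a spatial-temporal attention layout can be cheaper than full attention over all video tokens, the sketch below counts query-key pairs for two layouts. It is an illustration under stated assumptions, not Seer's actual attention design: the abstract does not specify the layout, so the code assumes a common factorization (spatial attention within each frame, then temporal attention across frames at each spatial position). The function name `attention_pairs` and the example sizes are hypothetical.

```python
def attention_pairs(num_frames: int, tokens_per_frame: int) -> dict:
    """Count query-key pairs for two attention layouts over video tokens.

    Assumption: a generic factorized layout, not Seer's exact design.
    """
    t, n = num_frames, tokens_per_frame

    # Joint attention: every token attends to every token in every frame.
    joint = (t * n) ** 2

    # Factorized attention:
    #   spatial  - tokens attend only within their own frame (t frames, n*n pairs each)
    #   temporal - each spatial position attends across frames (n positions, t*t pairs each)
    spatial = t * n * n
    temporal = n * t * t
    factorized = spatial + temporal

    return {
        "joint": joint,
        "factorized": factorized,
        "ratio": joint / factorized,
    }


if __name__ == "__main__":
    # Hypothetical sizes: 12 frames, 32 x 32 latent tokens per frame.
    stats = attention_pairs(num_frames=12, tokens_per_frame=32 * 32)
    print(stats)
```

Under these assumptions the joint layout grows quadratically in the total token count, while the factorized layout grows quadratically only in the per-frame and per-position counts separately, so the gap widens as frames are added.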