Long-term video generation and prediction remain challenging tasks in computer vision, particularly in partially observable scenarios where the camera is mounted on a moving platform. The interaction between the observed image frames and the motion of the recording agent introduces additional complexity. To address these issues, we introduce the Action-Conditioned Video Generation (ACVG) framework, a novel approach that investigates the relationship between actions and generated image frames through a deep dual Generator-Actor architecture. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments. We evaluate the framework on an indoor robot motion dataset consisting of sequences of image frames paired with the sequences of actions taken by the robotic agent, and we present a comprehensive empirical study comparing ACVG with other state-of-the-art frameworks, along with a detailed ablation study.