Despite recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is the class of egocentric instructional videos, in which intricate hand motion, coupled with a mostly stable and non-distracting environment, is necessary to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method incorporates two distinct innovations. First, we propose an automatic method to generate the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss to guide the diffusion model toward smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D, demonstrating significant improvements over state-of-the-art methods in instructional clarity, especially of the hand motion in the target region, across diverse environments and actions. Video results can be found on the project webpage: https://excitedbutter.github.io/Instructional-Video-Generation/