Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target and cross-modal attention enables mutual information flow. Second, to ensure precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields, leading to harmonized outputs. Additionally, a cross-attention map consistency loss is applied to align the cross-attention map of the video denoising with that of the depth denoising, further facilitating spatial alignment. Extensive experiments on the TikTok and NTU120 datasets demonstrate superior performance, significantly surpassing existing methods in terms of video FVD and depth accuracy.
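To make the parameter-sharing design concrete, below is a minimal PyTorch sketch of one dual-modal block. All module names (`DualModalBlock`, `modality_embed`, `cross_modal_attn`) and shapes are illustrative assumptions, not the paper's implementation: a learned modality embedding conditions the shared weights on the denoising target, and a cross-modal attention layer lets each modality's tokens attend to the other's.

```python
# Sketch of the parameter-sharing idea: one denoiser processes both
# modalities, a learned modality embedding selects the denoising target,
# and cross-modal attention exchanges information between the two streams.
# Module names and shapes are illustrative assumptions only.
import torch
import torch.nn as nn

class DualModalBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.modality_embed = nn.Embedding(2, dim)  # 0 = video, 1 = depth
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_modal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor,
                modality: torch.Tensor) -> torch.Tensor:
        # x:        (B, N, dim) latent tokens of the current modality
        # other:    (B, N, dim) latent tokens of the other modality
        # modality: (B,) long tensor holding the modality label
        x = x + self.modality_embed(modality).unsqueeze(1)  # guide the target
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Queries from this modality attend to keys/values of the other, so
        # video and depth are denoised by shared weights but with mutually
        # informed representations.
        x = x + self.cross_modal_attn(h, other, other, need_weights=False)[0]
        return x
```

The same block (same parameters) is run once with the video tokens as `x` and once with the depth tokens as `x`, with only the modality label swapped.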
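The two alignment losses can likewise be sketched under simplifying assumptions. Here the feature "motion field" is approximated by frame-to-frame feature differences; the paper may derive it differently, so the functions below (`motion_consistency_loss`, `attn_map_consistency_loss`) are hypothetical illustrations of penalizing disagreement between the video and depth streams.

```python
# Hedged sketches of the two alignment losses. The "motion field" of a
# feature stream is approximated by temporal feature differences, an
# assumption made for illustration only.
import torch
import torch.nn.functional as F

def motion_consistency_loss(video_feat: torch.Tensor,
                            depth_feat: torch.Tensor) -> torch.Tensor:
    # video_feat, depth_feat: (B, T, C, H, W) intermediate U-Net features
    video_motion = video_feat[:, 1:] - video_feat[:, :-1]  # (B, T-1, C, H, W)
    depth_motion = depth_feat[:, 1:] - depth_feat[:, :-1]
    # Normalize so the loss compares motion patterns rather than magnitudes.
    video_motion = F.normalize(video_motion.flatten(2), dim=-1)
    depth_motion = F.normalize(depth_motion.flatten(2), dim=-1)
    return F.mse_loss(video_motion, depth_motion)

def attn_map_consistency_loss(video_attn: torch.Tensor,
                              depth_attn: torch.Tensor) -> torch.Tensor:
    # video_attn, depth_attn: (B, heads, Q, K) cross-attention maps taken
    # from corresponding layers of the video and depth denoising passes.
    return F.mse_loss(video_attn, depth_attn)
```

Both terms push the two denoising passes toward spatially aligned outputs: the first matches how features move across frames, the second matches where each pass attends.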