Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotic imitation learning. However, existing methods face two critical limitations: (1) the lack of an effective mechanism to inject multi-view information about the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object depiction, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency while precisely controlling 6-DoF object transformations. To compensate for the scarcity of HOI data and to leverage existing datasets, we further design a training curriculum that progressively enhances the model's capabilities and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry while maintaining smooth motion and object manipulation.
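To make the RCM conditioning concrete, the sketch below illustrates one plausible way a Relative Coordinate Map could be rasterized from a 3D object under a 6-DoF pose: each visible surface point is colored by its normalized object-space coordinate, so the same point keeps the same value across views and poses. The abstract does not specify this formulation; the function name `rcm_from_points`, the normalization to [0,1]^3, the pinhole projection with intrinsics `K`, and the z-buffered splatting are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rcm_from_points(points_obj, R, t, K, image_hw=(256, 256)):
    """Rasterize normalized object coordinates under a 6-DoF pose (R, t).

    points_obj : (N, 3) object-space points (e.g., sampled from the 3D input)
    R, t       : rotation (3, 3) and translation (3,) mapping object -> camera
    K          : (3, 3) pinhole camera intrinsics
    Returns an (H, W, 3) map; assumed form, for illustration only.
    """
    h, w = image_hw
    # Normalize object coordinates to [0, 1]^3 so each surface point gets a
    # pose-invariant "color"; this is what makes the map useful for consistency.
    mins, maxs = points_obj.min(0), points_obj.max(0)
    coords = (points_obj - mins) / (maxs - mins + 1e-8)

    # Apply the 6-DoF transform and project with a pinhole model.
    pts_cam = points_obj @ R.T + t
    uv = pts_cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]

    # Z-buffered point splatting: the nearest point wins at each pixel.
    rcm = np.zeros((h, w, 3), dtype=np.float32)
    depth = np.full((h, w), np.inf)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts_cam[:, 2] > 0)
    for i in np.flatnonzero(valid):
        if pts_cam[i, 2] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = pts_cam[i, 2]
            rcm[v[i], u[i]] = coords[i]
    return rcm
```

Under this reading, caching such maps across frames would let the generator condition on a representation that encodes both the object's geometry and its per-frame 6-DoF transformation.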