Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robot imitation learning. However, existing methods face two critical limitations: (1) they lack an effective mechanism for injecting multi-view information about the object into the model, which leads to poor cross-view consistency, and (2) they rely heavily on fine-grained hand mesh annotations to model interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object depictions, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCMs) as a universal representation to maintain the object's geometric consistency while precisely controlling its 6-DoF transformations. To compensate for the scarcity of HOI datasets and make use of existing datasets, we further design a training curriculum that progressively enhances the model's capabilities and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry while maintaining smooth motion and object manipulation.