ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Human-object interaction (HOI) video generation has attracted increasing attention because of its promising applications in digital humans, e-commerce, advertising, and robot imitation learning. However, existing methods face two critical limitations: (1) they lack an effective mechanism for injecting multi-view information about the object into the model, which leads to poor cross-view consistency, and (2) they rely heavily on fine-grained hand mesh annotations to model interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object depiction, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that uses Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency while precisely controlling its 6-DoF transformations. To compensate for the scarcity of HOI datasets and make use of existing datasets, we further design a training curriculum that progressively enhances the model's capabilities and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry while maintaining smooth motion and object manipulation.
