Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Prior works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce DISPLAY, a framework driven by Sparse Motion Guidance composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at \url{https://mumuwei.github.io/DISPLAY/}.
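For illustration, one plausible formalization of the per-frame Sparse Motion Guidance (our own notation, assuming 2D image coordinates; the abstract does not fix these details) is
\[
c_t = \bigl(\, p_t^{\mathrm{L}},\; p_t^{\mathrm{R}},\; b_t \,\bigr), \qquad p_t^{\mathrm{L}},\, p_t^{\mathrm{R}} \in \mathbb{R}^2, \quad b_t = \bigl(x_t^{\min},\, y_t^{\min},\, x_t^{\max},\, y_t^{\max}\bigr) \in \mathbb{R}^4,
\]
where $p_t^{\mathrm{L}}$ and $p_t^{\mathrm{R}}$ denote the left and right wrist joint coordinates at frame $t$, and $b_t$ is the shape-agnostic object bounding box. The guidance for a $T$-frame video would then be the sequence $\{c_t\}_{t=1}^{T}$, which is far sparser than dense conditions such as depth or full-body pose maps.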