In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications such as e-commerce demonstrations, short-video production, and interactive entertainment. However, existing approaches cannot accommodate all of these conditions simultaneously. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To address data scarcity, we develop a Decoupled-Then-Joint Training strategy that combines a multi-stage training process with model merging to efficiently exploit heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across diverse multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.