We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for parameter count, memory consumption, and inference speed. The framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for more accurate audio generation. In contrast to existing resource-heavy U-Net-based models, MDSGen employs denoising masked diffusion transformers, enabling efficient generation without relying on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and 36× faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). Our larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters. These results highlight the scalability and effectiveness of our approach.