Scaling multimodal alignment between video and audio is challenging, particularly because of limited data and the mismatch between text descriptions and frame-level video information. In this work, we address the scaling challenge in multimodal-to-audio generation by examining whether models trained on short instances can generalize to longer ones at test time. We present the multimodal hierarchical network (MMHNet), an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method with non-causal Mamba to support long-form audio generation, and it significantly improves long audio generation for durations of more than 5 minutes. We also demonstrate empirically that training on short durations and testing on long ones is possible in video-to-audio generation, without any training on the longer durations. In our experiments, the proposed method achieves strong results on long-video-to-audio benchmarks and outperforms prior work on video-to-audio tasks, whereas prior video-to-audio methods fall short when generating long durations.