Inference-Time Scaling for Joint Audio-Video Generation

Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.

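To make the multi-verifier selection setting concrete, the following is a minimal sketch of best-of-N selection over several verifiers whose scores live on different scales. It is not the ARW algorithm described in the abstract: ARW is stated to use learnable parameters optimized online, whereas this sketch substitutes plain standard-score normalization with running statistics (Welford's method) as a simple stand-in for variance calibration. The names `RunningStats` and `select_best`, and the shape of the verifier functions, are hypothetical and chosen only for illustration.

```python
import math
from typing import Any, Callable, Dict, List, Sequence


class RunningStats:
    """Online mean and variance via Welford's method (assumed stand-in, not ARW)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Fall back to 1.0 when the variance is undefined or zero,
        # so that normalization never divides by zero.
        if self.n < 2:
            return 1.0
        return math.sqrt(self.m2 / (self.n - 1)) or 1.0


def select_best(
    candidates: Sequence[Any],
    verifiers: Dict[str, Callable[[Any], float]],
) -> int:
    """Return the index of the candidate with the highest aggregated reward.

    Each verifier's raw score is standardized with statistics gathered from
    the candidates themselves, so no prior knowledge of the reward
    distributions is needed. The aggregate is an unweighted sum of the
    standardized scores (an assumption made for this sketch).
    """
    stats = {name: RunningStats() for name in verifiers}
    raw: List[Dict[str, float]] = []
    for cand in candidates:
        scores = {name: fn(cand) for name, fn in verifiers.items()}
        for name, value in scores.items():
            stats[name].update(value)
        raw.append(scores)

    def aggregate(scores: Dict[str, float]) -> float:
        return sum(
            (scores[name] - stats[name].mean) / stats[name].std
            for name in scores
        )

    totals = [aggregate(scores) for scores in raw]
    return max(range(len(totals)), key=totals.__getitem__)
```

With toy candidates scored by one verifier on a 0 to 100 scale and another on a 0 to 1 scale, a naive sum of raw scores is dominated by the large-scale verifier, while the standardized sum lets both objectives influence the choice. This is the kind of imbalance that a calibrated aggregation is meant to address.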