JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are designed mainly for short, isolated clips and lack mechanisms for maintaining narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that provides both high-fidelity audio generation and natural transitions. Its core is a Transformer-based generative model trained with a flow-matching objective in two stages: pretraining on large-scale text-audio corpora to establish robust musical priors, followed by adaptation to the video domain with dual text-visual conditioning for precise cross-modal alignment. To achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This mechanism offers a versatile toolkit of transition styles, including a generative transition method, and employs a Large Language Model (LLM) agent that acts as a director, selecting the most appropriate transition for each narrative shift. To rigorously assess this task, we propose the LVS Benchmark, which comprises a curated dataset and novel evaluation metrics for holistic and transition-aware assessment. Extensive experiments on this benchmark show that JenBridge significantly outperforms existing methods on both objective and subjective metrics, particularly in transition naturalness and overall narrative coherence. JenBridge represents a significant step toward fully automated, professional-quality video soundtracking.
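
To make the idea of a "transition style" concrete, the following is a minimal sketch of one conventional option, an equal-power crossfade between the soundtrack segments of two adjacent scenes. The abstract does not enumerate the styles in the toolkit, so the choice of crossfade, the function name `equal_power_crossfade`, and the representation of audio as plain lists of float samples are all assumptions made for illustration. This is not the generative transition method described above.

```python
import math


def equal_power_crossfade(outgoing, incoming, overlap):
    """Blend the tail of `outgoing` into the head of `incoming`.

    Hypothetical helper for illustration only. Both inputs are sequences
    of mono float samples; `overlap` is the number of samples shared by
    the two segments during the transition.
    """
    if overlap < 0 or overlap > min(len(outgoing), len(incoming)):
        raise ValueError("overlap must fit inside both segments")
    if overlap == 0:
        # No shared region: a hard cut, i.e. plain concatenation.
        return list(outgoing) + list(incoming)

    start = len(outgoing) - overlap
    head = list(outgoing[:start])      # untouched part of the first scene
    tail = list(incoming[overlap:])    # untouched part of the second scene

    blended = []
    for i in range(overlap):
        # Phase sweeps from just above 0 to just below pi/2, so the gains
        # satisfy fade_out**2 + fade_in**2 == 1 (constant power).
        phase = (i + 0.5) / overlap * (math.pi / 2)
        fade_out = math.cos(phase)
        fade_in = math.sin(phase)
        blended.append(outgoing[start + i] * fade_out + incoming[i] * fade_in)

    return head + blended + tail
```

In a system like the one described, a selector (here, the LLM agent acting as director) would choose between a routine such as this and other styles for each scene boundary; that selection logic is not sketched here because the abstract gives no detail on it.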
