Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a \textit{Global Semantic Anchor} ensures stylistic stability, while a surgical \textit{Token-Level Affective Adapter} modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
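To make the Dual-Branch Injection concrete, the following is a minimal numpy sketch of the conditioning step the abstract describes: a pooled global semantic anchor broadcast over all tokens for stylistic stability, plus a lightweight token-level adapter that maps the per-token Valence-Arousal trajectory into the hidden space and adds it as an element-wise residual. All names, shapes, and the linear adapter form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16  # T latent music tokens, hidden dimension D (assumed shapes)
tokens = rng.standard_normal((T, D))       # latent tokens of the music generator
global_anchor = rng.standard_normal((D,))  # pooled global semantic embedding
va_traj = rng.standard_normal((T, 2))      # per-token (valence, arousal) from the VLM sensor

# Token-Level Affective Adapter (hypothetical): a small linear map from
# (valence, arousal) to D, applied as a direct element-wise residual.
W_va = rng.standard_normal((2, D)) * 0.02  # lightweight adapter weights
residual = va_traj @ W_va                  # (T, D) local affective offsets

# Global Semantic Anchor: broadcast-added to every token; the token-level
# residual then modulates local tension without any dense attention.
conditioned = tokens + global_anchor + residual

assert conditioned.shape == (T, D)
```

Because both branches are simple additions, the overhead is one (2, D) matrix per adapter and one broadcast, which is consistent with the abstract's claim of negligible computational cost relative to dense cross-attention or architectural cloning.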