Automatic video trailer generation is shifting from heuristic-based extraction to deep generative synthesis. Early methods relied on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots. Recent advances in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a technical review of that evolution, focusing on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models such as OpenAI's Sora and Google's Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity for User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. Drawing on recent literature, the survey proposes a taxonomy for AI-driven trailer generation in the era of foundation models and suggests that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.