Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
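The abstract does not spell out how global-local frame swapping works, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes evenly spaced global keyframes, non-overlapping local windows, and a swap step that exchanges the latents of any frame present in both branches. The function names (`global_keyframes`, `local_windows`, `swap_shared_frames`) and the parameters `stride` and `window` are hypothetical and do not come from the paper.

```python
from typing import Any, Dict, List


def global_keyframes(num_frames: int, stride: int) -> List[int]:
    """Sparse, evenly spaced frame indices spanning the whole video (assumed scheme)."""
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)  # always cover the final frame
    return idx


def local_windows(num_frames: int, window: int) -> List[List[int]]:
    """Contiguous, non-overlapping temporal windows (assumed scheme)."""
    if num_frames <= 0 or window <= 0:
        raise ValueError("num_frames and window must be positive")
    return [list(range(start, min(start + window, num_frames)))
            for start in range(0, num_frames, window)]


def swap_shared_frames(global_latents: Dict[int, Any],
                       local_latents: List[Dict[int, Any]]) -> None:
    """Exchange, in place, the latents of frames held by both branches.

    Hypothetical stand-in for the information exchange during sampling:
    each keyframe that also lies in a local window trades its latent with
    that window's copy of the same frame.
    """
    for win in local_latents:
        for t in win:
            if t in global_latents:
                global_latents[t], win[t] = win[t], global_latents[t]
```

In a diffusion-style sampler, a step like `swap_shared_frames` would presumably run between denoising steps, so that the sparse global branch passes long-range structure to the local windows and receives short-term dynamics in return. The frequency and exact form of that exchange are assumptions here.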