We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.
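The central change above — injecting reference features through a parallel conditioning pathway rather than concatenating reference latents into the diffusion sequence — can be sketched in toy form. This is a minimal illustration under assumed names (`denoise_chunk`, `ref_features` are hypothetical stand-ins, not the actual implementation); the point is that each chunk stays a fixed size, so the KV cache grows by exactly one chunk per step:

```python
import numpy as np

CHUNK = 4   # frames per autoregressive chunk (fixed, required for KV caching)
D = 8       # latent channel dim (toy size; real models are far larger)

rng = np.random.default_rng(0)

def denoise_chunk(latents, cond, kv_cache):
    """Toy stand-in for one causal denoising pass over a chunk.

    latents  : (CHUNK, D) noisy video latents for this chunk
    cond     : (D,) reference features fed through a parallel pathway
               (hypothetical: e.g. added per frame, adapter-style),
               NOT concatenated into the latent sequence
    kv_cache : list of past chunk outputs serving as causal context
    """
    context = np.mean(kv_cache, axis=(0, 1)) if kv_cache else 0.0
    out = latents + cond + context   # placeholder for attention + MLP
    kv_cache.append(out)             # cache grows by exactly one chunk
    return out

# Reference image encoded once, kept outside the latent sequence.
ref_features = rng.standard_normal(D)

kv_cache = []
for step in range(3):                # stream three chunks causally
    noisy = rng.standard_normal((CHUNK, D))
    chunk = denoise_chunk(noisy, ref_features, kv_cache)
    assert chunk.shape == (CHUNK, D)   # chunk size never changes
```

By contrast, batch VACE's approach of prepending reference frames to the latent sequence would make the first chunk longer than `CHUNK`, invalidating the fixed-size chunking and cached keys/values that the streaming pipeline depends on.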