Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that fundamentally disentangling "appearance" from "motion" offers a more robust and scalable path. We propose FlexAM, a unified framework built upon a novel 3D control signal that represents video dynamics as a point cloud. We introduce three key enhancements to this signal: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control-signal design that balances precision against generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks, including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
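The abstract does not specify FlexAM's exact encoding; as a rough illustration of the idea, a minimal NeRF-style sketch of multi-frequency positional encoding over 3D points (treating the third coordinate as depth, with the function name and frequency count being illustrative assumptions) might look like:

```python
import numpy as np

def multi_frequency_encoding(points: np.ndarray, num_freqs: int = 6) -> np.ndarray:
    """Sinusoidal multi-frequency encoding of 3D points (x, y, depth).

    points: (N, 3) array of point-cloud coordinates.
    Returns: (N, 3 * 2 * num_freqs) features.

    Higher-frequency bands separate nearby points, which is the
    intuition behind distinguishing fine-grained motion; including
    depth as an input coordinate makes the encoding depth-aware.
    """
    freqs = 2.0 ** np.arange(num_freqs)              # (F,) octave-spaced frequencies
    scaled = points[:, :, None] * freqs              # (N, 3, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, 3, 2F)
    return enc.reshape(points.shape[0], -1)

# Example: encode 5 random 3D points into 3 * 2 * 6 = 36-dim features.
pts = np.random.rand(5, 3)
feat = multi_frequency_encoding(pts)
print(feat.shape)  # (5, 36)
```

This is only a sketch of the standard sinusoidal scheme under the assumptions stated above, not FlexAM's actual implementation.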