Specifying nuanced and compelling camera motion remains a major hurdle for non-expert creators using generative tools, creating an ``expressive gap'' where generic text prompts fail to capture cinematic vision. To address this, we present a novel zero-shot diffusion-based system that transfers personalized camera motion from a single reference video onto a user-provided static image. Our technical contribution is an intuitive interaction paradigm that bypasses the need for 3D data, predefined trajectories, or complex graphical interfaces. The core pipeline builds on a text-to-video diffusion model and employs a two-phase strategy: 1) a multi-concept learning method that uses LoRA layers and an orthogonality loss to distinctly capture spatial-temporal motion characteristics and scene features, and 2) a homography-based refinement strategy that enhances the temporal and spatial alignment of the generated video. Extensive evaluation demonstrates the efficacy of our method. In a comparative study with 72 participants, our system was significantly preferred over prior work for both motion accuracy (90.45\%) and scene preservation (70.31\%). A second study confirmed that our interface significantly improves usability and creative control for video direction. Our work contributes a robust technical solution and a novel human-centered design, significantly expanding cinematic video editing for diverse users.
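To illustrate the multi-concept learning idea mentioned above, the following is a minimal sketch, not the authors' implementation, of an orthogonality penalty that encourages two LoRA adapters (one for camera motion, one for scene appearance) to occupy distinct subspaces. The class and function names (LoRALayer, orthogonality_loss, lambda_orth) are hypothetical placeholders; only the general LoRA structure and the idea of penalizing subspace overlap are assumed.

```python
# Hypothetical sketch: orthogonality loss between two LoRA adapters so that
# motion and scene concepts are learned in disentangled low-rank subspaces.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer(nn.Module):
    """Low-rank adapter: adds scale * (x @ A^T @ B^T) to a frozen layer's output."""

    def __init__(self, dim_in: int, dim_out: int, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim_in) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(dim_out, rank))         # up-projection
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.A.t() @ self.B.t())


def orthogonality_loss(motion_lora: LoRALayer, scene_lora: LoRALayer) -> torch.Tensor:
    """Penalize overlap between the row spaces of the two down-projection
    matrices, pushing the motion and scene concepts toward orthogonality."""
    a_motion = F.normalize(motion_lora.A, dim=1)
    a_scene = F.normalize(scene_lora.A, dim=1)
    cross = a_motion @ a_scene.t()        # pairwise cosine similarities (rank x rank)
    return (cross ** 2).mean()            # zero when the subspaces are orthogonal


# Usage sketch: add the penalty to the diffusion training objective.
motion_lora = LoRALayer(dim_in=320, dim_out=320)
scene_lora = LoRALayer(dim_in=320, dim_out=320)
reg = orthogonality_loss(motion_lora, scene_lora)
# total_loss = diffusion_loss + lambda_orth * reg   # lambda_orth: assumed hyperparameter
```

In this sketch the penalty is computed only on the down-projection matrices; how the actual method pairs or weights the adapters across the text-to-video backbone is a design detail not specified in the abstract.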