We propose a novel unmanned aerial vehicle (UAV)-assisted creative capture system that leverages diffusion models to interpret high-level natural-language prompts and automatically generate optimal flight trajectories for cinematic video recording. Instead of manually piloting the drone, the user simply describes the desired shot (e.g., "orbit around me slowly from the right and reveal the background waterfall"). Our system encodes the prompt together with an initial visual snapshot from the onboard camera, and a diffusion model samples plausible spatio-temporal motion plans that satisfy both the scene geometry and the shot semantics. The generated trajectory is then executed autonomously by the UAV to record smooth, repeatable video clips that match the prompt. A user evaluation using NASA-TLX showed a significantly lower overall workload with our interface (M = 21.6) than with a traditional remote controller (M = 58.1), demonstrating a substantial reduction in perceived effort. Mental demand (M = 11.5 vs. 60.5) and frustration (M = 14.0 vs. 54.5) were also markedly lower for our system, confirming clear usability advantages of autonomous text-driven flight control. This project demonstrates a new interaction paradigm, text-to-cinema flight, in which diffusion models act as a "creative operator" that converts story intentions directly into aerial motion.
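The conditional sampling pipeline described above (prompt + snapshot encoded into a conditioning vector, then a diffusion model denoising a noisy trajectory into a motion plan) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the encoder, the denoiser, the waypoint format `(x, y, z, yaw)`, and all constants are hypothetical stand-ins; in the real system the denoiser would be a learned noise-prediction network.

```python
import numpy as np

# Hypothetical sketch of prompt-conditioned trajectory diffusion.
# A trajectory is a HORIZON x 4 array of (x, y, z, yaw) waypoints.

T_STEPS = 50      # number of diffusion steps (assumed)
HORIZON = 32      # waypoints per trajectory (assumed)

def embed_condition(prompt: str, snapshot: np.ndarray) -> np.ndarray:
    """Stand-in encoder: a real system would use a learned text/image
    encoder to produce this conditioning vector."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return np.concatenate([rng.standard_normal(16), snapshot.ravel()[:16]])

def denoiser(x_t: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    """Stand-in for the learned noise predictor eps_theta(x_t, t, cond).
    Toy behavior: pull the sample toward a smooth orbit whose radius
    depends on the conditioning vector."""
    radius = 2.0 + 0.1 * cond[:4].sum()
    angles = np.linspace(0.0, np.pi, HORIZON)
    target = np.stack([radius * np.cos(angles),
                       radius * np.sin(angles),
                       np.full(HORIZON, 1.5),   # constant altitude (m)
                       angles], axis=1)
    return x_t - target  # "predicted noise" = sample minus clean target

def sample_trajectory(prompt: str, snapshot: np.ndarray, seed: int = 0) -> np.ndarray:
    """DDPM-style ancestral sampling: start from Gaussian noise and
    iteratively denoise, conditioned on the prompt + snapshot embedding."""
    rng = np.random.default_rng(seed)
    cond = embed_condition(prompt, snapshot)
    betas = np.linspace(1e-4, 0.02, T_STEPS)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((HORIZON, 4))  # x_T ~ N(0, I)
    for t in reversed(range(T_STEPS)):
        eps = denoiser(x, t, cond)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x  # waypoint sequence handed to the flight controller

snapshot = np.zeros((8, 8, 3))  # placeholder onboard-camera frame
traj = sample_trajectory("orbit around me slowly from the right", snapshot)
print(traj.shape)  # (32, 4)
```

The key design point the sketch mirrors is that the trajectory itself is the diffusion variable: scene geometry and shot semantics enter only through the conditioning vector, so the same sampler can realize many different shot descriptions.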