Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time-consuming. Generative alternatives typically rely on computationally intensive methods such as diffusion models, which can be slow at 4D inference. We reformulate 3D animation as a field-prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates the color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction and manual or physics-based simulation, allowing both novice and expert users to animate volumetric scenes with minimal effort on a consumer device, even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can help further democratize VFX. Code available at https://obsphera.github.io/promptvfx/.
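To make the field-prediction idea concrete, the sketch below shows what an LLM-generated animation function might look like for the example prompt above. It is a minimal, hypothetical illustration (the function name, signature, and effect logic are assumptions, not the paper's actual API): a time-parameterized function maps Gaussian centers to per-Gaussian updates of position, color, and opacity, which a renderer could apply each frame.

```python
import numpy as np

def flow_field(xyz: np.ndarray, t: float):
    """Hypothetical LLM-generated 4D flow field for the prompt
    "make the vase glow orange, then explode".

    xyz : (N, 3) array of 3D Gaussian centers.
    t   : time in seconds.
    Returns per-Gaussian position offsets, RGB colors, and opacities.
    """
    center = xyz.mean(axis=0)          # scene centroid
    outward = xyz - center             # explosion direction per Gaussian
    n = len(xyz)
    if t < 1.0:
        # Phase 1: hold positions, ramp the color toward orange.
        dxyz = np.zeros_like(xyz)
        rgb = np.tile([1.0, 0.5 * t, 0.0], (n, 1))
        opacity = np.ones(n)
    else:
        # Phase 2: push Gaussians outward and fade them out.
        dxyz = outward * (t - 1.0) * 2.0
        rgb = np.tile([1.0, 0.5, 0.0], (n, 1))
        opacity = np.clip(1.0 - (t - 1.0), 0.0, 1.0) * np.ones(n)
    return dxyz, rgb, opacity

# Apply the field to a toy set of Gaussian centers at two timestamps.
gaussians = np.random.default_rng(0).normal(size=(100, 3))
for t in (0.5, 1.5):
    dxyz, rgb, opacity = flow_field(gaussians, t)
    updated_positions = gaussians + dxyz
```

Because the generated function only evaluates cheap per-Gaussian arithmetic, it can run every frame on consumer hardware, which is what lets the method avoid diffusion-style 4D inference.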