Pulp Motion: Framing-aware multimodal camera and human motion generation

Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in screen space. In this paper, we are the first to cast this task as text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between the two modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains in textual alignment for both modalities. Qualitatively, our method produces more cinematographically meaningful framings, setting a new state of the art for this task. Code, models, and data are available on our \href{https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/}{project page}.
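
To make the auxiliary framing modality concrete, the following is a minimal sketch of how 3D joints could be projected into 2D screen coordinates. It assumes a plain pinhole camera with a world-to-camera rotation and translation and a single focal length; the function name `project_joints`, its parameters, and these conventions are illustrative assumptions, not the paper's actual implementation.

```python
from typing import List, Sequence, Tuple


def project_joints(
    joints_world: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    focal: float = 1.0,
) -> List[Tuple[float, float]]:
    """Project 3D joints to normalized 2D screen coordinates.

    Assumed convention (hypothetical, for illustration only):
    p_cam = rotation @ p_world + translation, with the camera
    looking down the +z axis of its own frame.
    """
    framing = []
    for p in joints_world:
        # World-to-camera transform, written out without external libraries.
        x, y, z = (
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0:
            raise ValueError("joint is behind the camera")
        # Perspective division gives the on-screen position of the joint.
        framing.append((focal * x / z, focal * y / z))
    return framing


# Usage: an identity-oriented camera whose frame places the world origin
# two units in front of it along +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 2.0)]
screen = project_joints(joints, identity, (0.0, 0.0, 2.0))
```

Applying such a projection per frame to a human motion and a camera trajectory yields a sequence of 2D joint positions, which is the kind of on-screen framing signal the abstract describes as bridging the two modalities.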
