Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in screen space. In this paper, we are the first to cast this task as text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous yet intrinsically linked modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between the two modalities, promoting consistency and yielding a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains in textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting a new state of the art for this task. Code, models, and data are available on our \href{https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/}{project page}.
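To make the two components described above more concrete, the sketch below illustrates, in PyTorch, a lightweight linear map from the concatenated human and camera latents to a framing latent, together with a single guidance-style step that nudges the latents toward a target framing latent. This is only a hedged illustration: the latent dimensions, the class \texttt{FramingTransform}, and the function \texttt{guidance\_step} are hypothetical names introduced here, and the actual auxiliary sampling procedure used in the paper may differ.
\begin{verbatim}
# Minimal sketch (not the authors' implementation): a linear map from human
# and camera latents to a framing latent, plus one guidance-style step that
# pulls the latents toward a target framing latent. All names and latent
# dimensions are hypothetical.
import torch
import torch.nn as nn

D_HUMAN, D_CAMERA, D_FRAMING = 256, 64, 128  # assumed latent sizes


class FramingTransform(nn.Module):
    """Linear map from concatenated human/camera latents to a framing latent."""

    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(D_HUMAN + D_CAMERA, D_FRAMING, bias=False)

    def forward(self, z_human: torch.Tensor, z_camera: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([z_human, z_camera], dim=-1))


def guidance_step(z_human, z_camera, z_framing_target, transform, step_size=0.1):
    """One gradient step steering the latents toward a target framing latent.

    This only mimics, in spirit, using the auxiliary framing modality to
    steer joint generation; the paper's sampling procedure may differ.
    """
    z_human = z_human.detach().requires_grad_(True)
    z_camera = z_camera.detach().requires_grad_(True)
    loss = nn.functional.mse_loss(transform(z_human, z_camera), z_framing_target)
    g_h, g_c = torch.autograd.grad(loss, (z_human, z_camera))
    return z_human - step_size * g_h, z_camera - step_size * g_c


if __name__ == "__main__":
    transform = FramingTransform()
    z_h, z_c = torch.randn(1, D_HUMAN), torch.randn(1, D_CAMERA)
    z_f_target = torch.randn(1, D_FRAMING)  # e.g., a framing latent from an encoder
    z_h, z_c = guidance_step(z_h, z_c, z_f_target, transform)
    print(z_h.shape, z_c.shape)
\end{verbatim}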