Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: determining how N individuals could hypothetically perform the same set of tasks observed in that video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, producing physically impossible scenarios such as two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics that evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts, and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while reducing collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively.
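To make the metrics concrete, the following is a minimal sketch of how a candidate parallel plan could be scored. The `Segment`, `Slot`, and `evaluate` names and data structures are illustrative assumptions rather than the paper's actual formalisation, and the sketch covers only speed-up, task coverage, and object conflicts; spatial collisions and causal constraints would additionally require 3D layout and action-dependency information.

```python
# Illustrative sketch only: the structures below are assumptions, not the
# paper's formalisation of the N-Body Problem.
from dataclasses import dataclass

@dataclass
class Segment:
    duration: float     # seconds of the action in the source video
    objects: frozenset  # objects the action manipulates

@dataclass
class Slot:
    person: int         # which of the N individuals performs the segment
    start: float        # scheduled start time in the parallel plan
    end: float          # scheduled end time

def evaluate(segments: list[Segment], plan: dict[int, Slot]):
    """Score a non-empty plan mapping segment index -> Slot in the
    hypothetical N-person execution; unassigned segments are absent."""
    serial = sum(s.duration for s in segments)          # one-person runtime
    makespan = max(slot.end for slot in plan.values())  # parallel runtime
    speed_up = serial / makespan
    coverage = len(plan) / len(segments)                # task coverage

    # Object conflicts: two different people use the same object at once.
    conflicts = 0
    items = sorted(plan.items())
    for a, (i, u) in enumerate(items):
        for j, v in items[a + 1:]:
            overlap = u.start < v.end and v.start < u.end
            if overlap and u.person != v.person and \
               segments[i].objects & segments[j].objects:
                conflicts += 1
    return speed_up, coverage, conflicts

# Example: two independent actions run fully in parallel by two people.
segs = [Segment(30.0, frozenset({"pan"})), Segment(20.0, frozenset({"knife"}))]
plan = {0: Slot(person=0, start=0.0, end=30.0),
        1: Slot(person=1, start=0.0, end=20.0)}
print(evaluate(segs, plan))  # ~1.67x speed-up, full coverage, 0 conflicts
```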