We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Unlike prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively from video observations recorded over a long time span (e.g., a month) in a single environment. Modeling the 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired perception and motion data of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and humans, given monocular RGBD videos captured by a smartphone.