The ability to forecast human-environment collisions from egocentric observations is vital for collision avoidance in applications such as VR, AR, and wearable assistive robotics. In this work, we introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured by body-mounted cameras. Solving this problem requires a generalizable perception system that can classify which human body joints will collide and estimate a collision region heatmap to localize collisions in the environment. To achieve this, we propose COPILOT, a transformer-based model that performs collision prediction and localization simultaneously, accumulating information across multi-view inputs through a novel 4D space-time-viewpoint attention mechanism. To train our model and enable future research on this task, we develop a synthetic data generation framework that produces egocentric videos of virtual humans moving and colliding within diverse 3D environments. This framework is then used to establish a large-scale dataset consisting of 8.6M egocentric RGBD frames. Extensive experiments show that COPILOT generalizes to unseen synthetic as well as real-world scenes. We further demonstrate that COPILOT's outputs are useful for downstream collision avoidance through simple closed-loop control. Please visit our project webpage at https://sites.google.com/stanford.edu/copilot.
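To make the idea of attending jointly over space, time, and viewpoint concrete, the following is a minimal NumPy sketch, not the COPILOT architecture itself. It assumes that "4D" refers to the two image axes plus time plus camera view, flattens all tokens into one sequence, and applies single-head self-attention so that every token can attend to every other token across views and frames. The tensor sizes, the random weights, the number of joints, the pooling choices, and all names (`space_time_view_attention`, `w_cls`, `w_map`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def space_time_view_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over all (view, time, row, col) tokens.

    tokens: array of shape (V, T, H, W, D).
    Returns the attended tokens (same leading shape) and the attention matrix.
    """
    V, T, H, W, D = tokens.shape
    flat = tokens.reshape(V * T * H * W, D)        # one joint token sequence
    q, k, v = flat @ w_q, flat @ w_k, flat @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # rows sum to 1
    out = weights @ v
    return out.reshape(V, T, H, W, -1), weights


# Hypothetical sizes: 2 views, 3 frames, a 4x4 patch grid, 8-dim features.
rng = np.random.default_rng(0)
V, T, H, W, D = 2, 3, 4, 4, 8
tokens = rng.standard_normal((V, T, H, W, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
attended, weights = space_time_view_attention(tokens, w_q, w_k, w_v)

# Two illustrative output heads (assumed, not taken from the paper):
num_joints = 5                                     # hypothetical joint count
w_cls = rng.standard_normal((D, num_joints)) * 0.1
w_map = rng.standard_normal((D, 1)) * 0.1

# Per-joint collision probabilities from globally pooled features.
joint_logits = attended.mean(axis=(0, 1, 2, 3)) @ w_cls
joint_probs = 1.0 / (1.0 + np.exp(-joint_logits))

# One collision-region heatmap per view, read from the last time step and
# normalized over the patch grid (this normalization is an assumption).
heat_logits = (attended[:, -1] @ w_map)[..., 0]    # shape (V, H, W)
heatmaps = softmax(heat_logits.reshape(V, -1)).reshape(V, H, W)
```

With random weights the outputs carry no meaning; the sketch only shows how a single attention operation can mix information across views and frames before separate heads produce joint-level predictions and per-view heatmaps. Full joint attention scales quadratically with the number of tokens, so a practical model would likely use multiple heads and some factorized or otherwise more efficient variant.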