We present SimXR, a method for controlling a simulated avatar from the information (headset pose and camera images) available from AR/VR headsets. Because of the challenging viewpoint of head-mounted cameras, the human body is often clipped out of view, which makes traditional image-based egocentric pose estimation difficult. Headset poses, on the other hand, provide valuable information about overall body motion but lack fine-grained detail about the hands and feet. To combine headset poses with camera input, we control a humanoid to track headset movement while analyzing the input images to decide body movement. When body parts are visible, the movements of the hands and feet are guided by the images; when they are not, the laws of physics guide the controller to generate plausible motion. We design an end-to-end method that does not rely on any intermediate representation and learns to map directly from images and headset poses to humanoid control signals. To train our method, we also propose a large-scale synthetic dataset created using camera configurations compatible with a commercially available VR headset (Quest 2), and we show promising results on real-world captures. To demonstrate the applicability of our framework, we also test it on an AR headset with a forward-facing camera.