Event cameras offer several advantages for monocular egocentric 3D human pose estimation from head-mounted devices, including millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties but suffer from 3D estimation accuracy that is too low for many applications (e.g., immersive VR/AR). The cause is that their designs are not fully tailored to event streams (e.g., to their asynchronous and continuous nature), which makes them highly sensitive to self-occlusions and introduces temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics: it evolves latent states and predicts continuous changes in 3D joint positions associated with the observed events. These predicted changes are fused with direct 3D human pose predictions, yielding stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real time at 80 Hz on a single workstation and sets a new state of the art on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7×. See our project page for the source code and trained models.
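For readers unfamiliar with the accuracy metric named above, the following is a minimal sketch of MPJPE (mean per-joint position error) in its standard form: the Euclidean distance between each predicted and ground-truth 3D joint, averaged over joints. The function names, the three-joint toy pose, and the error values are hypothetical and are not taken from the paper. Reading "up to 19%" as a relative error reduction is an assumption, and any alignment step a benchmark may apply before computing the error is omitted.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: sequences of (x, y, z) joint positions in the same units
    (commonly millimetres). Returns the average Euclidean distance.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (0.19 means 19% lower).

    Assumption: this is how a percentage improvement in MPJPE is read here.
    """
    return (baseline_error - new_error) / baseline_error


# Hypothetical toy pose with three joints, in millimetres.
gt_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 100.0), (0.0, 100.0, 0.0)]
pred_pose = [(3.0, 4.0, 0.0), (0.0, 0.0, 100.0), (0.0, 100.0, 12.0)]

toy_error = mpjpe(pred_pose, gt_pose)  # per-joint distances are 5, 0, and 12 mm
toy_gain = relative_improvement(100.0, 81.0)  # hypothetical errors, not reported results
```

Lower MPJPE is better, so a positive value from `relative_improvement` means the new method has the smaller error.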