Rethinking Event-based Human Pose Estimation with 3D Event Representations

Human pose estimation is a critical component in autonomous driving and parking, enhancing safety by predicting human actions. Traditional frame-based cameras and videos are commonly used, yet they become less reliable under high dynamic range or heavy motion blur. In contrast, event cameras remain robust in these challenging conditions. Predominant methods incorporate event cameras into learning frameworks by accumulating events into event frames. However, such methods tend to neglect the intrinsic asynchrony and high temporal resolution of events. This discards temporal information that is crucial for safety-critical tasks involving dynamic human activities. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). RasEPC aggregates events that fall at the same position within short temporal slices, preserving 3D attributes through statistical cues while markedly reducing memory and computational demands. DEV discretizes events into voxels and projects them onto three orthogonal planes, using decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset built to facilitate training and quantitative analysis in outdoor scenes. On the public real-world DHP19 dataset, our event point cloud technique excels at real-time prediction on mobile devices, while the decoupled event voxel method achieves the highest accuracy. Experiments show that the proposed 3D representations generalize better than traditional RGB images and event-frame techniques. Our code and dataset are available at https://github.com/MasterHow/EventPointPose.
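To make the two representations concrete, the following is a minimal NumPy sketch rather than the released implementation. It assumes events arrive as a non-empty `(N, 4)` array of `(x, y, t, p)` rows; the function names, the choice of per-point statistics (mean timestamp, summed polarity, event count), and the use of simple event counts for the voxel grid are illustrative assumptions. The decoupled event attention that fuses the three planes is not shown.

```python
import numpy as np


def _time_bins(t, num_bins):
    """Map timestamps to integer bin indices in [0, num_bins - 1]."""
    t0 = t.min()
    span = max(t.max() - t0, 1e-9)  # guard against a zero-length window
    bins = ((t - t0) / span * num_bins).astype(np.int64)
    return np.minimum(bins, num_bins - 1)


def rasterize_events(events, num_slices):
    """Hypothetical RasEPC-style aggregation.

    Events sharing the same (time slice, y, x) cell are merged into one
    point with columns (x, y, mean timestamp, summed polarity, count).
    """
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = events[:, 3].astype(np.float64)
    k = _time_bins(t, num_slices)

    keys = np.stack([k, y, x], axis=1)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # shape differs across NumPy versions

    count = np.bincount(inverse, minlength=len(uniq))
    t_sum = np.bincount(inverse, weights=t, minlength=len(uniq))
    p_sum = np.bincount(inverse, weights=p, minlength=len(uniq))
    return np.column_stack([uniq[:, 2], uniq[:, 1], t_sum / count, p_sum, count])


def decoupled_voxel_planes(events, height, width, num_bins):
    """Hypothetical DEV-style projection.

    Events are counted in a (T, H, W) voxel grid, which is then summed
    along each axis to give the xy, xt, and yt planes.
    """
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    k = _time_bins(events[:, 2].astype(np.float64), num_bins)

    voxel = np.zeros((num_bins, height, width), dtype=np.float64)
    np.add.at(voxel, (k, y, x), 1.0)  # unbuffered add handles repeated cells

    xy = voxel.sum(axis=0)  # (H, W)
    xt = voxel.sum(axis=1)  # (T, W)
    yt = voxel.sum(axis=2)  # (T, H)
    return xy, xt, yt


if __name__ == "__main__":
    demo = np.array([[1, 2, 0.0, 1], [1, 2, 0.1, 1], [3, 0, 0.9, -1]])
    print(rasterize_events(demo, num_slices=2))
    for plane in decoupled_voxel_planes(demo, height=3, width=4, num_bins=2):
        print(plane.shape)
```

In this sketch the rasterized output has one row per occupied cell, so its size scales with the number of active pixels per slice rather than with the raw event count, which is the kind of saving in memory and computation the abstract attributes to RasEPC.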
