We propose a sparse and privacy-enhanced representation for Human Pose Estimation (HPE). Given a perspective camera, we use a proprietary motion vector sensor (MVS) to extract an edge image and a two-directional motion vector image at each time frame. Both the edge and motion vector images are sparse and contain much less information, thereby enhancing privacy. We argue that edge information is essential for HPE and that motion vectors complement edge information during fast movements. To process the proposed sparse representation efficiently, we propose a fusion network that leverages recent advances in sparse convolution, a technique typically used for 3D voxels; the network achieves about a 13x speed-up and a 96% reduction in FLOPs. Using the proprietary MVS, we collect an in-house edge and motion vector dataset covering 16 types of actions performed by 40 users. Our method outperforms single-modality baselines that use only edge images or only motion vector images. Finally, we validate the privacy-enhancing quality of our sparse representation through face recognition on CelebA (a large face dataset) and a user study on our in-house dataset.
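To illustrate why sparsity can translate into a large FLOP reduction, the following is a minimal sketch that counts multiply-accumulates for a dense 3x3 convolution and for a submanifold-style sparse convolution evaluated only at active pixels. It is not the fusion network described above: the synthetic rectangle outline, the 64x64 resolution, the channel counts (3 input channels standing in for one edge channel plus two motion vector channels), and the function names `dense_conv_flops` and `sparse_conv_flops` are all assumptions made for illustration.

```python
import numpy as np


def dense_conv_flops(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of a dense k x k convolution over every pixel."""
    return h * w * c_in * c_out * k * k


def sparse_conv_flops(active_mask, c_in, c_out, k=3):
    """Multiply-accumulates of a submanifold-style sparse convolution.

    Outputs are computed only at active sites, and only active
    neighbours inside the k x k window contribute to each output.
    """
    h, w = active_mask.shape
    pad = k // 2
    padded = np.pad(active_mask.astype(np.int64), pad)  # zero padding
    # Count active neighbours (centre included) in each k x k window.
    neighbours = np.zeros((h, w), dtype=np.int64)
    for dy in range(k):
        for dx in range(k):
            neighbours += padded[dy:dy + h, dx:dx + w]
    # Keep only the windows centred on an active site.
    return int((neighbours * active_mask).sum()) * c_in * c_out


# Hypothetical sparse "edge image": the outline of a rectangle.
mask = np.zeros((64, 64), dtype=bool)
mask[16, 16:48] = True
mask[47, 16:48] = True
mask[16:48, 16] = True
mask[16:48, 47] = True

# Assumed channel counts, chosen only for this illustration.
dense = dense_conv_flops(64, 64, c_in=3, c_out=16)
sparse = sparse_conv_flops(mask, c_in=3, c_out=16)
reduction = 1.0 - sparse / dense

print(f"active pixels: {mask.mean():.1%}")
print(f"dense MACs:  {dense}")
print(f"sparse MACs: {sparse}")
print(f"MAC reduction for this layer: {reduction:.1%}")
```

The count covers a single layer on a toy input, so the printed reduction should not be compared directly with the figures reported above; it only shows that the cost of a sparse layer scales with the number of active pixels rather than with the image area.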