Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.