Inferring a 3D human pose from a single 2D image or from 2D keypoints is inherently challenging because of a fundamental ambiguity: multiple 3D poses can correspond to the same 2D representation. 3D data, while invaluable for resolving this ambiguity, is expensive to acquire and requires an intricate setup, which often restricts its use to controlled lab environments. We improve the performance of monocular human pose estimation models by fine-tuning them on multiview data. We propose a novel loss function, multiview consistency, which makes it possible to add training data that has only 2D supervision. This loss enforces that the 3D pose inferred from one view aligns with the 3D pose inferred from another view under a similarity transformation. Our consistency loss substantially improves performance when fine-tuning without any 3D data. Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements from adding more views. Thus, domain-specific data can be acquired by capturing activities with off-the-shelf cameras, eliminating the need for elaborate calibration procedures. This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective way to customize models for specific applications. The dataset used, featuring additional views, will be made publicly available.
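To make the idea of consistency under a similarity transformation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each view yields a predicted pose as a `(J, 3)` array of joint positions, aligns one prediction to the other with a least-squares similarity transform (scale, rotation, translation, computed with the standard Procrustes/Umeyama procedure), and reports the mean per-joint distance that remains. The function names `similarity_align` and `multiview_consistency_loss` and the choice of mean Euclidean distance as the residual are assumptions made for illustration; the abstract does not specify the exact formulation.

```python
import numpy as np


def similarity_align(source, target):
    """Least-squares similarity alignment of source (J, 3) onto target (J, 3).

    Returns the source pose after applying the best scale, rotation, and
    translation (Procrustes/Umeyama solution). Hypothetical helper.
    """
    mu_s = source.mean(axis=0)
    mu_t = target.mean(axis=0)
    s0 = source - mu_s  # centered source joints
    t0 = target - mu_t  # centered target joints

    # 3x3 cross-covariance between the centered point sets.
    H = s0.T @ t0
    U, S, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Optimal isotropic scale for the chosen rotation.
    scale = (S * np.diag(D)).sum() / (s0 ** 2).sum()

    return scale * (s0 @ R.T) + mu_t


def multiview_consistency_loss(pose_a, pose_b):
    """Mean per-joint distance between pose_b and pose_a aligned onto pose_b.

    Assumed form of the consistency term: zero when the two predictions
    differ only by a similarity transformation.
    """
    aligned = similarity_align(pose_a, pose_b)
    return float(np.linalg.norm(aligned - pose_b, axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pose_view_a = rng.normal(size=(17, 3))  # 17 joints, illustrative only

    # Simulate a second view: 90-degree rotation about the vertical axis,
    # plus an arbitrary scale and translation.
    rot_y_90 = np.array([[0.0, 0.0, 1.0],
                         [0.0, 1.0, 0.0],
                         [-1.0, 0.0, 0.0]])
    pose_view_b = 2.5 * pose_view_a @ rot_y_90.T + np.array([0.3, -1.0, 4.0])

    print("consistent views:", multiview_consistency_loss(pose_view_a, pose_view_b))

    noisy_b = pose_view_b + rng.normal(scale=0.1, size=pose_view_b.shape)
    print("inconsistent views:", multiview_consistency_loss(pose_view_a, noisy_b))
```

Because the alignment absorbs rotation, scale, and translation, a term of this kind needs no camera calibration, which matches the abstract's claim about off-the-shelf cameras. For actual fine-tuning the same computation would have to be written in a differentiable framework so that gradients flow through the SVD; this NumPy version only evaluates the value.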