3D pose estimation from a sparse set of views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline: they first detect 2D keypoints in each view and then associate these detections across views to triangulate the 3D pose. Existing methods rely solely on pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet reconciling these constraints across multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than as a set of pairwise associations. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy that substantially reduces the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
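To make the notion of cycle consistency concrete, the following minimal sketch checks whether pairwise associations between three views agree with one another. It is an illustration only and not the COMPOSE formulation: the function name `cycle_violations` and the dictionary representation of matches (detection index in one view mapped to detection index in another) are assumptions introduced here.

```python
def cycle_violations(m_ab, m_bc, m_ac):
    """Return detections in view A whose A->B->C path disagrees
    with the direct A->C association.

    Each argument is a dict mapping a detection index in the first
    view to a detection index in the second view (hypothetical format).
    """
    violations = []
    for i, j in m_ab.items():
        k_via_b = m_bc.get(j)   # where the composed path A->B->C ends
        k_direct = m_ac.get(i)  # where the direct A->C match ends
        # Compare only when both the composed and the direct match exist.
        if k_via_b is not None and k_direct is not None and k_via_b != k_direct:
            violations.append(i)
    return sorted(violations)


# Three people seen in three views, with all pairwise matches in agreement.
identity = {0: 0, 1: 1, 2: 2}
consistent = cycle_violations(identity, identity, identity)

# One spurious A->C association swaps persons 1 and 2.
spurious = cycle_violations(identity, identity, {0: 0, 1: 2, 2: 1})

print(consistent)  # []
print(spurious)    # [1, 2]
```

A single wrong pairwise match already breaks the cycle for two detections. This is the kind of error propagation that becomes hard to reconcile when consistency is enforced only as a soft constraint over pairs.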