ROPES: Robotic Pose Estimation via Score-Based Causal Representation Learning

Causal representation learning (CRL) has emerged as a powerful unsupervised framework that (i) disentangles the latent generative factors underlying high-dimensional data and (ii) learns the cause-and-effect interactions among the disentangled variables. Despite extensive recent advances in identifiability and some practical progress, a substantial gap remains between theory and real-world practice. This paper takes a step toward closing that gap by bringing CRL to robotics, a domain that has motivated CRL. Specifically, it addresses the well-defined problem of robot pose estimation, i.e., the recovery of position and orientation from raw images, by introducing Robotic Pose Estimation via Score-Based CRL (ROPES). As an unsupervised framework, ROPES embodies the essence of interventional CRL by identifying the generative factors that are actuated. Images are generated by intrinsic and extrinsic latent factors (e.g., joint angles, arm/limb geometry, lighting, background, and camera configuration), and the objective is to disentangle and recover the controllable latent variables, i.e., those that can be directly manipulated (intervened upon) through actuation. Interventional CRL theory shows that variables that vary under interventions can be identified. In robotics, such interventions arise naturally from commanding the actuators of various joints and recording images under varied controls. Empirical evaluations in semi-synthetic manipulator experiments demonstrate that ROPES disentangles the latent generative factors with high fidelity with respect to the ground truth. Crucially, this is achieved by leveraging only distributional changes, without using any labeled data. The paper also includes a comparison with a baseline based on a recently proposed semi-supervised framework. It concludes by positioning robot pose estimation as a near-practical testbed for CRL.
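
The claim that intervened variables can be identified from distributional changes alone can be illustrated with a toy sketch. This is not the ROPES implementation. It assumes independent zero-mean Gaussian latents that are observed directly (no image encoder), and the names `gaussian_score` and `score_change_per_coordinate` are hypothetical. Under these assumptions, the score (gradient of the log-density) changes only in the coordinate whose distribution the intervention alters, so comparing scores across the two regimes reveals the actuated latent without any labels.

```python
import numpy as np


def gaussian_score(z, var):
    """Score (gradient of the log-density) of a zero-mean Gaussian
    with independent coordinates and per-coordinate variances `var`."""
    return -z / var


def score_change_per_coordinate(z, var_obs, var_int):
    """Mean squared difference between the observational and
    interventional scores, computed separately for each coordinate."""
    diff = gaussian_score(z, var_obs) - gaussian_score(z, var_int)
    return np.mean(diff ** 2, axis=0)


rng = np.random.default_rng(0)
n_latents = 4

# Assumption: unit-variance latents in the observational regime.
var_obs = np.ones(n_latents)

# Assumption: a hypothetical intervention (e.g., one actuated joint)
# changes the variance of latent 2 only.
var_int = var_obs.copy()
var_int[2] = 4.0

# Samples from the observational regime.
z = rng.normal(size=(10_000, n_latents)) * np.sqrt(var_obs)

change = score_change_per_coordinate(z, var_obs, var_int)

# Coordinates whose score changes are the intervened ones.
intervened = np.flatnonzero(change > 1e-8)
print(intervened)
```

In practice the latents are not observed and the scores must be estimated from images, so this sketch only conveys why sparse score changes carry information about which factors are actuated.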
