MVSA-Net: Multi-View State-Action Recognition for Robust and Deployable Trajectory Generation

The learn-from-observation (LfO) paradigm is a human-inspired mode in which a robot learns to perform a task simply by watching it being performed. LfO can facilitate robot integration on factory floors by minimizing disruption and reducing tedious programming. A key component of the LfO pipeline is the transformation of depth camera frames into the corresponding task state and action pairs, which are then relayed to learning techniques such as imitation learning or inverse reinforcement learning to infer the task parameters. While several existing computer vision models analyze videos for activity recognition, SA-Net specifically targets robotic LfO from RGB-D data. However, SA-Net and many other models analyze frame data captured from a single viewpoint. Their analysis is therefore highly sensitive to occlusions of the observed task, which are frequent in deployments. An obvious way of reducing occlusions is to observe the task from multiple viewpoints simultaneously and fuse the synchronized streams in the model. Toward this, we present multi-view SA-Net (MVSA-Net), which generalizes the SA-Net model to perceive multiple viewpoints of the task activity, integrate them, and better recognize the state and action in each frame. Performance evaluations on two distinct domains establish that MVSA-Net recognizes state-action pairs under occlusion more accurately than single-view MVSA-Net and other baselines. Our ablation studies further evaluate its performance under different ambient conditions and establish the contribution of the architecture components. As such, MVSA-Net offers significantly more robust and deployable state-action trajectory generation than previous methods.
