Vision-State Fusion: Improving Deep Neural Networks for Autonomous Robotics

Vision-based deep learning perception plays a paramount role in robotics, enabling solutions to many challenging scenarios, such as acrobatic maneuvers of autonomous unmanned aerial vehicles (UAVs) and robot-assisted high-precision surgery. Control-oriented end-to-end perception approaches, which directly output control variables for the robot, commonly take advantage of the robot's state estimation as an auxiliary input. In mediated approaches, where intermediate outputs are estimated and fed to a lower-level controller, the robot's state is commonly used as an input only for egocentric tasks, which estimate physical properties of the robot itself. In this work, we propose, for the first time to the best of our knowledge, to apply a similar approach to non-egocentric mediated tasks, where the estimated outputs refer to an external subject. We show that our general methodology improves the regression performance of deep convolutional neural networks (CNNs) on a broad class of non-egocentric 3D pose estimation problems, at minimal computational cost. Across three highly different use cases, ranging from grasping with a robotic arm to following a human subject with a pocket-sized UAV, our approach consistently improves the R\textsuperscript{2} regression metric, by up to +0.51, compared to the corresponding stateless baselines. Finally, we validate the in-field performance of a closed-loop autonomous cm-scale UAV on the human pose estimation task. Our results show a significant reduction in the mean absolute error of our stateful CNN, 24\% on average, compared to a state-of-the-art stateless counterpart.
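For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how R\textsuperscript{2} and the mean absolute error are conventionally computed for a regression output. It uses the standard textbook definitions and is not taken from the authors' evaluation code. The function names and the toy numbers (`targets`, `stateless_pred`, `stateful_pred`) are hypothetical and chosen only for illustration; they are not results from this work.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data (NOT from the paper): one regressed pose variable.
targets = [1.0, 2.0, 3.0, 4.0]
stateless_pred = [2.0, 2.0, 3.0, 3.0]  # made-up stateless baseline output
stateful_pred = [1.0, 2.0, 3.0, 3.0]   # made-up stateful model output

r2_gain = r2_score(targets, stateful_pred) - r2_score(targets, stateless_pred)
mae_stateless = mean_absolute_error(targets, stateless_pred)
mae_stateful = mean_absolute_error(targets, stateful_pred)
mae_reduction = (mae_stateless - mae_stateful) / mae_stateless

print(f"R2 gain: {r2_gain:+.2f}, relative MAE reduction: {mae_reduction:.0%}")
```

An R\textsuperscript{2} of 1 indicates perfect predictions, while 0 corresponds to always predicting the mean of the targets, so a gain in R\textsuperscript{2} is read as an absolute difference between two models evaluated on the same test set.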
