Multi-Layered Reasoning from a Single Viewpoint for Learning See-Through Grasping

From arXiv: 23 pages, 13 figures, 2 tables. Supplementary videos are at https://bionicdl.ancorasir.com/?p=1658 and the open-source code is at https://github.com/ancorasir/SeeThruFinger

Sensory substitution enables biological systems to perceive stimuli normally obtained through another organ, an ability that is inspirational for physical agents. Multi-modal perception of intrinsic and extrinsic interactions is critical to building an intelligent robot that learns. This study presents a Vision-based See-Through Perception (VBSeeThruP) architecture that simultaneously perceives multiple intrinsic and extrinsic modalities from a single visual input in a markerless way, all packed within a soft robotic finger built on the Soft Polyhedral Network design. The architecture is generally applicable to miniature vision systems placed underneath deformable networks with a see-through design: the camera captures real-time images of the network's physical interactions induced by contact-based events, overlaid on the visual scene of the external environment, as demonstrated in the ablation study. We present VBSeeThruP's capability for learning reactive grasping without external cameras or dedicated force and torque sensors on the fingertips. Using the inpainted scene and the deformation mask, we further demonstrate the multi-modal performance of the VBSeeThruP architecture, which simultaneously achieves a range of perception tasks, including but not limited to scene inpainting, object detection, depth sensing, scene segmentation, masked deformation tracking, 6D force/torque sensing, and contact event detection, all from a single markerless sensory input provided by the in-finger vision.
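To make the two intermediate products named in the abstract (the inpainted scene and the deformation mask) more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not say how the mask is obtained or how inpainting is done, so the mask is assumed to be given as a boolean array, the frame is assumed to be grayscale, and a simple neighbour-averaging fill stands in for whatever inpainting model the paper uses. The function names `inpaint_masked` and `mask_centroid_shift` are hypothetical.

```python
import numpy as np


def inpaint_masked(frame, mask, iters=200):
    """Fill masked pixels by repeatedly averaging their 4-neighbours.

    frame: 2D grayscale image from the (assumed) in-finger camera.
    mask:  2D boolean array, True where the finger's network occludes the scene.
    This diffusion fill is a stand-in, not the paper's inpainting method.
    """
    out = frame.astype(float).copy()
    out[mask] = out[~mask].mean()  # neutral initial guess for occluded pixels
    for _ in range(iters):
        padded = np.pad(out, 1, mode="edge")
        neighbours = (
            padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]
        ) / 4.0
        out[mask] = neighbours[mask]  # only occluded pixels are updated
    return out


def mask_centroid_shift(rest_mask, contact_mask):
    """Centroid displacement (rows, cols) of the network mask between two frames.

    A crude proxy for deformation tracking: it reports how far the occluding
    structure moved in the image, not a force or torque.
    """
    rest = np.argwhere(rest_mask).mean(axis=0)
    contact = np.argwhere(contact_mask).mean(axis=0)
    return contact - rest


# Synthetic example: a uniform scene occluded by one vertical strut.
scene = np.full((8, 8), 0.5)
rest_mask = np.zeros((8, 8), dtype=bool)
rest_mask[:, 3] = True
contact_mask = np.zeros((8, 8), dtype=bool)
contact_mask[:, 5] = True  # strut pushed two columns sideways on contact

observed = scene.copy()
observed[rest_mask] = 0.0  # the strut hides the scene behind it

restored = inpaint_masked(observed, rest_mask)
shift = mask_centroid_shift(rest_mask, contact_mask)
```

In this toy setting the restored image approximates the unoccluded scene, and the centroid shift gives a coarse deformation signal. Mapping such a signal to 6D force/torque, as the abstract describes, would need a learned model that is outside the scope of this sketch.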
