Object fetching from cluttered shelves is an important capability for robots that assist humans in real-world scenarios. The task demands robotic behaviors that prioritize safety by minimizing disturbances to surrounding objects, an essential requirement that is highly challenging because of restricted motion space, limited fields of view, and complex object dynamics. In this paper, we introduce FetchBot, a sim-to-real framework designed to enable zero-shot generalizable and safety-aware object fetching from cluttered shelves in real-world settings. To address data scarcity, we propose an efficient voxel-based method for generating diverse simulated cluttered shelf scenes at scale, and we train a dynamics-aware reinforcement learning (RL) policy to generate object fetching trajectories within these scenes. This RL policy, which leverages oracle information, is subsequently distilled into a vision-based policy for real-world deployment. Because sim-to-real discrepancies stem mostly from texture variations and rarely from geometric dimensions, we use depth information estimated by full-fledged depth foundation models as the input to the vision-based policy to mitigate the sim-to-real gap. To tackle the challenge of limited views, we design a novel architecture for learning multi-view representations, allowing for comprehensive encoding of cluttered shelf scenes. This enables FetchBot to effectively minimize collisions while fetching objects from varying positions and depths, ensuring robust and safety-aware operation. Both simulation and real-robot experiments demonstrate FetchBot's superior generalization ability, particularly in handling a broad range of real-world scenarios, includ
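To make the idea of voxel-based scene generation concrete, the following is a minimal sketch and not FetchBot's actual generator, whose details the abstract does not give. It assumes a shelf discretized into a boolean occupancy grid, axis-aligned box-shaped objects resting on the shelf floor, and simple rejection sampling. All names (`generate_scene`, `region_is_free`, `mark_region`) and the size ranges are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Optional, Tuple

# A box is (x, y, z, dx, dy, dz) in voxel units; hypothetical representation.
Box = Tuple[int, int, int, int, int, int]
Grid = List[List[List[bool]]]


def make_grid(nx: int, ny: int, nz: int) -> Grid:
    """Create an empty occupancy grid indexed as grid[x][y][z]."""
    return [[[False] * nz for _ in range(ny)] for _ in range(nx)]


def region_is_free(grid: Grid, box: Box) -> bool:
    """Return True if no voxel inside the box is already occupied."""
    x, y, z, dx, dy, dz = box
    return not any(
        grid[i][j][k]
        for i in range(x, x + dx)
        for j in range(y, y + dy)
        for k in range(z, z + dz)
    )


def mark_region(grid: Grid, box: Box) -> None:
    """Mark every voxel inside the box as occupied."""
    x, y, z, dx, dy, dz = box
    for i in range(x, x + dx):
        for j in range(y, y + dy):
            for k in range(z, z + dz):
                grid[i][j][k] = True


def generate_scene(
    shelf_dims: Tuple[int, int, int] = (40, 20, 15),
    n_objects: int = 8,
    max_tries: int = 200,
    seed: Optional[int] = None,
) -> List[Box]:
    """Place up to n_objects non-overlapping boxes on the shelf floor.

    Assumed illustration only: each object is sampled with a random
    footprint and height, and a placement is accepted when the voxels
    it would cover are all free (rejection sampling).
    """
    rng = random.Random(seed)
    nx, ny, nz = shelf_dims
    grid = make_grid(nx, ny, nz)
    boxes: List[Box] = []
    for _ in range(n_objects):
        for _ in range(max_tries):
            dx, dy = rng.randint(3, 8), rng.randint(3, 8)
            dz = rng.randint(4, nz)  # height never exceeds the shelf
            x = rng.randint(0, nx - dx)
            y = rng.randint(0, ny - dy)
            candidate = (x, y, 0, dx, dy, dz)  # z = 0: object rests on the floor
            if region_is_free(grid, candidate):
                mark_region(grid, candidate)
                boxes.append(candidate)
                break
    return boxes


if __name__ == "__main__":
    for box in generate_scene(seed=0):
        print(box)
```

Because the occupancy check is a lookup over voxels rather than a physics-engine collision query, a scheme like this can reject invalid placements cheaply, which is one plausible reason a voxel representation suits scene generation at scale. A real generator would also need arbitrary meshes, orientations, and stacking, all of which this sketch omits.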