Humans grasp objects effortlessly, whereas multi-fingered robots remain far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured by a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, comprising 90 unseen objects of various sizes from five geometric categories, each with a metric-scale 3D mesh. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms state-of-the-art grasping baselines by 23% and 34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
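To make the sampling step concrete, the following is a minimal sketch of how a flow-matching model can turn Gaussian noise into a grasp vector by integrating a velocity field from t = 0 to t = 1. It is an illustration under stated assumptions, not the released HUG implementation: the 6D rotation encoding, the 45-value MANO articulation vector, the Euler integrator, and the names `toy_velocity`, `sample_grasps`, and `split_grasp` are all hypothetical. The toy velocity field stands in for a learned network conditioned on RGB-D features and pulls every sample toward one fixed target, so it does not reproduce the diversity a trained model would provide.

```python
import numpy as np

# Hypothetical grasp layout (an assumption, not the paper's specification):
# 3 wrist translation + 6 wrist rotation (continuous 6D form) + 45 MANO pose values.
TRANS_DIM, ROT_DIM, POSE_DIM = 3, 6, 45
GRASP_DIM = TRANS_DIM + ROT_DIM + POSE_DIM


def toy_velocity(x, t, target):
    """Stand-in for a learned network v(x, t | RGB-D features).

    For a straight-line path toward a single target point, the exact
    velocity field is (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_grasps(velocity_fn, num_samples, num_steps, rng):
    """Integrate dx/dt = v(x, t) from noise (t=0) to grasps (t=1) with Euler steps."""
    x = rng.standard_normal((num_samples, GRASP_DIM))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        x = x + dt * velocity_fn(x, t)
    return x


def split_grasp(g):
    """Split flat grasp vectors into (translation, rotation, hand pose) parts."""
    return (
        g[..., :TRANS_DIM],
        g[..., TRANS_DIM:TRANS_DIM + ROT_DIM],
        g[..., TRANS_DIM + ROT_DIM:],
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.zeros(GRASP_DIM)
    target[:TRANS_DIM] = [0.1, 0.0, 0.4]  # made-up wrist position in metres
    grasps = sample_grasps(lambda x, t: toy_velocity(x, t, target), 4, 10, rng)
    translation, rotation, hand_pose = split_grasp(grasps)
    print(translation.shape, rotation.shape, hand_pose.shape)
```

In a real system the velocity function would be a neural network that takes the fused RGB and depth features as conditioning, and different noise draws would then land on different plausible grasps for the same object.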