PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

Robotic grasping is a fundamental robot capability that defines how robots interact with objects. Despite substantial progress, generalizing grasping to counter-intuitive or long-tailed scenarios, such as objects with uncommon materials or shapes, remains a challenge. In contrast, humans readily apply intuitive physics to grasp skillfully and adjust their grasps efficiently, even for objects they have never seen before. This work explores infusing such physical commonsense reasoning into robotic manipulation. We introduce PhyGrasp, a large multimodal model that takes inputs from two modalities, natural language and 3D point clouds, connected through a bridge module. The language modality provides robust reasoning about how diverse physical properties affect grasping, while the 3D modality captures object shapes and parts. With these two capabilities, PhyGrasp can accurately assess the physical properties of object parts and determine optimal grasping poses. Additionally, the model's language comprehension lets it interpret human instructions and generate grasping poses that align with human preferences. To train PhyGrasp, we construct PhyPartNet, a dataset of 195K object instances with varying physical properties and human preferences, alongside their corresponding language descriptions. Extensive experiments in simulation and on real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about a 10% improvement in success rate over GraspNet. Project page: https://sites.google.com/view/phygrasp
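
The abstract does not specify how the bridge module combines the two modalities. As a rough illustration only, the sketch below assumes one common fusion pattern: a global language embedding is concatenated onto each per-point feature, projected through a small dense layer, and mapped to a per-point grasp score. Every name here (`fuse_and_score`, `make_layer`, the feature sizes, the random weights) is hypothetical and is not taken from PhyGrasp.

```python
import math
import random

def linear(vec, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def make_layer(n_in, n_out, rng):
    """Randomly initialised layer; a stand-in for trained parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

def fuse_and_score(point_feats, lang_embed, bridge, head):
    """Return one score in (0, 1) per point (hypothetical bridge design)."""
    scores = []
    for feat in point_feats:
        # Concatenate the point feature with the language embedding, then project.
        fused = relu(linear(feat + lang_embed, *bridge))
        logit = linear(fused, *head)[0]
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return scores

rng = random.Random(0)
# Assumed toy sizes: 6 points with 4-dim features, 3-dim language embedding.
point_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
lang_embed = [rng.gauss(0, 1) for _ in range(3)]
bridge = make_layer(4 + 3, 5, rng)
head = make_layer(5, 1, rng)

scores = fuse_and_score(point_feats, lang_embed, bridge, head)
best_point = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setup `best_point` is simply the index of the highest-scoring point. A real system would use learned encoders for both modalities and would predict full grasp poses rather than a single scalar per point.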
