Grasping large, flat objects (e.g., a book or a pan) is often regarded as an ungraspable task, posing significant challenges because feasible grasp poses are unreachable. Previous works leverage extrinsic dexterity, such as walls or table edges, to grasp such objects. However, they are limited to task-specific policies and lack the task planning needed to find pre-grasp conditions, which makes it difficult to adapt to varied environments and extrinsic-dexterity constraints. We therefore present DexDiff, a robust robotic manipulation method for long-horizon planning with extrinsic dexterity. Specifically, we use a vision-language model (VLM) to perceive the environmental state and generate high-level task plans, followed by a goal-conditioned action diffusion (GCAD) model that predicts the sequence of low-level actions. The GCAD model learns a low-level policy from offline data, using the cumulative reward guided by the high-level plan as its goal condition, which improves the prediction of robot actions. Experimental results demonstrate that our method not only performs ungraspable tasks effectively but also generalizes to previously unseen objects, outperforming baselines by a 47% higher success rate in simulation and enabling efficient deployment and manipulation in real-world scenarios.
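The goal-conditioned sampling idea can be illustrated with a minimal sketch: a denoiser predicts noise for an action sequence conditioned on the current observation and a cumulative-reward goal, and actions are recovered by a DDPM-style reverse process. All names, shapes, and the placeholder linear denoiser below are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, ACT_DIM, HORIZON, STEPS = 8, 4, 6, 10

# Placeholder linear "denoiser" standing in for a learned network
# (hypothetical: the real GCAD model would be a trained neural net).
W = rng.normal(scale=0.1, size=(OBS_DIM + 1 + HORIZON * ACT_DIM, HORIZON * ACT_DIM))

def predict_noise(obs, goal_return, noisy_actions):
    """Toy noise predictor eps(a | obs, goal): the goal condition is the
    cumulative reward implied by the high-level plan."""
    x = np.concatenate([obs, [goal_return], noisy_actions.ravel()])
    return (x @ W).reshape(HORIZON, ACT_DIM)

def sample_actions(obs, goal_return):
    """Simplified DDPM reverse process over an action sequence."""
    betas = np.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a = rng.standard_normal((HORIZON, ACT_DIM))  # start from pure noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(obs, goal_return, a)
        # Posterior mean step (stochastic noise term omitted for brevity).
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    return a

actions = sample_actions(rng.standard_normal(OBS_DIM), goal_return=1.0)
```

Conditioning on the desired return (rather than only on observations) lets the same diffusion policy be steered by the high-level planner at inference time.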