Interacting with real-world cluttered scenes poses several challenges to robotic agents, which must understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object-retrieval strategies. Existing solutions typically handle simplified scenarios and focus on predicting pairwise object relationships after an initial object detection phase, but they often overlook the global context or struggle with redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin-picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The results establish our approach as the new state of the art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at https://paolotron.github.io/d3g.github.io.