Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that decouple object perception from grasp prediction, leading to limited cross-modal fusion, redundant computation, and poor generalization to occluded or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs in a shared representation space for robust semantic alignment and improved generalization. To strengthen target discrimination under occlusion and low texture, we make more effective use of depth information through the Depth-guided Geometric Module (DGGM), which converts depth into explicit geometric priors and injects them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration, which adaptively balances the contributions of multi-layer features to yield more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, together with evaluations in simulation and on real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in human-centric settings.
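To make the DGGM idea concrete, the following is a minimal, hypothetical sketch of how a depth map could be turned into a geometric prior and injected into attention as an additive bias, as the abstract describes. The function name, the bias form (`-lam * |d_i - d_j|`, favoring tokens at similar depth), and the parameter `lam` are illustrative assumptions, not the paper's actual formulation; the point is that an additive pre-softmax bias adds no extra matrix multiplications.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_biased_attention(q, k, v, depth, lam=1.0):
    """Scaled dot-product attention with a depth-derived geometric bias.

    q, k, v : (N, d) token features
    depth   : (N,) per-token depth values (e.g. pooled from the depth map)
    The bias -lam * |d_i - d_j| is an assumed prior: tokens at similar
    depth attend to each other more strongly. It is added to the logits
    before softmax, so the attention cost is unchanged.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                           # (N, N) logits
    bias = -lam * np.abs(depth[:, None] - depth[None, :])   # geometric prior
    attn = softmax(scores + bias, axis=-1)
    return attn @ v, attn
```

With degenerate (all-zero) queries and keys, the attention pattern is driven purely by the depth prior, which makes the effect easy to verify: a token attends more to same-depth tokens than to distant-depth ones.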