The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Neither approach matches natural human communication or conveys the rich semantics of manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem with referring expression segmentation, a prompt-based approach that provides more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr, a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP's strong vision-language representations to segment regions from referring expressions, while its ``U-shaped'' encoder-decoder architecture produces predictions with sharper boundaries and finer structures. Furthermore, we propose a new pipeline that integrates CLIPUNetr into UIBVS and apply it to control robots in real-world environments. In experiments, our method improves boundary and structure measurements by an average of 120% and successfully assists real-world UIBVS control in an unstructured manipulation environment.