How can a robot efficiently extract a desired object from a shelf when it is fully occluded by other objects? Prior work proposes geometric approaches to this problem but does not consider object semantics. Shelves in pharmacies, restaurant kitchens, and grocery stores are often organized such that semantically similar objects are placed close to one another. Can large language models (LLMs) serve as semantic knowledge sources to accelerate robotic mechanical search in semantically arranged environments? With Semantic Spatial Search on Shelves (S^4), we use LLMs to generate affinity matrices whose entries correspond to the semantic likelihood of physical proximity between objects. We derive semantic spatial distributions by synthesizing semantics with learned geometric constraints. S^4 incorporates Optical Character Recognition (OCR) and semantic refinement with predictions from ViLD, an open-vocabulary object detection model. Simulation experiments suggest that semantic spatial search reduces search time relative to pure spatial search by an average of 24% across three domains: pharmacy, kitchen, and office shelves. A manually collected dataset of 100 semantic scenes suggests that OCR and semantic refinement improve object detection accuracy by 35%. Lastly, physical experiments on a pharmacy shelf suggest a 47.1% improvement over pure spatial search. Supplementary material can be found at https://sites.google.com/view/s4-rss/home.
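To make the idea of combining semantic affinities with a geometric distribution concrete, the following is a minimal sketch under stated assumptions, not the S^4 implementation. The affinity values, the Gaussian proximity kernel, the `length_scale` parameter, the multiplicative combination, and all function names are hypothetical choices made for illustration; the abstract only states that LLM-generated affinities are synthesized with learned geometric constraints.

```python
import math

# Hypothetical affinity scores in [0, 1]. In S^4 these would be generated by
# an LLM; the numbers here are made up for illustration.
AFFINITY = {
    ("ibuprofen", "aspirin"): 0.9,
    ("ibuprofen", "shampoo"): 0.1,
}


def affinity(a, b):
    """Symmetric lookup, with an assumed neutral default of 0.5."""
    return AFFINITY.get((a, b), AFFINITY.get((b, a), 0.5))


def semantic_spatial_distribution(target, visible, geometric_prior,
                                  length_scale=0.2):
    """Reweight a geometric prior over shelf positions by semantic affinity.

    target: name of the occluded object being searched for.
    visible: list of (name, x) pairs for detected objects, x along the shelf.
    geometric_prior: list of (x, probability) cells, e.g. from a learned
        occupancy model (assumed to be given).
    Returns a list of probabilities, one per cell, summing to 1.
    """
    scores = []
    for x, p_geo in geometric_prior:
        # Semantic weight: affinity to each visible object, discounted by
        # distance with a Gaussian kernel (an assumed form).
        semantic = 0.0
        for name, x_obj in visible:
            closeness = math.exp(-((x - x_obj) ** 2) / (2 * length_scale ** 2))
            semantic += affinity(target, name) * closeness
        scores.append(p_geo * semantic)

    total = sum(scores)
    if total == 0:
        # No semantic signal: fall back to the normalized geometric prior.
        total = sum(p for _, p in geometric_prior)
        return [p / total for _, p in geometric_prior]
    return [s / total for s in scores]


# Uniform geometric prior over 11 cells spanning a 1 m shelf.
cells = [(i / 10, 1 / 11) for i in range(11)]
visible_objects = [("aspirin", 0.2), ("shampoo", 0.8)]
dist = semantic_spatial_distribution("ibuprofen", visible_objects, cells)
best_cell = dist.index(max(dist))
print(f"Most likely position: x = {cells[best_cell][0]:.1f} m")
```

In this toy scene the probability mass shifts toward the aspirin, the visible object with the highest affinity to the target, which is the behavior a semantically informed search would exploit when choosing where to look first.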