In this work, we present 3DAxiesPrompts (3DAP), a new visual prompting method designed to unlock the capabilities of GPT-4V on 3D spatial tasks. Our investigation shows that, while GPT-4V is proficient at discerning the positions and interrelations of 2D entities with current visual prompting techniques, its ability to handle 3D spatial tasks has yet to be explored. Our approach constructs a 3D coordinate system tailored to 3D imagery, complete with annotated scale information. Given images overlaid with the 3DAP visual prompt as input, GPT-4V can ascertain spatial positioning information for the given 3D target image with a high degree of precision. Through experiments, we identify three tasks that can be completed stably with the 3DAP method: 2D-to-3D point reconstruction, 2D-to-3D point matching, and 3D object detection. We perform experiments on our proposed dataset, 3DAP-Data, and the results validate the efficacy of 3DAP-enhanced GPT-4V inputs, marking a significant stride in 3D spatial task execution.
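As an illustration of what a 3D coordinate system with annotated scale might look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes a standard pinhole camera with known intrinsics `K` and pose `(R, t)`, none of which are specified in the abstract, and computes where labelled tick marks along the X, Y, and Z axes would land in the image. The function names (`project_points`, `axis_ticks`, `build_axes_prompt`) and the default axis length and tick spacing are hypothetical choices made for this example.

```python
import numpy as np


def project_points(points_3d, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates (pinhole model)."""
    points_3d = np.asarray(points_3d, dtype=float)
    cam = points_3d @ R.T + t        # world frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide


def axis_ticks(length, step):
    """Return (label, 3D position) pairs for ticks along the X, Y, Z axes."""
    values = np.arange(step, length + 1e-9, step)
    ticks = []
    for name, direction in zip("XYZ", np.eye(3)):
        for v in values:
            ticks.append((f"{name}={v:g}", direction * v))
    return ticks


def build_axes_prompt(K, R, t, length=3.0, step=1.0):
    """Compute pixel positions of the origin and of every labelled tick.

    Assumption: the axes are anchored at the world origin; the returned
    pixel positions could then be drawn onto the image with any 2D
    drawing library before the image is sent to the model.
    """
    ticks = axis_ticks(length, step)
    labels = [label for label, _ in ticks]
    positions = np.array([pos for _, pos in ticks])
    pixels = project_points(positions, K, R, t)
    origin = project_points([[0.0, 0.0, 0.0]], K, R, t)[0]
    return origin, list(zip(labels, pixels))
```

Calling `build_axes_prompt(K, R, t)` returns the pixel location of the origin and a list of `(label, pixel)` pairs such as `"X=1"` with its image position. In this sketch the scale annotations are plain text labels; how the actual 3DAP prompt renders its axes and scale is not described in the abstract.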