Multimodal Large Language Models (MLLMs) have made significant progress in tasks such as image captioning and question answering. However, while these models can generate realistic captions, they often struggle to provide precise instructions, particularly when localizing and disambiguating objects in complex 3D environments. This capability is critical as MLLMs become more integrated with collaborative robotic systems. In scenarios where a target object is surrounded by similar objects (distractors), robots must deliver clear, spatially aware instructions to guide humans effectively. We refer to this challenge as contextual object localization and disambiguation, which imposes stricter constraints than conventional 3D dense captioning, especially in ensuring target exclusivity. In response, we propose simple yet effective techniques to enhance the model's ability to localize and disambiguate target objects. Our approach not only achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity, but also demonstrates improved 3D spatial understanding as measured by a 3D visual grounding model.