Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements, all of which complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models (VLMs) for fine-grained attribute extraction with large language models (LLMs) for symbolic reasoning. Given an image and a free-form referring expression, LLM-RG uses an LLM to extract the relevant object types and attributes, detects candidate regions, generates rich visual descriptors for each candidate with a VLM, and combines these descriptors with spatial metadata into a natural-language prompt over which an LLM performs chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM- and VLM-based baselines. Our ablations further show that adding 3D spatial cues improves grounding. These results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
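To make the pipeline structure concrete, the following is a minimal sketch of the stages described above. The `llm`, `vlm_describe`, and `detect` callables are hypothetical placeholders for a text-completion model, a region-captioning VLM, and a candidate detector, respectively; the `Candidate` fields and prompt wording are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    box: tuple            # (x, y, w, h) in image coordinates
    descriptor: str = ""  # VLM-generated attribute description
    depth_m: float = 0.0  # optional 3D spatial cue (e.g., estimated distance)

def ground_reference(
    image,
    expression: str,
    llm: Callable[[str], str],               # assumed text-completion interface
    vlm_describe: Callable[..., str],        # assumed region-captioning interface
    detect: Callable[..., List[Candidate]],  # assumed candidate-region detector
) -> Candidate:
    # 1. LLM parses the referring expression into target object types / attributes.
    query = llm(f"List the object type and attributes referred to in: '{expression}'")

    # 2. Detect candidate regions for the extracted object types.
    candidates = detect(image, query)

    # 3. VLM produces a rich visual descriptor for each candidate crop.
    for c in candidates:
        c.descriptor = vlm_describe(image, c.box)

    # 4. Combine descriptors and spatial metadata into one prompt and ask the
    #    LLM to reason step by step about which candidate matches the expression.
    lines = [
        f"Object {i}: {c.descriptor}; box={c.box}; distance~{c.depth_m:.1f} m"
        for i, c in enumerate(candidates)
    ]
    prompt = (
        f"Referring expression: '{expression}'\n"
        + "\n".join(lines)
        + "\nThink step by step, then answer with the index of the referred object."
    )
    answer = llm(prompt)

    # 5. Parse the chosen index and return that candidate's bounding box
    #    (naive digit extraction; a real system would parse more robustly).
    idx = int("".join(ch for ch in answer if ch.isdigit()) or 0)
    return candidates[min(idx, len(candidates) - 1)]
```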