Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models (VLMs) for fine-grained attribute extraction with large language models (LLMs) for symbolic reasoning. Given an image and a free-form referring expression, LLM-RG uses an LLM to extract the relevant object types and attributes, detects candidate regions, generates rich visual descriptors for each candidate with a VLM, and then combines these descriptors with spatial metadata into a natural-language prompt over which an LLM performs chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM- and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
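To make the described pipeline concrete, the sketch below outlines its stages in Python. It is a minimal illustration under assumed interfaces: the objects `llm`, `vlm`, and `detector` and their methods (`complete`, `describe`, `detect`), as well as the function and field names, are hypothetical placeholders rather than the authors' implementation.

```python
# Illustrative sketch of an LLM-RG-style grounding loop.
# All interfaces here are assumptions for exposition, not the paper's API.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates
    descriptor: str = ""                    # VLM-generated attribute description
    spatial: str = ""                       # spatial metadata (e.g., image position, 3D cues)

def ground_referring_expression(image, expression: str, llm, vlm, detector):
    """Return the bounding box of the referent for a free-form expression."""
    # 1. LLM parses the expression into target object types and attributes.
    query_attrs = llm.complete(
        f"List the object type and attributes referred to in: '{expression}'"
    )

    # 2. A detector proposes candidate regions for the extracted object types.
    candidates: List[Candidate] = [
        Candidate(box=b) for b in detector.detect(image, query_attrs)
    ]

    # 3. A VLM produces a rich visual descriptor for each candidate crop,
    #    and spatial metadata is attached to it.
    for c in candidates:
        c.descriptor = vlm.describe(image, c.box)
        center = ((c.box[0] + c.box[2]) / 2, (c.box[1] + c.box[3]) / 2)
        c.spatial = f"box center at {center}"

    # 4. Descriptors and spatial metadata are serialized into a natural-language
    #    prompt; the LLM selects the referent via chain-of-thought reasoning.
    listing = "\n".join(
        f"Candidate {i}: {c.descriptor} | {c.spatial}"
        for i, c in enumerate(candidates)
    )
    answer = llm.complete(
        f"Expression: '{expression}'\n{listing}\n"
        "Think step by step, then reply with the index of the referred candidate."
    )
    return candidates[int(answer.strip())].box
```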