The demand for retrieving complex scene data in autonomous driving is growing, especially as passenger vehicles become capable of navigating urban settings, where long-tail scenarios must be addressed. Meanwhile, existing two-dimensional image retrieval methods suffer from problems in scene retrieval, such as a lack of global feature representation and subpar text retrieval ability. To address these issues, we propose \textbf{BEV-CLIP}, the first multimodal Bird's-Eye View (BEV) retrieval methodology that uses descriptive text as input to retrieve corresponding scenes. This methodology applies the semantic feature extraction abilities of a large language model (LLM) to enable zero-shot retrieval of extensive text descriptions, and incorporates semi-structured information from a knowledge graph to enrich the semantic content and variety of the language embedding. Our experiments achieve 87.66% accuracy in text-to-BEV feature retrieval on the NuScenes dataset. The cases demonstrated in our paper further show that our retrieval method is effective in identifying certain long-tail corner-case scenes.
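To make the retrieval step concrete, the sketch below illustrates the kind of CLIP-style cross-modal matching the abstract describes: a text query embedded by an LLM is compared against pooled per-scene BEV feature embeddings by cosine similarity, and the most similar scenes are returned. This is a minimal illustrative sketch under our own assumptions, not the paper's implementation; the function name, the NumPy formulation, and the assumption of pre-pooled, fixed-dimension embeddings are ours.

\begin{verbatim}
import numpy as np

def retrieve_scenes(text_emb, bev_embs, k=5):
    """Rank scenes by cosine similarity between a text query embedding
    (e.g., from an LLM text encoder) and pooled BEV feature embeddings.

    text_emb: (d,) query embedding; bev_embs: (N, d), one row per scene.
    Returns the indices of the top-k most similar scenes.
    """
    t = text_emb / np.linalg.norm(text_emb)
    b = bev_embs / np.linalg.norm(bev_embs, axis=1, keepdims=True)
    sims = b @ t                  # cosine similarity per scene
    return np.argsort(-sims)[:k]  # most similar scenes first
\end{verbatim}

In the full system, the text embedding would come from the LLM branch (with knowledge-graph-enhanced prompts) and the BEV embeddings from the perception backbone; the similarity ranking shown here is the generic retrieval primitive shared by CLIP-style methods.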