Out-of-distribution (OOD) detection is essential in autonomous driving for determining when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings and therefore offer little scope for human interaction. With the rise of large foundation models, multimodal inputs make it possible to use human language as a latent representation, enabling language-defined OOD detection. In this paper, we use the cosine similarity between image and text representations encoded by the multimodal model CLIP as a new representation that improves the transparency and controllability of the latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders, whose latent representations are not interpretable from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation outperforms the traditional representation of the vision encoder and improves detection performance when combined with standard representations.
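As a rough illustration of the language-based representation described above, the following minimal sketch computes the cosine similarity between one image embedding and a set of text embeddings, and treats the resulting vector as the latent representation. It is not the authors' implementation: the random vectors stand in for CLIP outputs, and the prompt strings, the embedding size of 512, and the function name `language_latent` are all assumptions made for the example.

```python
import numpy as np


def language_latent(image_emb, text_embs):
    """Return the cosine similarity between one image embedding and each text embedding.

    image_emb: array of shape (d,)
    text_embs: array of shape (k, d), one row per text prompt
    Result: array of shape (k,), one similarity per prompt.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_embs = np.asarray(text_embs, dtype=float)
    # Normalize to unit length so the dot product equals the cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img


# Hypothetical prompts; in practice a user would write these to describe
# the conditions they care about.
prompts = ["a clear road", "a pedestrian on the road", "heavy fog"]

# Random stand-ins for the embeddings a CLIP image/text encoder would produce.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(len(prompts), 512))
image_emb = rng.normal(size=512)

latent = language_latent(image_emb, text_embs)
for prompt, sim in zip(prompts, latent):
    print(f"{prompt!r}: {sim:+.3f}")
```

Each entry of `latent` corresponds to a human-written prompt, which is what makes this representation readable and adjustable by a user, in contrast to the raw output of a vision encoder.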