Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically either freeze the pretrained weights or tune them only partially, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Although some recent works have demonstrated the impact of fine-tuning methods on OoDD, there remains significant potential for performance improvement. We investigate the limitations of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps better utilize pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. On ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches that leverage pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
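The core idea of regularizing image-text distances on the hypersphere can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact objective: it combines a CLIP-style symmetric contrastive term with a hypothetical alignment regularizer that shrinks the gap between paired image and text embeddings; the function name, `reg_weight`, and `temperature` values are assumptions for illustration.

```python
import numpy as np

def cross_modal_alignment_loss(image_emb, text_emb, reg_weight=0.1, temperature=0.07):
    """Sketch: contrastive loss + alignment regularizer on the unit hypersphere.

    image_emb, text_emb: (n, d) arrays of paired ID embeddings.
    The regularizer penalizes 1 - cos(image_i, text_i), pulling matched
    semantics from the two modalities closer together (hypothetical form).
    """
    # Project embeddings onto the unit hypersphere, as in CLIP
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    n = img.shape[0]

    # Symmetric InfoNCE term: cross-entropy with diagonal (paired) targets
    logits = img @ txt.T / temperature

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    contrastive = 0.5 * (ce(logits) + ce(logits.T))

    # Alignment regularizer: mean (1 - cosine similarity) of paired embeddings;
    # it is zero exactly when each image embedding coincides with its text pair
    alignment = (1.0 - (img * txt).sum(axis=1)).mean()

    return contrastive + reg_weight * alignment
```

Because the regularizer vanishes only when paired embeddings coincide on the sphere, minimizing it directly reduces the modality gap within ID data, which is the effect the abstract attributes to the proposed objective.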