Visual-Language Alignment (VLA) has attracted considerable attention since CLIP's groundbreaking work. Although CLIP performs well, its typical direct latent feature alignment lacks clarity in both the representation and the similarity scores. In contrast, a lexical representation, a vector whose elements measure the similarity between the sample and each word in the vocabulary, is naturally sparse and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn because there is no ground-truth supervision and they suffer from false-discovery issues, and thus require complex designs to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid false discovery, we propose an overuse penalty that discourages the lexical representation from falsely and frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset, avoiding intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even larger datasets (e.g., 1.1B samples, including CC-12M). We conduct extensive experiments to analyze LexVLA. Code is available at https://github.com/Clementine24/LexVLA.
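To make the two central ideas concrete, the sketch below illustrates (a) scoring with sparse lexical vectors, where similarity is a dot product over the vocabulary, and (b) an overuse-style penalty. This is a minimal numpy sketch under an assumed FLOPs-style formulation (penalizing tokens whose average activation across a batch is high); the function names and the exact form of LexVLA's overuse penalty are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def lexical_similarity(img_lex: np.ndarray, txt_lex: np.ndarray) -> float:
    """Similarity of two lexical vectors: a dot product over the vocabulary.

    Each element of a lexical vector is the sample's affinity to one
    vocabulary word, so matching words contribute interpretable terms.
    """
    return float(img_lex @ txt_lex)

def overuse_penalty(batch_lex: np.ndarray) -> float:
    """Illustrative FLOPs-style regularizer (assumed form, not the paper's).

    batch_lex: (batch_size, vocab_size) matrix of lexical vectors.
    Tokens activated frequently across the whole batch (e.g., meaningless
    filler words) dominate the per-token mean, so squaring and summing the
    means penalizes such overused tokens more than rarely used ones.
    """
    mean_act = batch_lex.mean(axis=0)          # per-token average activation
    return float(np.sum(mean_act ** 2))        # concentrated use costs more

# A batch that always fires the same token is penalized more than one
# spreading the same total activation mass over distinct tokens.
concentrated = np.zeros((4, 8)); concentrated[:, 0] = 1.0   # all rows use token 0
spread = np.zeros((4, 8)); spread[np.arange(4), np.arange(4)] = 1.0
assert overuse_penalty(concentrated) > overuse_penalty(spread)
```

The penalty leaves rare, meaningful activations cheap while making it costly for the model to route every sample through the same few words, which is one way to counter the false-discovery failure mode described above.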