We present a novel language-driven ordering alignment method for ordinal classification. Labels in ordinal classification carry additional ordering relations, which makes models prone to overfitting when they rely solely on training data. Recent developments in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language by converting the original task into a vision-language alignment task. To this end, we propose L2RCLIP, which exploits language priors from two perspectives. First, we introduce a complementary prompt tuning technique called RankFormer, designed to enhance the ordering relation of the original rank prompts. RankFormer employs token-level attention with residual-style prompt blending in the word embedding space. Second, to further incorporate language priors, we revisit the approximate bound optimization of the vanilla cross-entropy loss and restructure it within the cross-modal embedding space. Building on this, we propose a cross-modal ordinal pairwise loss that refines the CLIP feature space so that texts and images maintain both semantic alignment and ordering alignment. Extensive experiments on three ordinal classification tasks, namely facial age estimation, historical color image (HCI) classification, and aesthetic assessment, demonstrate the promising performance of L2RCLIP. The code is available at https://github.com/raywang335/L2RCLIP.
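To make the idea of casting ordinal classification as vision-language alignment concrete, the following is a minimal sketch and not the L2RCLIP loss itself. It scores one image embedding against one text embedding per rank by cosine similarity, then compares plain cross-entropy with a distance-weighted variant in which ranks farther from the true label contribute more to the softmax denominator. The function names, the temperature of 0.07, and the weight 1 + |k - y| are assumptions introduced here for illustration only; the exact formulation is given in the paper and its repository.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_feat, text_feats, temperature=0.07):
    """Cosine similarity between one image and each rank prompt, divided by a temperature.

    The temperature value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_feat)
    return [
        sum(a * b for a, b in zip(img, l2_normalize(t))) / temperature
        for t in text_feats
    ]


def cross_entropy(logits, label):
    """Standard softmax cross-entropy for a single sample (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def distance_weighted_cross_entropy(logits, label):
    """Illustrative ordinal variant (hypothetical weighting).

    Each rank k is weighted by 1 + |k - label| inside the softmax denominator,
    so confusing the image with a distant rank costs more than confusing it
    with a neighbouring rank.
    """
    m = max(logits)
    weighted_sum = sum(
        (1.0 + abs(k - label)) * math.exp(z - m) for k, z in enumerate(logits)
    )
    return m + math.log(weighted_sum) - logits[label]


if __name__ == "__main__":
    # Toy 2-D embeddings for three ordered ranks and one image.
    text_feats = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
    image_feat = [0.9, 0.1]
    logits = cosine_logits(image_feat, text_feats)
    predicted_rank = max(range(len(logits)), key=lambda k: logits[k])
    print("predicted rank:", predicted_rank)
    print("plain CE:", cross_entropy(logits, 0))
    print("distance-weighted CE:", distance_weighted_cross_entropy(logits, 0))
```

Because every weight is at least 1, the weighted variant is never smaller than plain cross-entropy, and the gap grows as probability mass moves toward ranks far from the label. This is one simple way to encode ordering in a cross-modal objective.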