Ordinal regression is a fundamental problem in computer vision, typically addressed with customised, well-trained models for each specific task. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received comparatively little exploration. In this study, we first investigate CLIP's potential for ordinal regression, expecting the model to generalise across different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails at this task, since current VLMs have a well-documented limitation in encapsulating compositional concepts such as number sense. We propose a simple yet effective method, NumCLIP, to improve the quantitative understanding of VLMs. We disassemble the exact image-to-number text-matching problem into coarse classification and fine prediction stages. We discretise and phrase each numerical bin with common language concepts to better leverage the alignment already learned during CLIP's pre-training. To account for the inherently continuous nature of ordinal regression, we further propose a novel fine-grained, cross-modal, ranking-based regularisation loss designed to preserve both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvements on the historical image dating and image aesthetics assessment tasks, respectively. Code is publicly available at https://github.com/xmed-lab/NumCLIP.
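To make the two key ideas concrete, the sketch below illustrates (a) discretising a numeric target into coarse bins phrased with common language concepts, and (b) a margin-based ranking regulariser that encourages image-text similarities to fall off monotonically with ordinal distance from the ground-truth bin. This is a minimal illustration under assumed details, not the paper's exact formulation: the bin boundaries, prompt phrases (`AGE_BINS`), and the specific loss form are all hypothetical placeholders.

```python
# Hypothetical coarse bins for an age-estimation task: each numeric
# range is phrased with a common language concept (prompt text).
AGE_BINS = [(0, 12, "a photo of a child"),
            (13, 19, "a photo of a teenager"),
            (20, 39, "a photo of a young adult"),
            (40, 64, "a photo of a middle-aged person"),
            (65, 120, "a photo of an elderly person")]

def age_to_bin(age):
    """Return the index of the coarse bin containing `age`."""
    for idx, (lo, hi, _) in enumerate(AGE_BINS):
        if lo <= age <= hi:
            return idx
    raise ValueError("age out of range")

def ranking_regularizer(sims, label_idx, margin=0.05):
    """Illustrative ranking loss over one image's similarities to the
    ordered bin prompts: a bin ordinally nearer the ground-truth bin
    should score at least `margin` higher than its farther neighbour."""
    loss = 0.0
    for i in range(len(sims) - 1):
        d_i, d_j = abs(i - label_idx), abs(i + 1 - label_idx)
        if d_i < d_j:      # bin i is nearer the label: should score higher
            loss += max(0.0, sims[i + 1] - sims[i] + margin)
        elif d_i > d_j:    # bin i+1 is nearer: should score higher
            loss += max(0.0, sims[i] - sims[i + 1] + margin)
    return loss
```

For example, similarities that peak at the true bin and decay on both sides (`[0.2, 0.5, 0.9, 0.5, 0.2]` with `label_idx=2`) incur zero loss, while an inverted ordering is penalised; a coarse classifier over the phrased bins can then be paired with a fine within-bin predictor.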