Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation and fine-tuning methods often rely on the global image embedding alone, limiting their ability to capture the fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the lightweight CLIP adaptation paradigm that augments the bottleneck MLP adapter with multi-head self-attention over patch tokens, modelling inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
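The adapter design sketched below is a hedged illustration of the idea described above: a bottleneck projection of CLIP patch tokens, multi-head self-attention in the bottleneck space to mix information across patches, and an up-projection added back to the frozen CLIP features. All dimensions, the ReLU nonlinearity, and the residual connection are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MHAdapterSketch:
    """Hypothetical bottleneck adapter with multi-head self-attention
    over patch tokens. Shapes and hyperparameters are illustrative."""

    def __init__(self, d_model=768, d_bottleneck=128, n_heads=4, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.02  # small init scale, standard for adapters
        self.n_heads = n_heads
        self.W_down = rng.normal(0, s, (d_model, d_bottleneck))   # down-projection
        self.W_q = rng.normal(0, s, (d_bottleneck, d_bottleneck))
        self.W_k = rng.normal(0, s, (d_bottleneck, d_bottleneck))
        self.W_v = rng.normal(0, s, (d_bottleneck, d_bottleneck))
        self.W_up = rng.normal(0, s, (d_bottleneck, d_model))     # up-projection

    def __call__(self, x):
        # x: (n_patches, d_model) patch tokens from a frozen CLIP image encoder
        z = np.maximum(x @ self.W_down, 0)                 # bottleneck + ReLU
        n, d = z.shape
        h, dh = self.n_heads, d // self.n_heads
        # split the bottleneck dimension into heads: (h, n_patches, dh)
        q = (z @ self.W_q).reshape(n, h, dh).transpose(1, 0, 2)
        k = (z @ self.W_k).reshape(n, h, dh).transpose(1, 0, 2)
        v = (z @ self.W_v).reshape(n, h, dh).transpose(1, 0, 2)
        # scaled dot-product attention models inter-patch dependencies
        att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
        z = (att @ v).transpose(1, 0, 2).reshape(n, d)     # merge heads
        return x + z @ self.W_up                           # residual to CLIP features

adapter = MHAdapterSketch()
tokens = np.random.default_rng(1).normal(size=(50, 768))   # e.g. 50 patch tokens
out = adapter(tokens)
```

The output keeps the patch-token shape, so the adapted features can feed an ordinary classification head; only the adapter weights would be trained, keeping the parameter count small.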