The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is an ideal application scenario for such methods because its labeling procedure is subjective and expensive. In this work, we propose a unified and flexible two-phase \textbf{C}LIP-based \textbf{S}emi-supervised \textbf{K}nowledge \textbf{D}istillation paradigm, namely \textbf{\textit{CSKD}}. Specifically, in the first phase we integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well trained, can seamlessly serve as a better visual aesthetic learner for both the student and the teacher. In the second phase, the unlabeled data is also used in semi-supervised IAA learning to further boost the performance of the student model when deployed in latency-sensitive production scenarios. By analyzing attention distance and entropy before and after feature alignment, we observe that the feature collapse issue is alleviated, which in turn demonstrates the necessity of feature alignment over training directly on the CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.
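The abstract does not state the exact form of the feature alignment loss, so the following is only an illustrative sketch of the first-phase idea. It assumes a cosine-distance objective between the features of the trainable visual encoder and those of the frozen CLIP image encoder, computed on unlabeled images. The function names and the choice of cosine distance are assumptions made for illustration, not the authors' implementation. The sketch also assumes both feature sets already share the same dimension, which in practice may require a projection layer.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def feature_alignment_loss(
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
) -> float:
    """Mean cosine distance (1 - cosine similarity) over a batch.

    student_feats: features from the visual encoder being trained.
    teacher_feats: features from the frozen CLIP image encoder (assumed
                   precomputed on the same unlabeled images).
    The loss is 0 when every pair points in the same direction and
    2 when every pair points in opposite directions.
    """
    if len(student_feats) != len(teacher_feats) or len(student_feats) == 0:
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    )
    return total / len(student_feats)
```

No labels appear anywhere in this objective, which is why the first phase can draw on a multi-source unlabeled dataset. In a real training loop the same quantity would be computed on tensors and minimized with respect to the visual encoder's parameters only, with the CLIP encoder kept fixed.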