Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning a compact context that adapts to new tasks while retaining satisfactory base-to-new, domain, and cross-task generalization remains a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the context words of the prompt in CK-CoOp are learnable vectors, each crafted as a linear combination of base vectors drawn from a dictionary. These base vectors consist of a non-learnable component, obtained by quantizing the weights of the token embedding layer, and a learnable component, constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting to the training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product relaxes the non-learnable restriction of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference.
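The composition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the quantization step is approximated by a few k-means iterations, the frozen and learnable dictionary components are combined by addition, one Kronecker factor is initialized to zero so that the dictionary starts at its frozen value, and all sizes and names (`quantize`, `kron_chain`, `compose_context`) are hypothetical. Random data stands in for the CLIP token embedding weights.

```python
import numpy as np

def kron_chain(mats):
    """Kronecker product of a list of small matrices, taken left to right."""
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def quantize(emb, num_atoms, iters=10, seed=0):
    """Assumed stand-in for the quantization step: plain k-means centroids."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), num_atoms, replace=False)]
    for _ in range(iters):
        # Squared distance from every embedding to every centroid.
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for k in range(num_atoms):
            members = emb[assign == k]
            if len(members):
                centers[k] = members.mean(0)
    return centers

def compose_context(coeffs, frozen_atoms, kron_factors):
    """Context vectors as linear combinations of dictionary base vectors.

    Assumption: each base vector is the sum of a frozen part and a
    learnable part built from a Kronecker product.
    """
    dictionary = frozen_atoms + kron_chain(kron_factors)
    return coeffs @ dictionary

rng = np.random.default_rng(0)
vocab, dim, num_atoms, n_ctx = 200, 16, 8, 4   # toy sizes, not from the paper

token_emb = rng.normal(size=(vocab, dim))      # stand-in for token embedding weights
frozen_atoms = quantize(token_emb, num_atoms)  # non-learnable component, shape (8, 16)

# Learnable tiny matrices: a (2, 4) and a (4, 4) factor give an (8, 16) product.
# Assumed initialization: a zero factor makes the learnable part start at zero.
factors = [np.zeros((2, 4)), rng.normal(size=(4, 4))]
coeffs = rng.normal(size=(n_ctx, num_atoms))   # learnable combination weights

context = compose_context(coeffs, frozen_atoms, factors)
n_learnable = coeffs.size + sum(f.size for f in factors)
print(context.shape, n_learnable)
```

In this toy setup the two Kronecker factors hold 8 + 16 = 24 values, whereas a freely learnable 8 × 16 dictionary would hold 128, which is the kind of saving the abstract attributes to the Kronecker structure.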