Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
翻译:领域泛化研究的问题是如何利用来自多个领域(或分布)的样本训练模型,并随后用来自未见新领域的样本测试模型。本文提出了一种利用大型视觉语言模型(特别是CLIP教师模型)最新进展的领域泛化新方法,旨在训练一个能泛化至未见领域的小模型。关键技术贡献在于提出了一种新型正则化方法,要求学生的图像表示与教师通过编码图像对应文本描述获得的文本表示保持接近。我们设计了两种损失函数(绝对距离与相对距离),为学生的训练过程提供具体正则化指导。我们将所提方法命名为RISE(基于语义嵌入的正则化不变性),并在多个基准数据集上进行了评估,结果表明该方法优于多种现有领域泛化方法。据我们所知,本文首次将基于大型视觉语言模型的知识蒸馏应用于领域泛化。通过整合文本信息,RISE显著提升了机器学习模型的泛化能力。