Even when using large multi-modal foundation models, few-shot learning remains challenging -- without a proper inductive bias, it is nearly impossible to keep the nuanced class attributes while removing the visually prominent attributes that spuriously correlate with class labels. To this end, we find an inductive bias in the time-steps of a Diffusion Model (DM) that can isolate the nuanced class attributes: as the forward diffusion adds noise to an image at each time-step, nuanced attributes are usually lost at an earlier time-step than the spurious attributes that are visually prominent. Building on this, we propose the Time-step Few-shot (TiF) learner. We train class-specific low-rank adapters for a text-conditioned DM to make up for the lost attributes, such that images can be accurately reconstructed from their noised versions given a prompt. Hence, at a small time-step, the adapter and prompt are essentially a parameterization of only the nuanced class attributes. For a test image, we can use this parameterization to extract only the nuanced class attributes for classification. The TiF learner significantly outperforms OpenCLIP and its adapters on a variety of fine-grained and customized few-shot learning tasks. Code is available at https://github.com/yue-zhongqi/tif.
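The classification rule described above -- score each class by how well its adapted denoiser reconstructs the test image at a small time-step -- can be illustrated with a toy sketch. This is not the paper's implementation: the per-class "adapter" below is a closed-form stand-in that assumes the clean image equals a class prototype, and all names (`add_noise`, `classify`, the prototypes) are illustrative assumptions.

```python
import math
import random

def add_noise(x0, eps, abar):
    # Forward diffusion at one time-step: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps.
    # A large abar (close to 1) corresponds to a small time-step / little noise.
    return [math.sqrt(abar) * a + math.sqrt(1 - abar) * e
            for a, e in zip(x0, eps)]

def class_denoise_error(xt, eps, abar, proto):
    # Toy class-specific "adapter": it predicts the noise under the assumption
    # that the clean image is this class's prototype. A real TiF adapter would
    # be a learned LoRA on a text-conditioned DM; this closed form just makes
    # the scoring rule concrete.
    eps_hat = [(v - math.sqrt(abar) * p) / math.sqrt(1 - abar)
               for v, p in zip(xt, proto)]
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps))

def classify(x0, protos, abar=0.9, seed=0):
    # Noise the test image once, then pick the class whose "adapter"
    # best predicts the injected noise (lowest denoising error).
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = add_noise(x0, eps, abar)
    errors = [class_denoise_error(xt, eps, abar, p) for p in protos]
    return min(range(len(errors)), key=errors.__getitem__)
```

With prototypes `[1, 0, 0]` and `[0, 1, 0]`, an input close to the first prototype is assigned class 0 and one close to the second is assigned class 1, since the denoising error here reduces to the distance between the image and each class prototype.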