Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, which preserve sufficient information for quickly training a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily in the vision-language space. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills the image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach to three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation approach almost doubles that to 9.9% with just 100 (an order of magnitude fewer) training pairs.
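To make the contrastive formulation and the recall@1 metric concrete, the following is a minimal NumPy sketch rather than the method itself. It assumes a CLIP-style symmetric InfoNCE loss over a batch of matched image-text embeddings; the abstract does not specify the exact loss. The function names, the `temperature` value, and the one-caption-per-image simplification are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of img_emb is paired with row i of txt_emb.

    No class labels are needed; the other pairs in the batch act as negatives.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def recall_at_1_i2t(img_emb, txt_emb):
    """Fraction of images whose most similar text is the paired one.

    Simplification (assumption): exactly one caption per image.
    """
    sims = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    return float((sims.argmax(axis=1) == np.arange(sims.shape[0])).mean())
```

In this sketch, perfectly aligned embeddings give a near-zero loss and a recall@1 of 1.0, while mismatched pairs raise the loss and lower the recall. Benchmarks such as Flickr30K provide several captions per image, so a faithful evaluation would count a hit when any of an image's captions ranks first.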