Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of strategies (truncation, random masking, block masking, and syntax masking) and has reported syntax masking as the top performer. In this paper, we analyze the impact of different text masking strategies on the word frequency distribution of the training data, and show that this impact is connected to model success. This finding motivates Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), our proposed masking approach, which directly leverages word frequency. Extensive experiments demonstrate the advantages of CLIPF over syntax masking and other existing approaches, particularly as the number of input tokens decreases. We further show that not only CLIPF but also other existing masking strategies outperform syntax masking when trained for a sufficient number of epochs, a finding of practical importance for selecting a text masking method for VLM training. Our code is available online.
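To make the idea of frequency-based text masking concrete, the following is a minimal illustrative sketch, not the paper's exact CLIPF formulation: it assumes a simple rule that preferentially keeps the rarest words in a caption (low-frequency words tend to be more informative) and masks the rest. The function name `frequency_mask` and the keep-the-rarest scoring rule are hypothetical simplifications for illustration.

```python
from collections import Counter
import random

def frequency_mask(tokens, freq, keep=5, seed=0):
    """Keep the `keep` tokens with the lowest corpus frequency,
    masking (dropping) the rest; surviving tokens retain their
    original caption order. `freq` maps word -> corpus count.
    NOTE: hypothetical scoring rule for illustration only; the
    paper's actual CLIPF masking scheme may differ."""
    rng = random.Random(seed)
    # Rank token positions by (corpus frequency, random tiebreak).
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (freq.get(tokens[i], 0), rng.random()),
    )
    kept = set(ranked[:keep])
    # Emit kept tokens in their original order.
    return [t for i, t in enumerate(tokens) if i in kept]

# Toy example: frequent function words ("a", "the") are masked first.
caption = "a photo of a small brown dog playing in the park".split()
counts = Counter("a a a the the of in photo small brown dog playing park".split())
print(frequency_mask(caption, counts, keep=5))
```

Truncation, random masking, and block masking can be expressed in the same interface by swapping the ranking key, which is what makes their effect on the surviving word-frequency distribution directly comparable.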