We introduce Gaussian masking for Language-Image Pre-Training (GLIP), a novel, straightforward, and effective technique for masking image patches during pre-training of a vision-language model. GLIP builds on Fast Language-Image Pre-Training (FLIP), which randomly masks image patches while training a CLIP model. GLIP replaces random masking with centered masking that uses a Gaussian distribution, motivated by the importance of image patches at the center of the image. GLIP retains the same computational savings as FLIP while improving performance across a range of downstream datasets and tasks, as demonstrated by our experimental results. We show that the benefits of GLIP are easy to obtain, requiring no delicate tuning of the Gaussian, and that they extend to datasets containing images without an obvious center focus.
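To make the idea of centered masking concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each patch on the grid is kept with probability proportional to an isotropic 2D Gaussian centered on the image, and that a fixed fraction of patches is masked, as in FLIP. The function names (`gaussian_patch_weights`, `sample_kept_patches`), the normalized coordinate range, and the default `sigma` and `mask_ratio` values are all illustrative assumptions.

```python
import numpy as np


def gaussian_patch_weights(grid_h, grid_w, sigma=0.5):
    """Return per-patch sampling probabilities that peak at the grid center.

    Assumption: patch coordinates are normalized to [-1, 1] and the Gaussian
    is isotropic with standard deviation `sigma` in those units.
    """
    ys = np.linspace(-1.0, 1.0, grid_h)
    xs = np.linspace(-1.0, 1.0, grid_w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    weights = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    # Normalize so the flattened weights form a probability distribution.
    return (weights / weights.sum()).ravel()


def sample_kept_patches(grid_h, grid_w, mask_ratio=0.5, sigma=0.5, rng=None):
    """Sample the indices of patches to keep (all other patches are masked).

    Patches are drawn without replacement, so central patches are more
    likely to survive but peripheral patches can still be selected.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_patches = grid_h * grid_w
    n_keep = max(1, int(round(n_patches * (1.0 - mask_ratio))))
    probs = gaussian_patch_weights(grid_h, grid_w, sigma)
    kept = rng.choice(n_patches, size=n_keep, replace=False, p=probs)
    return np.sort(kept)


if __name__ == "__main__":
    # Example: a 14x14 patch grid (e.g. a 224x224 image with 16x16 patches).
    kept = sample_kept_patches(14, 14, mask_ratio=0.5,
                               rng=np.random.default_rng(0))
    print(kept.shape)
```

Under these assumptions, the returned indices could be used to gather the corresponding patch tokens before they enter the image encoder, so the number of encoded tokens, and hence the compute cost, matches random masking at the same mask ratio. A uniform choice of `probs` would recover FLIP-style random masking.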