Image-caption pretraining has been used successfully for downstream vision tasks such as zero-shot image classification and object detection. However, image-caption pretraining remains a hard problem: it requires multiple concepts (nouns) from captions to be aligned to several objects in images. To tackle this problem, we go back to the roots and look at the best learners, children. We take inspiration from cognitive science studies of children's language learning to propose a curriculum learning framework. Learning begins with easy-to-align image-caption pairs containing one concept per caption. The difficulty is progressively increased with each new phase by adding one more concept per caption. Correspondingly, the knowledge acquired in each learning phase is used in subsequent phases to effectively constrain the learning problem to aligning one new concept-object pair in each phase. We show that this learning strategy improves over vanilla image-caption training in various settings, including pretraining from scratch, using a pretrained image encoder and/or a pretrained text encoder, and the low-data regime.
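The phase structure described in the abstract (one concept per caption first, then one more concept per phase) can be illustrated with a minimal data-ordering sketch. It is not the authors' implementation. The function names `count_concepts` and `build_phases`, the fixed noun `vocabulary`, and the whitespace-based matching are all assumptions made for illustration. The sketch covers only how pairs might be grouped into phases, not how knowledge from earlier phases constrains alignment in later ones.

```python
from collections import defaultdict


def count_concepts(caption, vocabulary):
    """Count distinct vocabulary nouns in a caption.

    Assumption: concepts are matched by lowercased whitespace tokens with
    trailing punctuation stripped; a real pipeline would likely use a
    part-of-speech tagger instead.
    """
    tokens = {tok.strip(".,!?").lower() for tok in caption.split()}
    return len(tokens & vocabulary)


def build_phases(pairs, vocabulary, max_concepts):
    """Group (image_id, caption) pairs into curriculum phases.

    Phase k (1-indexed) holds the pairs whose caption mentions exactly k
    concepts, so training can proceed from phase 1 to phase max_concepts.
    Pairs with zero concepts or more than max_concepts are skipped.
    """
    buckets = defaultdict(list)
    for image_id, caption in pairs:
        k = count_concepts(caption, vocabulary)
        if 1 <= k <= max_concepts:
            buckets[k].append((image_id, caption))
    return [buckets[k] for k in range(1, max_concepts + 1)]
```

Calling `build_phases(pairs, vocabulary, max_concepts=3)` on a list of `(image_id, caption)` tuples returns three lists. A training loop would iterate over them in order, so each phase introduces captions with one more concept than the previous phase.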