Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can make training both inefficient and less effective. Existing efforts address this challenge by either bottom-level patch extraction inside the ViT backbone or top-level patch abstraction outside it, but neither balances training efficiency and effectiveness well. Inspired by text summarization in natural language processing, we propose a Bottom-Up Patch Summarization approach named BUS, which coordinates bottom-level extraction and top-level abstraction to efficiently learn a concise summary of lengthy visual token sequences. Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform coarse-grained visual token extraction, and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) on top of the backbone for top-level visual abstraction. This bottom-up collaboration enables BUS to achieve high training efficiency while maintaining or even improving effectiveness. We evaluate our approach on various vision-language understanding and generation tasks and show competitive downstream task performance while improving training efficiency by 50%. Additionally, our model achieves state-of-the-art performance on many downstream tasks by increasing the input image resolution without increasing computational cost over baselines.
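To make the two-stage idea concrete, the following is a minimal, illustrative sketch of "extract, then abstract" on toy vectors. It is not the authors' TSPS or PAD. The function names `select_patches` and `abstract_patches`, the `keep_ratio` parameter, the cosine-similarity scoring, and the single softmax cross-attention step with fixed query vectors are all assumptions introduced here for illustration; the abstract does not specify how patches are scored or how the decoder is built.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(dot(a, a))
    nb = math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def select_patches(patches, text_vec, keep_ratio):
    """Bottom-level extraction stand-in (hypothetical, not the paper's TSPS):
    keep the patch tokens most similar to a text embedding."""
    k = max(1, int(len(patches) * keep_ratio))
    scores = [cosine(p, text_vec) for p in patches]
    top = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)  # restore the original patch order
    return [patches[i] for i in kept], kept


def abstract_patches(tokens, queries):
    """Top-level abstraction stand-in (hypothetical, not the paper's PAD):
    each query vector softmax-attends over the kept tokens, so the output
    length equals the number of queries, not the number of tokens."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    summary = []
    for q in queries:
        logits = [dot(q, t) * scale for t in tokens]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        summary.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return summary


# Toy data: four 2-d "patch tokens" and one 2-d "text embedding".
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text_vec = [1.0, 0.0]

kept_tokens, kept_idx = select_patches(patches, text_vec, keep_ratio=0.5)
queries = [[1.0, 0.0]]  # in a trained model these would be learned
summary = abstract_patches(kept_tokens, queries)
```

In this toy run the selector keeps half of the patches (those best aligned with the text vector), and the abstraction step compresses them into one summary vector per query. The point of the sketch is only the shape of the pipeline: the sequence shrinks once inside the backbone and again on top of it.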