This paper aims to establish a generic multi-modal foundation model that scales to a massive range of downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, because natural images and product images differ significantly, directly applying these frameworks, which model image-level representations, to E-commerce is inevitably sub-optimal. To this end, we propose ECLIP, an instance-centric multi-modal pretraining paradigm. Specifically, we design a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to enable the model to focus on the desired product instance without relying on expensive manual annotations, we further propose two specially configured pretext tasks. Pretrained on 100 million E-commerce-related samples, ECLIP extracts more generic, semantically rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating strong transferability to real-world E-commerce applications.
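To make the idea of learnable instance queries concrete, the following is a minimal sketch of how a small set of query vectors could pool patch-level image features through single-head cross-attention. It is an illustration under stated assumptions, not ECLIP's actual decoder: the function name `aggregate_instances`, the sizes `K`, `N`, and `D`, and the random stand-in values are all hypothetical, and the sketch omits the learned projections, multi-head attention, stacked layers, and pretext tasks that a real model would use.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_instances(queries, patches):
    """Single-head cross-attention without learned projections (assumption).

    queries: K vectors of size D, stand-ins for learnable instance queries.
    patches: N vectors of size D, stand-ins for image patch features.
    Returns (instance_features, attention_weights).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        attn = softmax(scores)
        # Attention-weighted sum of patch features: one vector per query.
        pooled = [sum(a * p[d] for a, p in zip(attn, patches)) for d in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


# Hypothetical sizes: 3 instance queries, 6 image patches, feature size 4.
rng = random.Random(0)
K, N, D = 3, 6, 4
instance_queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
patch_features = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]

instance_features, attn = aggregate_instances(instance_queries, patch_features)
print(len(instance_features), len(instance_features[0]))
```

Each query yields one pooled vector, so the output has one feature per candidate instance rather than a single image-level embedding. In a trained model the queries would be parameters updated by gradient descent; here they are fixed random values for illustration only.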