Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce

This paper aims to establish a generic multi-modal foundation model that scales to a massive range of downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, because natural images and product images differ significantly, directly applying these frameworks, which model image-level representations, to E-commerce is inevitably sub-optimal. To this end, we propose an instance-centric multi-modal pretraining paradigm called ECLIP. Specifically, we design a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to enable the model to focus on the desired product instance without relying on expensive manual annotations, we further propose two specially configured pretext tasks. Pretrained on 100 million E-commerce-related data samples, ECLIP extracts more generic, semantically rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating strong transferability to real-world E-commerce applications.
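
The abstract does not spell out the decoder's internals, so the following is only a minimal sketch of the general idea of instance queries, assuming plain single-head dot-product cross-attention. Each query vector (learnable in a real model, random here) attends over the patch features of one image and pools them into one instance-level vector. The function name `aggregate_instances`, the sizes, and the random inputs are all illustrative assumptions and are not taken from ECLIP.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def aggregate_instances(queries, patch_features):
    """Single-head cross-attention: each query pools the patch features.

    queries:        K query vectors (hypothetical stand-ins for learnable queries)
    patch_features: N patch embeddings, as an image encoder might produce
    Returns K instance vectors and the K x N attention weights.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    instances, attention = [], []
    for q in queries:
        # Similarity of this query to every patch, normalized to sum to 1.
        weights = softmax([dot(q, p) * scale for p in patch_features])
        # Weighted average of the patch features, one value per dimension.
        pooled = [
            sum(w * p[d] for w, p in zip(weights, patch_features))
            for d in range(dim)
        ]
        instances.append(pooled)
        attention.append(weights)
    return instances, attention


# Illustrative sizes only; none of these numbers come from the paper.
random.seed(0)
DIM, NUM_PATCHES, NUM_QUERIES = 8, 16, 3
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

instances, attention = aggregate_instances(queries, patches)
print(len(instances), len(instances[0]))
```

In this sketch every instance vector is a convex combination of patch features, so a query that places most of its attention on the patches covering a product would yield a representation dominated by that product rather than by the background.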

