E-commerce search engines comprise a retrieval phase and a ranking phase; the retrieval phase returns a set of candidate products for a given user query. Recently, vision-language (V+L) pre-training, which combines textual information with visual cues, has become popular for retrieval tasks. In this paper, we propose a novel V+L pre-training method to solve the retrieval problem in Taobao Search. We design a visual pre-training task based on contrastive learning, which outperforms common regression-based visual pre-training tasks. In addition, we adopt two negative sampling schemes tailored to the large-scale retrieval task. We also describe the online deployment of the proposed method in a real-world setting. Extensive offline and online experiments demonstrate the superior performance of our method on the retrieval task. The proposed method is deployed as one retrieval channel of Taobao Search and serves hundreds of millions of users in real time.