Taobao Search consists of two phases: the retrieval phase and the ranking phase. Given a user query, the retrieval phase returns a subset of candidate products for the subsequent ranking phase. Recently, the paradigm of pre-training and fine-tuning has shown its potential for incorporating visual clues into retrieval tasks. In this paper, we focus on the problem of text-to-multimodal retrieval in Taobao Search. We posit that the attention users pay to titles versus images varies across products. Hence, we propose a novel Modal Adaptation module for cross-modal fusion, which helps assign appropriate weights to text and images for each product. Furthermore, in e-commerce search, user queries tend to be brief and thus lead to a significant semantic imbalance between user queries and product titles. Therefore, we design a separate text encoder and a Keyword Enhancement mechanism to enrich the query representations and improve text-to-multimodal matching. To this end, we present a novel vision-language (V+L) pre-training method that exploits the multimodal information of (user query, product title, product image) triples. Extensive experiments demonstrate that our retrieval-specific pre-training model (referred to as MAKE) outperforms existing V+L pre-training methods on the text-to-multimodal retrieval task. MAKE has been deployed online and brings major improvements to the retrieval system of Taobao Search.
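To make the idea of per-product modality weighting concrete, the following is a minimal sketch of a gated fusion of a title embedding and an image embedding, followed by a dot-product match against a query embedding. It is an illustration only and not the Modal Adaptation module of MAKE, whose architecture the abstract does not specify. The scalar sigmoid gate, the function names (`fuse`, `score`), and the parameters (`gate_weights`, `gate_bias`) are all assumptions introduced here.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def fuse(title_vec, image_vec, gate_weights, gate_bias=0.0):
    """Blend title and image embeddings with a per-product scalar gate.

    Hypothetical design: the gate is computed from the concatenation of
    both embeddings, so the title/image balance can differ per product.
    gate_weights must have length len(title_vec) + len(image_vec).
    """
    g = sigmoid(dot(gate_weights, title_vec + image_vec) + gate_bias)
    fused = [g * t + (1.0 - g) * v for t, v in zip(title_vec, image_vec)]
    return fused, g


def score(query_vec, product_vec):
    """Relevance score between a query and a fused product embedding."""
    return dot(query_vec, product_vec)


# Toy two-dimensional embeddings (made-up values, for illustration only).
title = [1.0, 0.0]
image = [0.0, 1.0]

# All-zero gate weights give an even split between title and image.
balanced, g_balanced = fuse(title, image, [0.0, 0.0, 0.0, 0.0])

# A large weight on the first title dimension pushes the gate toward the title.
title_heavy, g_title = fuse(title, image, [10.0, 0.0, 0.0, 0.0])

query = [1.0, 0.0]
print(g_balanced, score(query, balanced))
print(g_title, score(query, title_heavy))
```

In a trained model the gate parameters would be learned jointly with the encoders. Here they are set by hand to show how the same query scores differently as the title/image balance shifts.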