Mixer: Image to Multi-Modal Retrieval Learning for Industrial Application

Cross-modal retrieval, where the query is an image and the doc is an item with both image and text description, is ubiquitous in e-commerce platforms and content-sharing social media. However, little research attention has been paid to this important application. This type of retrieval task is challenging for four reasons: 1) a domain gap exists between query and doc; 2) the modalities must be aligned and fused; 3) training data collected from user behavior is skewed and its labels are noisy; and 4) a huge number of queries must be answered promptly against a large-scale pool of candidate docs. To this end, we propose a novel, scalable, and efficient image-query to multi-modal retrieval learning paradigm called Mixer, which adaptively integrates multi-modality data, mines skewed and noisy data more efficiently, and scales to high traffic. Mixer consists of three key ingredients. First, for query and doc images, a shared encoder network followed by separate transformation networks is used to account for their domain gap. Second, because images and text in a multi-modal doc are not equally informative, we design a concept-aware modality fusion module, which extracts high-level concepts from the text through a text-to-image attention mechanism. Lastly, and most importantly, we turn to a new data organization and training paradigm for single-modal to multi-modal retrieval: large-scale classification learning, which treats single-modal queries and multi-modal docs as equivalent samples of certain classes. In addition, the data organization is weakly supervised, which lets it cope with the skewed data and noisy labels inherent in industrial systems. Learning such a large number of categories for real-world multi-modality data is non-trivial, so we design a specific learning strategy for it. The proposed Mixer achieves SOTA performance on public datasets from industrial retrieval systems.
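The three ingredients named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the layer shapes, the random stand-in weights, the choice of the doc image embedding as the attention query over text tokens, the additive fusion, and the cosine-similarity classifier are all assumptions made for illustration, and every name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D, N_TOKENS, N_CLASSES = 32, 16, 5, 10  # hypothetical sizes

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Random stand-ins for learned parameters (assumption: single linear layers).
W_shared = rng.normal(size=(D_IN, D)) * 0.1     # shared image encoder
W_query = rng.normal(size=(D, D)) * 0.1         # query-side transformation
W_doc = rng.normal(size=(D, D)) * 0.1           # doc-side transformation
W_class = rng.normal(size=(N_CLASSES, D)) * 0.1 # one weight vector per class

def encode_image(features, W_branch):
    """Shared encoder followed by a branch-specific transformation."""
    shared = np.tanh(features @ W_shared)
    return shared @ W_branch

def fuse_doc(img_emb, text_tokens):
    """Attention over text tokens, with the doc image embedding as the query.

    img_emb: (B, D); text_tokens: (B, T, D). The attended text vector is
    added to the image embedding (assumed fusion rule).
    """
    scores = np.einsum("bd,btd->bt", img_emb, text_tokens) / np.sqrt(D)
    weights = softmax(scores, axis=-1)
    concept = np.einsum("bt,btd->bd", weights, text_tokens)
    return img_emb + concept, weights

def class_logits(emb):
    """Cosine similarity between embeddings and class weight vectors."""
    return l2_normalize(emb) @ l2_normalize(W_class).T

def cross_entropy(logits, labels):
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch: four query images and four multi-modal docs with class labels.
query_feat = rng.normal(size=(4, D_IN))
doc_feat = rng.normal(size=(4, D_IN))
doc_text = rng.normal(size=(4, N_TOKENS, D))
labels = np.array([0, 3, 3, 7])

query_emb = encode_image(query_feat, W_query)
doc_emb, attn = fuse_doc(encode_image(doc_feat, W_doc), doc_text)

# Queries and docs are treated as samples of the same classes, so both
# kinds of embedding go through one classifier and one loss.
logits = class_logits(np.concatenate([query_emb, doc_emb]))
loss = cross_entropy(logits, np.concatenate([labels, labels]))
```

In this sketch, retrieval at serving time would reduce to nearest-neighbor search between `query_emb` and precomputed `doc_emb` vectors; the classifier is only needed during training.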
