Modern e-commerce search is inherently multimodal: customers make purchase decisions by jointly considering product text and visual information. However, most industrial retrieval and ranking systems rely primarily on textual information, underutilizing the rich visual signals available in product images. In this work, we study unified text-image fusion for two-tower retrieval models in the e-commerce domain. We demonstrate that domain-specific fine-tuning and two-stage alignment between the query and the product text and image modalities are both crucial for effective multimodal retrieval. Building on these insights, we propose a novel modality fusion network that fuses image and text information and captures cross-modal complementary signals. Experiments on large-scale e-commerce datasets validate the effectiveness of the proposed approach.
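To make the two-tower setup concrete, the following is a minimal sketch of how a product tower might fuse text and image embeddings before scoring against a query embedding. The gated fusion here, the embedding dimension, and all variable names are illustrative assumptions, not the paper's actual network; the paper's fusion architecture is not specified in this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension

def l2_normalize(x):
    # unit-normalize so the dot product below is cosine similarity
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def gated_fusion(text_emb, image_emb, W_g):
    # hypothetical fusion: a sigmoid gate decides, per dimension,
    # how much text vs. image signal enters the product embedding
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([text_emb, image_emb]) @ W_g)))
    return gate * text_emb + (1.0 - gate) * image_emb

# stand-in embeddings (in practice these come from trained encoders)
query_emb  = l2_normalize(rng.normal(size=d))   # query tower output
prod_text  = l2_normalize(rng.normal(size=d))   # product text encoder output
prod_image = l2_normalize(rng.normal(size=d))   # product image encoder output

W_g = rng.normal(size=(2 * d, d)) * 0.1         # hypothetical gate weights
product_emb = l2_normalize(gated_fusion(prod_text, prod_image, W_g))

# retrieval score: cosine similarity between the two tower outputs
score = float(query_emb @ product_emb)
```

In a real two-tower retriever the product embeddings would be precomputed and indexed for approximate nearest-neighbor search, and the gate weights learned jointly with both towers.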