Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan

Multimodal retrieval models are increasingly important in scenarios such as food delivery, where rich multimodal features help meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture over queries and items and jointly optimize intra-tower and inter-tower tasks. However, we observe that joint optimization often lets certain modalities dominate training while others are neglected. In addition, inconsistent training speeds across modalities can easily lead to the one-epoch problem. To address these challenges, we propose a staged pretraining strategy that guides the model to focus on a specialized task at each stage, so that it attends to and uses multimodal features effectively, and that allows the training process at each stage to be controlled flexibly to avoid the one-epoch problem. Furthermore, to better exploit semantic IDs (SIDs), which compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks that help the model learn the associations among SIDs, queries, and item features, thereby improving overall performance. Extensive experiments on large-scale real-world Meituan data show that our method improves over mainstream baselines by 3.80%, 2.64%, and 2.17% on R@5, R@10, and R@20, and by 5.10%, 4.22%, and 2.09% on N@5, N@10, and N@20, respectively. Online A/B testing on the Meituan platform shows a 1.12% increase in revenue and a 1.02% increase in click-through rate, validating the effectiveness of our method in practical applications.

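The offline results are reported as R@K and N@K. Assuming these denote Recall@K and NDCG@K with binary relevance (the abstract does not spell them out), a minimal sketch of how such metrics are commonly computed for a single query could look like the following. The function names and the toy item IDs are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    # Position `rank` is 0-based, so the discount is 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: four retrieved items, two of which are relevant.
ranked_items = ["a", "b", "c", "d"]
relevant_items = {"b", "d"}
print(recall_at_k(ranked_items, relevant_items, 2))
print(ndcg_at_k(ranked_items, relevant_items, 2))
```

In practice these per-query values would be averaged over all evaluation queries; whether the reported percentages are relative or absolute gains over the baselines is not stated in the abstract.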