Product embeddings are a cornerstone of many eCommerce applications. Embeddings learned from multiple modalities improve significantly on those learned from a single modality, because different modalities provide complementary information. However, some modalities carry more information than others and tend to dominate. Teaching a model to learn an embedding from several modalities without neglecting the less dominant ones is therefore challenging. We present the image-text embedding model (ITEm), an unsupervised learning method designed to attend better to both the image and text modalities. We extend BERT by (1) learning an embedding from text and image without knowing the regions of interest, and (2) training a global representation to predict masked words and to construct masked image patches without their individual representations. We evaluate the pre-trained ITEm on two tasks, the search for extremely similar products and the prediction of product categories, and show substantial gains over strong baseline models.
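To make the two masked objectives more concrete, the following is a minimal sketch of how a masked image-text training example might be prepared. It is an illustration under stated assumptions, not the ITEm implementation: the fixed patch grid (used here as one way to avoid regions of interest), the zero-valued mask patch, the function names, and the masking rate of 0.15 (BERT's default; the rate used by ITEm is not given here) are all hypothetical choices. The model itself, which would predict the masked words and construct the masked patches from a global representation, is omitted.

```python
import random

MASK_TOKEN = "[MASK]"  # BERT's mask token for words


def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Assumption: a fixed grid stands in for "no regions of interest".
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def mask_sequence(items, mask_value, mask_prob, rng):
    """Replace each item with mask_value with probability mask_prob.

    Returns the masked sequence and a dict {position: original item}
    holding the targets a model would have to recover.
    """
    masked, targets = [], {}
    for position, item in enumerate(items):
        if rng.random() < mask_prob:
            masked.append(mask_value)
            targets[position] = item
        else:
            masked.append(item)
    return masked, targets


def build_training_example(tokens, image, patch_size=2, mask_prob=0.15, seed=0):
    """Build one masked image-text example (hypothetical helper)."""
    rng = random.Random(seed)
    patches = split_into_patches(image, patch_size)
    # Assumption: a masked patch is represented by an all-zero patch.
    zero_patch = [[0] * patch_size for _ in range(patch_size)]
    masked_tokens, word_targets = mask_sequence(tokens, MASK_TOKEN, mask_prob, rng)
    masked_patches, patch_targets = mask_sequence(patches, zero_patch, mask_prob, rng)
    return {
        "tokens": masked_tokens,          # input for masked word prediction
        "patches": masked_patches,        # input for masked patch construction
        "word_targets": word_targets,     # original words at masked positions
        "patch_targets": patch_targets,   # original patches at masked positions
    }


if __name__ == "__main__":
    toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
    example = build_training_example(["red", "cotton", "shirt"], toy_image, mask_prob=0.5)
    print(example["tokens"], len(example["patches"]))
```

In this sketch both modalities are masked independently with the same probability, so every training example can carry word targets and patch targets. Whether ITEm balances the two modalities this way is not stated in the abstract.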