The proliferation of short video and live-streaming platforms has revolutionized how consumers engage in online shopping. Instead of browsing product pages, consumers are now turning to rich-content e-commerce, where they can purchase products through dynamic and interactive media like short videos and live streams. This emerging form of online shopping has introduced technical challenges, as products may be presented differently across various media domains. Therefore, a unified product representation is essential for achieving cross-domain product recognition to ensure an optimal user search experience and effective product recommendations. Despite the urgent industrial need for a unified cross-domain product representation, previous studies have predominantly focused only on product pages without taking into account short videos and live streams. To fill the gap in the rich-content e-commerce area, in this paper, we introduce a large-scale cRoss-dOmain Product Ecognition dataset, called ROPE. ROPE covers a wide range of product categories and contains over 180,000 products, corresponding to millions of short videos and live streams. It is the first dataset to cover product pages, short videos, and live streams simultaneously, providing the basis for establishing a unified product representation across different media domains. Furthermore, we propose a Cross-dOmain Product rEpresentation framework, namely COPE, which unifies product representations in different domains through multimodal learning including text and vision. Extensive experiments on downstream tasks demonstrate the effectiveness of COPE in learning a joint feature space for all product domains.
翻译:短视频与直播平台的蓬勃发展彻底改变了消费者的线上购物方式。消费者不再局限于浏览商品页面,而是转向通过短视频、直播等动态交互媒介进行商品购买的富内容电商模式。这种新型在线购物形态带来了技术挑战:商品在不同媒体领域中可能呈现迥异的形态。因此,要实现跨领域商品识别、确保最佳用户搜索体验和精准商品推荐,亟需构建统一的商品表征。尽管工业界对统一跨域商品表征存在迫切需求,现有研究却主要聚焦于商品页面,未能将短视频与直播纳入考量。为填补富内容电商领域的研究空白,本文提出了大规模跨域商品识别数据集ROPE。该数据集覆盖广泛商品品类,包含超过18万件商品及其对应的数百万条短视频与直播内容,是首个同时涵盖商品页面、短视频和直播的数据集,为建立跨媒体领域的统一商品表征奠定基础。此外,我们提出了跨域商品表征框架COPE,通过文本与视觉的多模态学习实现不同领域商品表征的统一。下游任务的大量实验证明,COPE在学习所有商品域联合特征空间方面具有显著有效性。