Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation

Wei Jin, Haitao Mao, Zheng Li, Haoming Jiang, Chen Luo, Hongzhi Wen, Haoyu Han, Hanqing Lu, Zhengyang Wang, Ruirui Li, Zhen Li, Monica Xiao Cheng, Rahul Goutam, Haiyang Zhang, Karthik Subbian, Suhang Wang, Yizhou Sun, Jiliang Tang, Bing Yin, Xianfeng Tang

From arXiv. Dataset for KDD Cup 2023: https://kddcup23.github.io/

Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Accurately understanding customer preferences is therefore essential for providing personalized recommendations. Session-based recommendation, which uses customer session data to predict the next interaction, has become increasingly popular. However, existing session datasets are limited in item attributes, user diversity, and scale, so they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. The dataset can help enhance personalization and the understanding of user preferences, which can benefit existing tasks and enable new ones. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With these tasks, we benchmark a range of algorithms on the proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition at KDD Cup 2023 that attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website, https://kddcup23.github.io/.
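
To make the next-product recommendation task concrete, here is a minimal sketch of a transition-count baseline: it recommends the items that most often followed the session's last item in the training sessions. This is not one of the algorithms benchmarked in the paper. The function names, the toy sessions with made-up product IDs, and the reciprocal-rank scoring are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def fit_next_item_counts(sessions):
    """Count how often each item is immediately followed by another item."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current_item, next_item in zip(session, session[1:]):
            transitions[current_item][next_item] += 1
    return transitions


def recommend_next(transitions, session, k=3):
    """Rank candidates by how often they followed the session's last item."""
    if not session:
        return []
    ranked = transitions.get(session[-1], Counter()).most_common(k)
    return [item for item, _ in ranked]


def reciprocal_rank(recommended, true_next):
    """Return 1/rank of the true next item, or 0.0 if it was not recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item == true_next:
            return 1.0 / rank
    return 0.0


# Toy sessions with hypothetical product IDs (not taken from Amazon-M2).
train_sessions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C"],
    ["A", "C"],
]
model = fit_next_item_counts(train_sessions)

# Predict the next product for a session that ends in "B".
predictions = recommend_next(model, ["A", "B"])
score = reciprocal_rank(predictions, "D")
```

A baseline like this ignores product text and locale entirely, which is the kind of information the dataset's item attributes and multilingual titles are meant to add.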

