AntM$^{2}$C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction

Click-through rate (CTR) prediction is a crucial problem in recommendation systems, and a variety of public CTR datasets have emerged. However, existing datasets suffer from three main limitations. First, users generally click different types of items across multiple scenarios, and modeling across scenarios gives a more comprehensive understanding of users, yet existing datasets only include data for a single type of item from a single scenario. Second, multi-modal features are essential in multi-scenario prediction because they address the inconsistency of ID encoding between scenarios, yet existing datasets are based on ID features and lack multi-modal features. Third, a large-scale dataset provides a more reliable evaluation of models and more fully reflects the performance differences between them, yet existing datasets contain around 100 million samples, which is small relative to real-world CTR prediction. To address these limitations, we propose AntM$^{2}$C, a Multi-Scenario Multi-Modal CTR dataset based on industrial data from Alipay. Specifically, AntM$^{2}$C provides the following advantages: 1) It covers CTR data for 5 different types of items (advertisements, vouchers, mini-programs, contents, and videos), providing insight into users' preferences across item types. 2) Apart from ID-based features, AntM$^{2}$C also provides 2 multi-modal features, raw text and image features, which can effectively establish connections between items with different IDs. 3) AntM$^{2}$C provides 1 billion CTR samples with 200 features, covering 200 million users and 6 million items. It is currently the largest CTR dataset available. Based on AntM$^{2}$C, we construct several typical CTR tasks and provide comparisons with baseline methods. The dataset homepage is available at https://www.atecup.cn/home.
