MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction

With the widespread adoption of personalized online services, click-through rate (CTR) prediction has attracted growing research attention. CTR prediction is characterized by its multi-field categorical data format and by data volumes that are vast and grow daily. The large capacity of neural models helps digest such massive amounts of data under the supervised learning paradigm, yet these models fail to exploit the data to its full potential, since the 1-bit click signal is not sufficient to guide a model toward capable representations of features and instances. The self-supervised learning paradigm offers a more promising pretrain-finetune solution that better exploits the large volume of user click logs and learns more generalized and effective representations. However, self-supervised learning for CTR prediction remains an open question, since existing work in this direction is only preliminary. To this end, we propose a Model-agnostic Pretraining (MAP) framework that applies feature corruption and recovery to multi-field categorical data. More specifically, we derive two practical algorithms: masked feature prediction (MFP) and replaced feature detection (RFD). MFP captures feature interactions within each instance by masking and predicting a small portion of the input features, and introduces noise contrastive estimation (NCE) to handle large feature spaces. RFD further turns MFP into a binary classification task by replacing input features and detecting the changes, making it even simpler and more effective for CTR pretraining. Extensive experiments on two real-world large-scale datasets (i.e., Avazu and Criteo) demonstrate the advantages of both methods on several strong backbones (e.g., DCNv2 and DeepFM), achieving new state-of-the-art performance in terms of both effectiveness and efficiency for CTR prediction.
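To make the corruption step concrete, the following is a minimal sketch of how one instance might be corrupted for MFP and for RFD. It is an illustration under stated assumptions, not the paper's implementation: an instance is assumed to be a list of per-field integer feature ids, id 0 is assumed to be reserved as a mask token in every field, replacements are assumed to be drawn uniformly from the same field, and the names `corrupt_for_mfp`, `corrupt_for_rfd`, `MASK_ID`, and `field_vocab_sizes` are hypothetical.

```python
import random

# Assumption: id 0 is reserved as a [MASK] token in every field,
# so real feature ids in a field with vocabulary size V are 1..V-1.
MASK_ID = 0


def corrupt_for_mfp(instance, mask_ratio=0.1, rng=random):
    """Mask a random subset of fields.

    Returns (corrupted, targets), where targets maps each masked field
    index to the original feature id the model would have to predict.
    """
    n_fields = len(instance)
    n_mask = max(1, round(n_fields * mask_ratio))
    masked_fields = rng.sample(range(n_fields), n_mask)
    corrupted = list(instance)
    targets = {}
    for f in masked_fields:
        targets[f] = instance[f]
        corrupted[f] = MASK_ID
    return corrupted, targets


def corrupt_for_rfd(instance, field_vocab_sizes, replace_ratio=0.1, rng=random):
    """Replace a random subset of fields with a different id from the same field.

    Returns (corrupted, labels), where labels[f] is 1 if field f was
    replaced and 0 otherwise (the per-field binary detection target).
    Assumes every field has at least two real feature ids.
    """
    n_fields = len(instance)
    n_replace = max(1, round(n_fields * replace_ratio))
    replaced = set(rng.sample(range(n_fields), n_replace))
    corrupted, labels = [], []
    for f, value in enumerate(instance):
        if f in replaced:
            # Uniform draw over ids 1..V-1 excluding the original value:
            # sample from a range one element shorter, then skip past `value`.
            new_value = rng.randrange(1, field_vocab_sizes[f] - 1)
            if new_value >= value:
                new_value += 1
            corrupted.append(new_value)
            labels.append(1)
        else:
            corrupted.append(value)
            labels.append(0)
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    instance = [3, 7, 2, 5, 1, 4, 6, 2, 9, 8]   # one feature id per field
    vocab_sizes = [10] * len(instance)          # hypothetical field vocabularies
    print(corrupt_for_mfp(instance, mask_ratio=0.2, rng=rng))
    print(corrupt_for_rfd(instance, vocab_sizes, replace_ratio=0.2, rng=rng))
```

In this sketch, MFP would train the backbone to recover `targets` from `corrupted` (with NCE standing in for a full softmax over the feature space), while RFD would train it to predict `labels` field by field as a binary classification problem.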
