With the development of the multimedia internet, visual characteristics have become an important factor in user interest. Incorporating visual features is therefore a promising direction for further improving click-through rate (CTR) prediction. However, we found that simply injecting image embeddings trained with established pre-training methods yields only marginal improvements. We attribute this to two reasons. First, these pre-training methods are designed for well-defined computer vision tasks that concentrate on semantic features, so they cannot learn the personalized interests that matter in recommendation. Second, pre-trained image embeddings that contain only semantic information offer little information gain, because the CTR prediction task already takes semantic features such as categories and item titles as inputs. We argue that a pre-training method tailored for recommendation is necessary for further improvements. To this end, we propose a recommendation-aware image pre-training method that learns visual features from user click histories. Specifically, we propose a user interest reconstruction module to mine visual features related to user interests from behavior histories. We further propose a contrastive training method to prevent the embedding vectors from collapsing. Extensive experiments verify that our method can learn users' visual interests; it achieves a $0.46\%$ improvement in offline AUC and a $0.88\%$ improvement in Taobao online GMV, with p-value $<0.01$.
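The abstract names two components, interest reconstruction from click histories and a contrastive objective, without giving their form. The following is a minimal sketch of how the two could fit together, not the method as implemented in the paper. It assumes that the next clicked item's image embedding serves as an attention query over the history embeddings, and that the contrastive objective is an InfoNCE-style loss with in-batch negatives. The function names (`reconstruct_interest`, `contrastive_loss`) and the temperature `tau` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct_interest(history, target, tau=0.1):
    """Assumed form: attention-pool the history image embeddings,
    using the target (next clicked) item embedding as the query."""
    history = [l2_normalize(h) for h in history]
    target = l2_normalize(target)
    weights = softmax([dot(h, target) / tau for h in history])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, history))
        for d in range(len(target))
    ]
    return l2_normalize(pooled)


def contrastive_loss(histories, targets, tau=0.1):
    """Assumed form: InfoNCE with in-batch negatives. Each user's
    reconstruction should match that user's own target item and
    differ from the target items of the other users in the batch."""
    targets = [l2_normalize(t) for t in targets]
    recons = [
        reconstruct_interest(h, t, tau) for h, t in zip(histories, targets)
    ]
    losses = []
    for i, recon in enumerate(recons):
        probs = softmax([dot(recon, t) / tau for t in targets])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)
```

Under these assumptions, a batch in which every image maps to the same vector cannot score below the log of the batch size, whereas a plain reconstruction error would be zero for such a batch. This is one way a contrastive objective can discourage collapsed embeddings.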