Item cold-start is a pervasive challenge for collaborative filtering (CF) recommender systems. Existing methods often train cold-start models by mapping auxiliary item content, such as images or text descriptions, into the embedding space of a CF model. However, such approaches can be limited by the fundamental information gap between CF signals and content features. In this work, we propose to avoid this limitation with purely content-based modeling of cold items, i.e., without alignment to CF user or item embeddings. We instead frame cold-start prediction in terms of item-item similarity, training a content encoder to project into a latent space where similarity correlates with user preferences. We define our training objective as a sparse generalization of sampled softmax loss using the $\alpha$-entmax family of activation functions, which allows for sharper estimation of item relevance by zeroing gradients for uninformative negatives. We then describe how this Sampled Entmax for Cold-start (SEMCo) training regime can be extended via knowledge distillation, and show that it outperforms existing cold-start methods and standard sampled softmax in ranking accuracy. We also discuss the advantages of purely content-based modeling, particularly in terms of equity of item outcomes.
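To illustrate how an entmax-based sampled loss can zero the gradients of uninformative negatives, the following is a minimal sketch rather than the SEMCo implementation itself. It assumes the standard Fenchel-Young form of the $\alpha$-entmax loss, solves for the threshold by bisection, and scores one positive item against a handful of sampled negatives. The function names `entmax` and `sampled_entmax_loss`, the default $\alpha = 1.5$, and the example scores are all hypothetical; the abstract does not specify the encoder, the sampling scheme, or any sampling correction, so none of these are modeled here.

```python
def entmax(scores, alpha=1.5, n_iter=60):
    """alpha-entmax of a list of scores (alpha > 1), computed by bisection.

    Solves p_i = max((alpha - 1) * s_i - tau, 0) ** (1 / (alpha - 1))
    for the threshold tau that makes the probabilities sum to 1.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only handles alpha > 1")
    z = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def unnormalized(tau):
        return [max(v - tau, 0.0) ** exponent for v in z]

    # The sum of unnormalized(tau) is >= 1 at `lo` and <= 1 at `hi`.
    lo = max(z) - 1.0
    hi = max(z) - len(z) ** (1.0 - alpha)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(unnormalized(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = unnormalized(lo)
    total = sum(p)
    return [v / total for v in p]


def sampled_entmax_loss(pos_score, neg_scores, alpha=1.5):
    """Entmax loss for one positive against sampled negatives.

    Returns (loss, grad), where grad is the gradient with respect to the
    scores [pos_score, *neg_scores]; it equals p - onehot(positive).
    """
    z = [pos_score] + list(neg_scores)
    p = entmax(z, alpha)
    onehot = [1.0] + [0.0] * len(neg_scores)
    # Tsallis alpha-entropy of p, using sum(p) == 1.
    entropy = (1.0 - sum(v ** alpha for v in p)) / (alpha * (alpha - 1.0))
    grad = [pi - yi for pi, yi in zip(p, onehot)]
    loss = sum(g * s for g, s in zip(grad, z)) + entropy
    return loss, grad


# Hypothetical similarity scores: one positive, one hard negative,
# and two easy negatives that fall below the entmax threshold.
loss, grad = sampled_entmax_loss(2.0, [1.5, -3.0, -4.0], alpha=1.5)
print(loss, grad)
```

In this example the two low-scoring negatives receive a gradient of exactly zero, whereas softmax would assign them small but nonzero gradients; only the positive and the hard negative contribute to the update.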