Incorporating item content information into click-through rate (CTR) prediction models remains a challenge, especially under the time and space constraints of industrial scenarios. The content-encoding paradigm, which integrates user and item encoders directly into CTR models, prioritizes space over time. In contrast, the embedding-based paradigm, which transforms item and user semantics into latent embeddings and then caches them, prioritizes time over space. In this paper, we introduce a new semantic-token paradigm and propose a discrete semantic tokenization approach, namely UIST, for user and item representation. UIST facilitates swift training and inference while maintaining a conservative memory footprint. Specifically, UIST quantizes dense embedding vectors into discrete tokens with shorter lengths and employs a hierarchical mixture inference module to weigh the contribution of each user-item token pair. Our experimental results on news recommendation showcase the effectiveness and efficiency (about 200-fold space compression) of UIST for CTR prediction.
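To make the quantization step concrete, the following is a minimal sketch of how a dense embedding could be mapped to a short sequence of discrete tokens. The abstract does not specify the quantizer, so residual quantization is used here purely as an assumption; the function names (`nearest`, `tokenize`, `reconstruct`), the random codebooks, and the sizes `DIM`, `LEVELS`, and `CODEBOOK_SIZE` are hypothetical and are not taken from UIST. The storage figure at the end is simple arithmetic under these assumed sizes and is not the compression ratio reported in the experiments.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize(vec, codebooks):
    """Residual quantization (an assumed scheme): emit one token per codebook level."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Pass whatever this level failed to explain on to the next level.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def reconstruct(tokens, codebooks):
    """Approximate the original vector by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical sizes, chosen only for illustration.
DIM, LEVELS, CODEBOOK_SIZE = 64, 4, 256

rng = random.Random(0)
# Random codebooks stand in for learned ones.
codebooks = [
    [[rng.gauss(0.0, 1.0 / (level + 1)) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for level in range(LEVELS)
]
embedding = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
tokens = tokenize(embedding, codebooks)

# Per-item storage: float32 embedding vs. one byte per token (CODEBOOK_SIZE <= 256).
# The shared codebooks are stored once and are ignored in this per-item figure.
dense_bytes = DIM * 4
token_bytes = LEVELS * 1
ratio = dense_bytes / token_bytes
print(tokens, ratio)
```

Under these assumed sizes, each cached item shrinks from 256 bytes to 4 bytes. The actual ratio depends on the embedding dimension, the number of tokens per item, and the codebook size used in practice.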