Incorporating item content information into click-through rate (CTR) prediction models remains a challenge, especially with the time and space constraints of industrial scenarios. The content-encoding paradigm, which integrates user and item encoders directly into CTR models, prioritizes space over time. In contrast, the embedding-based paradigm transforms item and user semantics into latent embeddings, subsequently caching them to optimize processing time at the expense of space. In this paper, we introduce a new semantic-token paradigm and propose a discrete semantic tokenization approach, namely UIST, for user and item representation. UIST facilitates swift training and inference while maintaining a conservative memory footprint. Specifically, UIST quantizes dense embedding vectors into discrete tokens with shorter lengths and employs a hierarchical mixture inference module to weigh the contribution of each user-item token pair. Our experimental results on news recommendation showcase the effectiveness and efficiency (about 200-fold space compression) of UIST for CTR prediction.
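To make the two mechanisms named above more concrete, the following is a minimal sketch and not the UIST implementation. It assumes greedy residual quantization as the tokenizer (the abstract does not say which quantizer is used), random untrained codebooks, and a softmax-weighted sum over user-item token pairs as a simplified stand-in for the hierarchical mixture inference module. All function names, shapes, and sizes are hypothetical.

```python
import numpy as np


def residual_quantize(vec, codebooks):
    """Greedy residual quantization (assumed technique): one token per level.

    codebooks is a list of arrays, each of shape (num_codes, dim).
    """
    residual = np.asarray(vec, dtype=np.float64).copy()
    tokens = []
    for codebook in codebooks:
        # Pick the code nearest to what is still unexplained.
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - codebook[idx]
    return tokens


def weighted_pair_score(user_tokens, item_tokens, user_tables, item_tables, pair_logits):
    """Hypothetical pair weighting: softmax weights over all (user, item) token pairs."""
    u = np.stack([tab[t] for tab, t in zip(user_tables, user_tokens)])  # (L, d)
    v = np.stack([tab[t] for tab, t in zip(item_tables, item_tokens)])  # (L, d)
    pair_scores = u @ v.T                                               # (L, L)
    w = np.exp(pair_logits - pair_logits.max())
    w = w / w.sum()
    return float((w * pair_scores).sum())


def compression_ratio(dim, levels, bytes_per_float=4, bytes_per_token=1):
    """Per-vector storage ratio: dense float vector vs. short token sequence.

    Ignores the codebooks themselves, which are shared across all users/items.
    """
    return (dim * bytes_per_float) / (levels * bytes_per_token)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, levels, num_codes = 64, 4, 256  # illustrative sizes, not from the paper
    codebooks = [rng.normal(size=(num_codes, dim)) * 0.5 ** l for l in range(levels)]
    item_vec = rng.normal(size=dim)
    print(residual_quantize(item_vec, codebooks))
    print(compression_ratio(dim, levels))
```

With these illustrative sizes, each cached vector shrinks from 256 bytes to 4 bytes. The actual ratio depends on the embedding dimension, token length, and codebook size used in practice, so this figure should not be read as reproducing the roughly 200-fold compression reported above.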