Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is essential for personalization. Recent Transformer-based methods such as BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and images), which leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic quantization, which discretizes the fused representations into meaningful tokens via residual vector quantization; and (3) multi-mask pretraining and fine-tuning, which leverage diverse masking strategies -- span, tail, and multi-region -- to improve sequential understanding. We validate our model on public Amazon benchmarks and show that Q-Bert4Rec significantly outperforms strong existing baselines, confirming the effectiveness of semantic tokenization for multimodal sequential recommendation. Our source code will be made publicly available on GitHub upon publication.
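To make the semantic-quantization stage concrete, the following is a minimal sketch of residual vector quantization: at each level the nearest codeword to the current residual is selected, emitted as a token, and subtracted, so deeper levels refine the remaining error. The function name, codebook shapes, and toy data below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize vector x into one token per codebook level.

    Each level picks the nearest codeword to the current residual
    and subtracts it; the token sequence forms a semantic ID.
    (Illustrative sketch, not the authors' implementation.)
    """
    tokens, residual = [], x.astype(float)
    for cb in codebooks:                          # cb: (K, d) codewords
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword
        tokens.append(idx)
        residual = residual - cb[idx]             # pass error to next level
    return tokens, residual

# Toy usage: 2 quantization levels, 4 codewords each, dimension 3.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 3)) for _ in range(2)]
x = rng.normal(size=3)
tokens, err = residual_vector_quantize(x, codebooks)
```

By construction, summing the chosen codewords reconstructs `x` up to the final residual `err`, which is what lets a short token tuple stand in for a continuous multimodal embedding.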