CARD: Non-Uniform Quantization of Visual Semantic Unit for Generative Recommendation

Generative recommendation frameworks typically represent items as discrete Semantic IDs (SIDs). While existing studies have sought to enhance SID construction by incorporating multimodal content, collaborative signals, or more advanced quantization techniques, learning high-quality SIDs still faces two key challenges: (1) the two-stage generative recommendation paradigm (SID construction followed by autoregressive generation) provides insufficient supervision for heterogeneous fusion, which hinders the learning of high-quality SIDs, and (2) non-uniformly distributed embeddings lead to codeword imbalance and generation bias. To address these challenges, we propose a novel generative recommendation framework called CARD. CARD introduces a visual semantic unit that unifies textual, visual, and collaborative signals into a structured visual representation prior to encoding. This enables holistic semantic modeling and alleviates the semantic gap, thereby reducing the reliance on supervision signals during SID learning. Furthermore, to handle the highly non-uniform distribution of item semantic embeddings in recommendation scenarios, we develop a non-uniform quantization framework (NU-RQ-VAE), which incorporates a learnable and invertible non-uniform transformation into the quantization process to map skewed semantic distributions into a more balanced latent space, thereby significantly improving codebook utilization and quantization accuracy. Experiments on multiple datasets show that CARD consistently outperforms baseline methods under various settings; moreover, the proposed non-uniform transformation module is plug-and-play and remains robust across different quantization schemes. Code is available at https://github.com/HAI-UESTC/CARD.
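To make the idea of quantizing in a transformed space concrete, the following is a minimal sketch and not the NU-RQ-VAE implementation. The abstract does not specify the form of the transformation, so the sketch assumes a fixed mu-law-style companding curve in place of the learnable one, one-dimensional toy data in place of item embeddings, and uniform grids in place of learned codebooks. All names (`forward`, `inverse`, `residual_quantize`, `usage_entropy`) and the values of `K` and `ALPHA` are hypothetical.

```python
import numpy as np

def forward(x, alpha):
    # Assumed stand-in for the learnable transform: mu-law-style companding
    # on [0, 1]. It is strictly monotonic, hence invertible.
    return np.log1p(alpha * x) / np.log1p(alpha)

def inverse(z, alpha):
    # Exact inverse of forward().
    return np.expm1(z * np.log1p(alpha)) / alpha

def residual_quantize(z, codebooks):
    # Greedy residual quantization: each level picks the codeword nearest
    # to whatever the previous levels left unexplained.
    residual = z.copy()
    recon = np.zeros_like(z)
    ids = []
    for cb in codebooks:
        idx = np.abs(residual[:, None] - cb[None, :]).argmin(axis=1)
        ids.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return np.stack(ids, axis=1), recon

def usage_entropy(idx, k):
    # Entropy (in nats) of codeword usage; log(k) means perfectly balanced.
    p = np.bincount(idx, minlength=k) / idx.size
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

K, ALPHA = 16, 50.0  # hypothetical codebook size and companding strength

rng = np.random.default_rng(0)
x = rng.exponential(size=5000)
x = x / x.max()  # skewed toy "embeddings" in [0, 1], most mass near 0

# Two-level codebooks: a uniform grid on [0, 1], then a finer grid that
# covers the residual range of one first-level cell.
level1 = (np.arange(K) + 0.5) / K
level2 = (np.arange(K) + 0.5) / K**2 - 0.5 / K
codebooks = [level1, level2]

# Quantize directly, and quantize in the transformed space.
sids_plain, recon_plain = residual_quantize(x, codebooks)
sids_nu, recon_z = residual_quantize(forward(x, ALPHA), codebooks)
recon_nu = inverse(recon_z, ALPHA)  # map the reconstruction back

h_plain = usage_entropy(sids_plain[:, 0], K)
h_nu = usage_entropy(sids_nu[:, 0], K)
print(f"first-level usage entropy: plain={h_plain:.2f}, "
      f"transformed={h_nu:.2f}, max={np.log(K):.2f}")
```

Each row of `sids_nu` is a two-token ID of the kind an autoregressive generator would predict. On this skewed toy data the first-level usage entropy is higher in the transformed space, which is the balancing effect the abstract describes; whether reconstruction error also improves depends on the transform and the data, and the sketch makes no claim about it.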
