Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, such encoders are optimized primarily for well-formed natural language, whereas item descriptions in real-world recommendation data are often symbolic and attribute-centric, containing numerals, units, and abbreviations. Standard text encoders can break these signals into fragmented tokens, weakening semantic coherence and distorting the relationships among attributes. Worse still, in multimodal GR, relying on standard text encoders introduces a further obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating text as a visual signal. We conduct a systematic empirical study of OCR-based text representations, obtained by rendering item descriptions into images and encoding them with vision-based OCR models. Experiments across four datasets and two generative backbones show that OCR-text consistently matches or surpasses standard text embeddings for Semantic ID learning in both unimodal and multimodal settings. Furthermore, we find that OCR-based Semantic IDs remain accurate under extreme spatial-resolution compression, indicating strong robustness and efficiency for practical deployment.
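To make the pipeline concrete, the sketch below illustrates the three stages the abstract names: rendering an item description to an image, encoding it, and quantizing the embedding into a discrete Semantic ID. It is a minimal illustration, not the paper's implementation: `encode_image` is a hypothetical pixel-based stand-in for a real vision-based OCR encoder, the rendering layout (canvas size, naive wrapping) is illustrative, and residual k-means is used as one common coarse-to-fine quantization scheme for Semantic IDs; the paper's actual tokenizer may differ.

```python
# Minimal sketch of an OCR-text Semantic ID pipeline, under the assumptions above.
import numpy as np
from PIL import Image, ImageDraw
from sklearn.cluster import KMeans


def render_text(desc: str, size=(256, 256)) -> Image.Image:
    """Render an item description onto a blank canvas with naive line wrapping."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Crude wrapping: ~40 characters per line with PIL's default bitmap font.
    lines = [desc[i:i + 40] for i in range(0, len(desc), 40)]
    draw.multiline_text((8, 8), "\n".join(lines), fill="black")
    return img


def encode_image(img: Image.Image) -> np.ndarray:
    """Hypothetical stand-in for an OCR model's vision encoder: grayscale
    downsampling, so the sketch runs without model weights."""
    small = img.convert("L").resize((16, 16))
    return np.asarray(small, dtype=np.float32).ravel() / 255.0


def semantic_ids(embs: np.ndarray, levels: int = 3, codebook: int = 8) -> np.ndarray:
    """Residual k-means quantization: each level clusters the residual left by
    the previous level, yielding one code per level (a coarse-to-fine ID)."""
    residual, codes = embs.copy(), []
    for _ in range(levels):
        km = KMeans(n_clusters=codebook, n_init=10).fit(residual)
        codes.append(km.labels_)
        residual = residual - km.cluster_centers_[km.labels_]
    return np.stack(codes, axis=1)  # shape: (num_items, levels)


# Toy attribute-centric descriptions (numerals, units, abbreviations).
descs = [f"USB-C cable {n}m, {w}W PD, braided"
         for n in (1, 2, 3, 5) for w in (18, 30, 60, 100)]
embs = np.stack([encode_image(render_text(d)) for d in descs])
print(semantic_ids(embs)[:3])  # e.g. [[2 5 1] [0 3 7] [4 1 2]]
```

In a real deployment, `encode_image` would be replaced by the image tower of a pretrained OCR model, and the robustness result in the abstract suggests the rendered canvas can be aggressively downscaled before encoding with little loss in Semantic ID quality.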