While multimodal recommendation models have effectively integrated visual and textual information, their reliance on unique ID embeddings remains a fundamental performance bottleneck. Specifically, ID-based paradigms suffer from three limitations: (1) \textbf{Information Isolation}, where unique IDs prevent semantic information from being shared among related items; (2) \textbf{Cold-Start Vulnerability}, as ID embeddings are difficult to optimize when interactions are sparse; and (3) \textbf{Storage Inefficiency}, where the number of parameters grows linearly with the number of items. To overcome these limitations, we propose \textbf{MOTOR}, a novel \textbf{ID-free MultimOdal TOken Representation} scheme. MOTOR replaces explicit item IDs with learnable multimodal tokens that are shared across items, turning the recommender into an ID-free framework. Methodologically, we first apply product quantization to discretize raw multimodal features into compact token IDs. These tokens serve as implicit item features and are then combined by a novel \textbf{Token Cross Network (TCN)} to capture high-order interaction patterns. This "discretize-and-interact" mechanism enables semantic sharing across items and significantly reduces model size without introducing complex auxiliary losses. Extensive experiments on nine mainstream models show that MOTOR significantly improves their performance. MOTOR also improves the ability of these models to recommend items in cold-start scenarios.
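The discretization step can be illustrated with a minimal sketch of generic product quantization. This is not MOTOR's actual implementation: the function names (`kmeans`, `product_quantize`), the plain Lloyd's k-means, the random stand-in features, and all sizes (`num_subspaces`, `codebook_size`, `emb_dim`) are assumptions chosen for illustration. The Token Cross Network is not sketched because the abstract does not describe its architecture.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (illustrative); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dist.argmin(axis=1)

def product_quantize(features, num_subspaces, codebook_size, seed=0):
    """Split each feature vector into sub-vectors and quantize each one.

    Returns token_ids of shape (n_items, num_subspaces) and codebooks of
    shape (num_subspaces, codebook_size, sub_dim).
    """
    n_items, dim = features.shape
    assert dim % num_subspaces == 0, "dim must be divisible by num_subspaces"
    sub_dim = dim // num_subspaces
    token_ids = np.empty((n_items, num_subspaces), dtype=np.int64)
    codebooks = []
    for m in range(num_subspaces):
        chunk = features[:, m * sub_dim:(m + 1) * sub_dim]
        centroids, assign = kmeans(chunk, codebook_size, seed=seed + m)
        codebooks.append(centroids)
        token_ids[:, m] = assign  # one discrete token per subspace
    return token_ids, np.stack(codebooks)

# Hypothetical stand-in for raw multimodal item features (200 items, 16 dims).
rng = np.random.default_rng(0)
item_features = rng.normal(size=(200, 16)).astype(np.float32)
token_ids, codebooks = product_quantize(
    item_features, num_subspaces=4, codebook_size=8
)

# Illustrative parameter counts for an assumed embedding width of 32:
emb_dim = 32
id_table_params = item_features.shape[0] * emb_dim  # grows with item count
token_table_params = 4 * 8 * emb_dim                # fixed by the codebook sizes
```

In this sketch each item is described by four token IDs drawn from small shared codebooks, so items with similar features can share tokens, and the size of a token embedding table depends on the codebook sizes rather than on the number of items.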