Large language models have recently shown promise for multimodal recommendation, particularly with text and image inputs. Yet real-world recommendation signals extend far beyond these two modalities. To reflect this, we formalize recommendation features into four modalities (text, images, categorical features, and numerical attributes) and highlight the unique challenges this heterogeneity poses for LLMs. In particular, these challenges arise not only across modalities but also within them: attributes such as price, rating, and time may all be numeric yet carry distinct semantic meanings. Beyond this intra-modality ambiguity, a second major challenge is the nested structure of recommendation signals, where user histories are sequences of items, each associated with multiple attributes. To address these challenges, we propose UniRec, a unified multimodal encoder for LLM-based recommendation. UniRec first employs modality-specific encoders to produce consistent embeddings across heterogeneous signals. It then adopts a triplet representation, comprising attribute name, type, and value, to separate schema from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former models the nested structure of user interactions while maintaining their layered organization. Across multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and LLM-based recommenders by up to 15%, and extensive ablation studies further validate the contribution of each component.
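The triplet representation described above can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the names (`AttributeTriplet`, `encode_triplet`, `EMBED_DIM`) and the stand-in encoders are assumptions; the point is only that keeping the schema (name and type) separate from the raw value lets two attributes with identical numeric values, such as a price of 4.5 and a rating of 4.5, map to distinct embeddings.

```python
import hashlib
from dataclasses import dataclass
from typing import Union

EMBED_DIM = 8  # toy embedding size for this sketch


@dataclass
class AttributeTriplet:
    name: str                 # schema: attribute name, e.g. "price"
    type: str                 # schema: modality tag, e.g. "numerical"
    value: Union[str, float]  # raw value, e.g. 4.5


def toy_text_encoder(text: str) -> list[float]:
    # Deterministic stand-in for a modality-specific text encoder
    # (the paper would use a learned encoder here).
    digest = hashlib.md5(text.encode()).digest()
    return [b / 255.0 for b in digest[:EMBED_DIM]]


def toy_numeric_encoder(x: float) -> list[float]:
    # Stand-in for a numerical encoder projecting a scalar to EMBED_DIM.
    return [(x * (i + 1)) % 1.0 for i in range(EMBED_DIM)]


def encode_triplet(t: AttributeTriplet) -> list[float]:
    # Schema embedding from name+type; value embedding from the
    # matching modality encoder; combined into one vector.
    schema = toy_text_encoder(f"{t.name}:{t.type}")
    if t.type == "numerical":
        value = toy_numeric_encoder(float(t.value))
    else:
        value = toy_text_encoder(str(t.value))
    return [s + v for s, v in zip(schema, value)]


# Same numeric value, different schemas -> different embeddings,
# preserving the intra-modality semantic distinction.
price = AttributeTriplet("price", "numerical", 4.5)
rating = AttributeTriplet("rating", "numerical", 4.5)
```

Because the schema term differs even when the value term is identical, the encoder keeps "price 4.5" and "rating 4.5" apart, which is exactly the intra-modality ambiguity the abstract motivates.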