Large language models have recently shown promise for multimodal recommendation, particularly with text and image inputs. Yet real-world recommendation signals extend far beyond these modalities. To reflect this, we formalize recommendation features into four modalities: text, images, categorical features, and numerical attributes, and highlight the distinct challenges this heterogeneity poses for LLMs' understanding of multimodal information. These challenges arise not only across modalities but also within them: attributes such as price, rating, and time may all be numeric yet carry distinct semantic meanings. Beyond this intra-modality ambiguity, a further challenge is the nested structure of recommendation signals, where user histories are sequences of items, each associated with multiple attributes. To address these challenges, we propose UniRec, a unified multimodal encoder for LLM-based recommendation. UniRec first employs modality-specific encoders to produce consistent embeddings across heterogeneous signals. It then adopts a triplet representation, comprising attribute name, type, and value, to separate schema from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former models the nested structure of user interactions while maintaining their layered organization. Across multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and LLM-based recommenders by up to 15%, and extensive ablation studies further validate the contribution of each component.
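To make the triplet representation concrete, the sketch below shows how raw item attributes can be decoupled into (name, type, value) triples, with the schema (name plus modality tag) held separately from the raw value. This is a minimal illustration under our own assumptions; `AttributeTriplet`, `to_triplets`, and the schema layout are hypothetical names for exposition, not UniRec's actual API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical modality tags matching the four modalities in the abstract.
MODALITIES = {"text", "image", "categorical", "numerical"}

@dataclass
class AttributeTriplet:
    name: str   # schema: attribute name, e.g. "price"
    type: str   # modality tag, e.g. "numerical"
    value: Any  # raw input value, kept separate from the schema

def to_triplets(item: Dict[str, Any], schema: Dict[str, str]) -> List[AttributeTriplet]:
    """Split a raw item dict into (name, type, value) triplets.

    The schema maps each attribute name to its modality tag, so that
    semantically different numeric fields (price vs. rating) remain
    distinguishable by name even though they share a type.
    """
    for modality in schema.values():
        assert modality in MODALITIES, f"unknown modality: {modality}"
    return [
        AttributeTriplet(name, schema[name], value)
        for name, value in item.items()
        if name in schema
    ]

# Toy example: one item from a user's interaction history.
schema = {"title": "text", "price": "numerical",
          "rating": "numerical", "brand": "categorical"}
item = {"title": "Wireless mouse", "price": 19.99,
        "rating": 4.5, "brand": "Acme"}
triplets = to_triplets(item, schema)
```

A user history would then be a list of such triplet lists, one per interacted item, which is the nested (sequence-of-items, attributes-per-item) structure the hierarchical Q-Former is designed to model.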