Discrete Preference Learning for Personalized Multimodal Generation

The emergence of generative models enables the creation of texts and images tailored to users' preferences. Existing personalized generative models have two critical limitations: they lack a dedicated paradigm for accurate preference modeling, and they generate unimodal content even though real-world user interactions are driven by multiple modalities. Therefore, we propose personalized multimodal generation, which captures modal-specific preferences from multimodal interactions via a dedicated preference model and then feeds them into downstream generators to produce personalized multimodal content. However, this task presents two challenges: (1) a gap between the continuous preferences produced by dedicated modeling and the discrete token inputs intrinsic to generator architectures; (2) potential inconsistency between generated images and texts. To tackle these, we present a two-stage framework called Discrete Preference learning for Personalized Multimodal Generation (DPPMG). In the first stage, to accurately learn discrete modal-specific preferences, we introduce a modal-specific graph neural network (a dedicated preference model) to learn users' modal-specific preferences, which are then quantized into discrete preference tokens. In the second stage, the discrete modal-specific preference tokens are injected into downstream text and image generators. To further enhance cross-modal consistency while preserving personalization, we design a cross-modal consistent and personalized reward to fine-tune the token-associated parameters. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model in generating personalized and consistent multimodal content.
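
As a rough illustration of the quantization step in the first stage, the sketch below maps a continuous preference vector to the index of its nearest entry in a codebook, so that the index can serve as a discrete preference token. The abstract does not specify the quantization scheme; the nearest-neighbor lookup, the function name `quantize`, and the toy codebook values are assumptions made purely for illustration, not the authors' implementation.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def quantize(preference, codebook):
    """Return (token_id, code_vector) for the codebook entry nearest to `preference`.

    Assumption: nearest-neighbor lookup under Euclidean distance. The paper's
    actual quantizer may differ.
    """
    token_id = min(range(len(codebook)), key=lambda i: dist(preference, codebook[i]))
    return token_id, codebook[token_id]


# Hypothetical 2-D codebook with three entries (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical continuous preference vector, e.g. an output of the preference model.
user_preference = (0.9, 0.2)

token_id, code_vector = quantize(user_preference, codebook)
print(token_id, code_vector)
```

In this toy setup the returned `token_id` plays the role of a discrete preference token that a downstream generator could consume, while `code_vector` is the quantized stand-in for the original continuous preference.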
