Modern techniques in content-based recommendation (CBR) leverage item content information to provide personalized services to users, but they suffer from resource-intensive training on large datasets. To address this issue, we explore dataset condensation for textual CBR. The goal of dataset condensation is to synthesize a small yet informative dataset on which models can achieve performance comparable to models trained on the full dataset. However, existing condensation approaches are tailored to classification tasks on continuous data such as images or embeddings, and applying them directly to CBR has limitations. To bridge this gap, we investigate efficient dataset condensation for content-based recommendation. Inspired by the remarkable abilities of large language models (LLMs) in text comprehension and generation, we leverage LLMs to empower the generation of textual content during condensation. To handle interaction data involving both users and items, we devise a dual-level condensation method with a content level and a user level. At the content level, we use an LLM to condense all of an item's content into a new, informative title. At the user level, we design a clustering-based synthesis module: we first use an LLM to extract user interests, then incorporate these interests together with user embeddings to condense users and generate interactions for the condensed users. Notably, this condensation paradigm is purely forward and free from iterative optimization on the synthesized dataset. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed method. In particular, we approximate up to 97% of the original performance while reducing the dataset size by 95% (i.e., on the MIND dataset).
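To make the user-level condensation idea concrete, the following is a minimal sketch of clustering-based user synthesis: user embeddings are grouped with a simple k-means, each cluster is collapsed into one synthetic user (its centroid), and the cluster's most frequent items become that user's synthetic interactions. This is an illustration under assumed design choices only; the paper's actual module additionally incorporates LLM-extracted user interests, and the function name and top-5 cutoff here are hypothetical.

```python
import numpy as np

def condense_users(user_emb, interactions, k, iters=20, seed=0):
    """Illustrative sketch: k-means over user embeddings, then merge each
    cluster into one synthetic user with aggregated interactions.

    user_emb     : (n_users, dim) float array of user embeddings
    interactions : list of per-user item-id lists
    k            : number of condensed (synthetic) users
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct users (fancy indexing copies).
    centers = user_emb[rng.choice(len(user_emb), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every user to every center.
        dists = ((user_emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = user_emb[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    # Synthetic interactions: the items clicked most often within each cluster
    # (top-5 cutoff is an arbitrary choice for this sketch).
    cond_interactions = []
    for c in range(k):
        items = [i for u in np.flatnonzero(assign == c) for i in interactions[u]]
        vals, counts = np.unique(items, return_counts=True)
        cond_interactions.append(list(vals[np.argsort(-counts)][:5]))
    return centers, cond_interactions
```

Because this condensation is a single forward pass (cluster, average, aggregate), it avoids the iterative optimization on the synthesized dataset that gradient-matching-style condensation methods require.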