TF-DCon: Leveraging Large Language Models (LLMs) to Empower Training-Free Dataset Condensation for Content-Based Recommendation

Modern techniques in Content-based Recommendation (CBR) leverage item content information to provide personalized services to users, but suffer from resource-intensive training on large datasets. To address this issue, we explore dataset condensation for textual CBR in this paper. The goal of dataset condensation is to synthesize a small yet informative dataset on which models can achieve performance comparable to that of models trained on the large original dataset. Existing condensation approaches are tailored to classification tasks on continuous data such as images or embeddings, so applying them directly to CBR has limitations. To bridge this gap, we investigate efficient dataset condensation for content-based recommendation. Inspired by the remarkable abilities of large language models (LLMs) in text comprehension and generation, we leverage LLMs to generate textual content during condensation. To handle interaction data involving both users and items, we devise a dual-level condensation method that operates at the content level and the user level. At the content level, we use LLMs to condense all contents of an item into a new informative title. At the user level, we design a clustering-based synthesis module: we first use LLMs to extract user interests, then combine the user interests with user embeddings to condense users and generate interactions for the condensed users. Notably, the condensation paradigm of this method is forward and free from iterative optimization on the synthesized dataset. Extensive experiments on three real-world datasets demonstrate the efficacy of the proposed method. In particular, on the MIND dataset, we approximate up to 97% of the original performance while reducing the dataset size by 95%.
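
The dual-level idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify prompts, the clustering algorithm, or how interactions are generated. The sketch assumes a caller-supplied `llm` callable (a placeholder for any text-generation model), assumes plain k-means over user embeddings, and assumes that a condensed user simply inherits the most-clicked items of its cluster. All function names and the item fields `title`, `abstract`, and `category` are hypothetical. LLM-extracted user interests are omitted here; they could, for instance, be encoded and concatenated to the embeddings before clustering.

```python
import random
from collections import Counter


def condense_title(item, llm):
    """Content level: ask an LLM to compress all item fields into one title.

    `llm` is a hypothetical callable mapping a prompt string to a string.
    """
    prompt = (
        "Condense the following item content into one informative title.\n"
        f"Title: {item['title']}\n"
        f"Abstract: {item['abstract']}\n"
        f"Category: {item['category']}"
    )
    return llm(prompt)


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def condense_users(embeddings, histories, k, top_n=2):
    """User level: one synthetic user per cluster.

    Assumption: each condensed user's interactions are the `top_n` items
    clicked most often by the real users in that cluster.
    """
    labels = kmeans(embeddings, k)
    condensed = []
    for c in range(k):
        clicks = Counter(
            item
            for hist, lab in zip(histories, labels)
            if lab == c
            for item in hist
        )
        condensed.append([item for item, _ in clicks.most_common(top_n)])
    return condensed
```

In this sketch both steps run once in a forward pass: `condense_title` is called per item and `condense_users` once per dataset, with no optimization loop over the synthesized data, which mirrors the training-free property the abstract emphasizes.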

