Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, and fine-tuning plays a key role in adapting LLMs to this task. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. Few-shot fine-tuning offers a promising way to address this challenge by quickly adapting LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, which aims to identify representative samples tailored to LLMs' few-shot fine-tuning. Coreset selection is closely related to the proposed task, but existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data. To tackle these issues, we introduce two objectives for data pruning in the context of LLM-based recommendation: 1) high accuracy, which aims to identify the influential samples that lead to high overall performance; and 2) high efficiency, which emphasizes keeping the cost of the data pruning process low. To pursue both objectives, we propose a novel data pruning method that efficiently identifies influential samples using two scores: an influence score and an effort score. The influence score estimates how removing a sample affects overall performance. To keep the cost of pruning low, we compute the influence score with a small surrogate model in place of the LLM. Because the surrogate model and the LLM may differ, we further propose an effort score that prioritizes samples that are hard specifically for the LLM. Empirical results on three real-world datasets validate the effectiveness of the proposed method. In particular, the proposed method surpasses full-data fine-tuning while using only 2% of the samples, reducing time costs by 97%.
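The selection step described above can be illustrated with a minimal sketch. It assumes that per-sample influence and effort scores have already been computed elsewhere (the abstract does not give their formulas), and it combines them with a weighted sum after min-max normalization. The combination rule, the function names `min_max_normalize` and `select_samples`, and the parameters `keep_ratio` and `effort_weight` are hypothetical choices made for illustration, not the method as specified by the authors.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_samples(
    influence: Sequence[float],
    effort: Sequence[float],
    keep_ratio: float = 0.02,
    effort_weight: float = 0.5,
) -> List[int]:
    """Return the indices of the samples to keep for few-shot fine-tuning.

    influence: per-sample influence scores (assumed to come from a small
        surrogate model).
    effort: per-sample effort scores (assumed to reflect how hard a sample
        is for the LLM).
    keep_ratio: fraction of the data to keep (hypothetical parameter).
    effort_weight: weight of the effort score in the combined score
        (hypothetical parameter; the weighted sum is an assumption).
    """
    if len(influence) != len(effort):
        raise ValueError("influence and effort must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not influence:
        return []

    # Put both scores on a comparable scale before combining them.
    influence_norm = min_max_normalize(influence)
    effort_norm = min_max_normalize(effort)
    combined = [i + effort_weight * e for i, e in zip(influence_norm, effort_norm)]

    # Keep at least one sample, then take the highest-scoring ones.
    budget = max(1, int(len(combined) * keep_ratio))
    ranked = sorted(range(len(combined)), key=lambda idx: combined[idx], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Toy scores for four samples; keep half of them.
    kept = select_samples(
        influence=[0.1, 0.9, 0.5, 0.2],
        effort=[0.0, 0.1, 1.0, 0.3],
        keep_ratio=0.5,
    )
    print(kept)
```

In this toy run, sample 1 is kept mainly for its high influence score and sample 2 mainly for its high effort score, which mirrors the intuition that the effort score can promote samples the surrogate model alone would rank lower.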