Sequential Data Augmentation for Generative Recommendation

Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation. Our code is available at https://github.com/snap-research/GenPAS.
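The three-step sampling view described above can be illustrated with a minimal Python sketch. The abstract does not state how each step's bias is parameterized, so the function name `sample_pair`, the exponents `alpha`, `beta`, and `gamma`, and the power-law weights are all hypothetical choices made for illustration; they should not be read as the GenPAS implementation, which is available in the linked repository.

```python
import random


def sample_pair(sequences, alpha=1.0, beta=0.0, gamma=0.0, rng=random):
    """Draw one (input, target) pair via three biased sampling steps.

    alpha, beta, gamma are assumed bias knobs, not taken from the paper.
    """
    # Only sequences with at least two items can yield an input and a target.
    usable = [s for s in sequences if len(s) >= 2]

    # Step 1: sequence sampling. alpha = 0 treats all users equally;
    # alpha > 0 favors longer histories.
    seq_weights = [len(s) ** alpha for s in usable]
    seq = rng.choices(usable, weights=seq_weights, k=1)[0]

    # Step 2: target sampling over positions 1..len(seq)-1.
    # beta = 0 is uniform; beta > 0 favors later (more recent) targets.
    positions = list(range(1, len(seq)))
    pos_weights = [p ** beta for p in positions]
    t = rng.choices(positions, weights=pos_weights, k=1)[0]

    # Step 3: input sampling. Choose where the contiguous input ending just
    # before the target begins; gamma > 0 favors longer inputs.
    starts = list(range(t))
    start_weights = [(t - s) ** gamma for s in starts]
    start = rng.choices(starts, weights=start_weights, k=1)[0]

    return seq[start:t], seq[t]


if __name__ == "__main__":
    histories = [[1, 2, 3, 4, 5], [10, 11], [20]]
    rng = random.Random(0)
    for _ in range(3):
        print(sample_pair(histories, alpha=1.0, beta=1.0, gamma=1.0, rng=rng))
```

Under these assumed weights, a large `beta` concentrates targets near the end of each sequence and a large `gamma` concentrates inputs on the full prefix, which is one way that fixed strategies could appear as limiting cases of a single sampler.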
