Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this absence may be attributed to the inherent noise, bias, and incompleteness of the raw user interaction data used in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents these issues by constructing a curated, pedagogical curriculum for the LLM. We provide strong, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform models trained on real data on downstream ranking tasks ($+130\%$ on Recall@100 for SASRec), demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliably scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.