Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this absence stems from the inherent noise, bias, and incompleteness of the raw user interaction data used in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents these issues by creating a curated, pedagogical curriculum for the LLM. We provide direct evidence for the utility of this curriculum: standard sequential models trained on our principled synthetic data significantly outperform models trained on real data in downstream ranking tasks ($+130\%$ on recall@100 for SASRec), indicating that the synthetic data is better suited to learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliably scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.
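To make the notion of power-law scaling concrete, the following minimal sketch fits a curve of the form $y \approx a \cdot x^{-b}$ by linear regression in log-log space. The functional form (with no irreducible-loss term), the function name `fit_power_law`, and the numbers are assumptions chosen for illustration; they are generated from a known power law and are not results from this work.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y ~ a * x**(-b) by least squares in log-log space.

    Hypothetical helper: taking logs gives log y = log a - b * log x,
    which is a straight line with slope -b and intercept log a.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Illustrative values only: generated from a known power law,
# not measurements from any experiment.
tokens = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9])
true_a, true_b = 50.0, 0.1
perplexity = true_a * tokens ** (-true_b)

a, b = fit_power_law(tokens, perplexity)
print(f"fitted a = {a:.4f}, fitted b = {b:.4f}")
```

On real measurements, points that fall close to a straight line in log-log space are what one would read as consistent with power-law scaling; a fit that includes an additive irreducible term would require a nonlinear solver instead.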