Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this absence may be attributed to the inherent noise, bias, and incompleteness of the raw user-interaction data used in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents these issues by constructing a curated, pedagogical curriculum for the LLM. We provide direct evidence for the utility of this curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform models trained on real data in downstream ranking tasks ($+130\%$ recall@100 for SASRec), demonstrating its superiority for learning generalizable user-preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliably scaling LLM capabilities in the recommendation domain, shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.
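The power-law scaling referenced above can be made concrete with a small sketch: a power law $\mathrm{ppl}(D) = a \cdot D^{-b}$ is linear in log-log space, so its exponent can be recovered by linear regression. The token counts and perplexity values below are purely illustrative stand-ins, not results from this paper.

```python
import numpy as np

# Hypothetical power-law relation between CPT data size (tokens) and
# perplexity: ppl(D) = a * D^(-b). Values are illustrative only.
tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
a_true, b_true = 50.0, 0.08
ppl = a_true * tokens ** (-b_true)

# In log-log space the relation is linear: log ppl = log a - b * log D,
# so a degree-1 polynomial fit recovers the scaling exponent.
slope, intercept = np.polyfit(np.log(tokens), np.log(ppl), 1)
b_est, a_est = -slope, np.exp(intercept)
print(f"estimated exponent b = {b_est:.3f}, coefficient a = {a_est:.1f}")
```

In practice one would fit the measured perplexities at several CPT token budgets the same way; a good fit (high $R^2$ on the log-log regression) is what "robust power-law scaling" operationally means here.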