Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with respect to sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall: data infrastructure usage exceeds GPU training capacity because of data redundancy. This redundancy is amplified in multi-tenant environments, where models with vastly different sequence length requirements share a union dataset. We present a \emph{versioned late materialization} paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage in both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing, combined with pipelined I/O prefetching and data-affinity optimizations, masks the latency of training-time sequence reconstruction, so that training throughput remains bound by GPU compute. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains. It serves as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
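To make the core idea concrete, the following is a minimal sketch of late materialization through a versioned pointer. It is not the system's implementation: the names `Event`, `UihPointer`, `UihStore`, and `materialize` are hypothetical, and the use of a timestamp cutoff as the "version" is an assumption made for illustration. The sketch shows a training example carrying only a small pointer, the history being stored once per user, and each model tenant reconstructing a sequence of its own length and feature subset without seeing events later than the example's version.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    ts: int        # event timestamp (assumed monotonically non-decreasing per user)
    item_id: int
    action: str


@dataclass(frozen=True)
class UihPointer:
    """Lightweight reference stored in a training example instead of the full sequence."""
    user_id: int
    as_of_ts: int  # assumed version: only events with ts <= as_of_ts are visible


class UihStore:
    """Append-only per-user event log, stored once and shared by all examples."""

    def __init__(self) -> None:
        self._events: Dict[int, List[Event]] = {}

    def append(self, user_id: int, event: Event) -> None:
        log = self._events.setdefault(user_id, [])
        if log and event.ts < log[-1].ts:
            raise ValueError("events must be appended in timestamp order")
        log.append(event)

    def materialize(
        self, ptr: UihPointer, max_len: int, fields: Sequence[str]
    ) -> List[Tuple]:
        """Rebuild the sequence visible at ptr.as_of_ts, truncated and projected."""
        log = self._events.get(ptr.user_id, [])
        timestamps = [e.ts for e in log]
        end = bisect_right(timestamps, ptr.as_of_ts)  # excludes future events
        start = max(0, end - max_len)                 # keep the most recent max_len
        return [tuple(getattr(e, f) for f in fields) for e in log[start:end]]


store = UihStore()
for ts, item_id, action in [(10, 101, "click"), (20, 102, "view"),
                            (30, 103, "click"), (40, 104, "view")]:
    store.append(1, Event(ts, item_id, action))

# The training example was logged at ts=30, so the event at ts=40 must stay hidden.
ptr = UihPointer(user_id=1, as_of_ts=30)

# Two tenants share the same stored history but request different views of it.
short_view = store.materialize(ptr, max_len=2, fields=("item_id",))
long_view = store.materialize(ptr, max_len=10, fields=("ts", "action"))
```

In this sketch, the `as_of_ts` cutoff plays the role of leakage prevention, while `max_len` and `fields` stand in for the length and feature dimensions of projection pushdown. A real storage layer would apply these filters before reading data rather than after loading a full log into memory.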
