Generative recommendation (GR) has emerged as a promising paradigm that replaces fragmented, scenario-specific architectures with unified Transformer-based models, exhibiting scaling-law behavior in which recommendation quality improves systematically with model capacity and training data. However, deploying GR at scale faces fundamental system-level challenges. These challenges are exacerbated on Ascend NPUs by the absence of high-performance implementations of jagged operators and by the architectural mismatch between irregular sparse primitives and the NPU's design, which is optimized for dense computation. In this paper, we present \model, an Ascend-affinity training system for generative recommendation that addresses these bottlenecks through three core innovations: (i) Ascend-affinity jagged acceleration, including fusion operators that eliminate padding redundancy and dynamic load balancing that reduces inter-device imbalance from 47\% to 2.4\%; (ii) distributed communication optimization, comprising hierarchical sparse parallelism, semi-asynchronous training with proven convergence guarantees, and fine-grained pipeline orchestration that sustains 94\% NPU utilization; and (iii) negative sampling optimization via asynchronous offloading, jaggedness-aware FP16 quantization, and intra-batch logit sharing, which together expand the effective negative space without additional embedding lookups. Evaluated on the KuaiRand-27K dataset, \model supports training at up to 0.2B parameters and achieves 54.71\% model FLOPs utilization (MFU) with near-linear scalability (0.97).
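To make the terms "jagged", "padding redundancy", and "inter-device imbalance" concrete, the following is a minimal illustrative sketch in plain Python. It is not the fused operators or the dynamic load-balancing algorithm of \model, which the abstract does not specify. The function names (`to_jagged`, `padding_waste`, `balance`, `imbalance`), the greedy longest-first heuristic, and the imbalance metric (maximum load over mean load, minus one) are all assumptions chosen for illustration.

```python
import heapq

def to_jagged(sequences):
    """Flatten variable-length sequences into (values, offsets) with no padding.

    Sequence i occupies values[offsets[i]:offsets[i + 1]].
    """
    values, offsets = [], [0]
    for seq in sequences:
        values.extend(seq)
        offsets.append(len(values))
    return values, offsets

def padding_waste(lengths):
    """Fraction of a dense [batch, max_len] layout that would be padding."""
    dense_slots = len(lengths) * max(lengths)
    return 1.0 - sum(lengths) / dense_slots

def balance(lengths, num_devices):
    """Hypothetical balancer: assign sequences, longest first, to the
    currently least-loaded device (a classic greedy heuristic)."""
    heap = [(0, d) for d in range(num_devices)]  # (token load, device id)
    assignment = [[] for _ in range(num_devices)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, dev = heapq.heappop(heap)
        assignment[dev].append(idx)
        heapq.heappush(heap, (load + lengths[idx], dev))
    return assignment

def imbalance(lengths, assignment):
    """Assumed metric: (max device load - mean load) / mean load."""
    loads = [sum(lengths[i] for i in dev) for dev in assignment]
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean

# Toy batch of user-history lengths (made-up numbers, not from the paper).
lengths = [12, 10, 9, 8, 3, 2, 2, 2]
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]   # contiguous split across 2 devices
greedy = balance(lengths, num_devices=2)
print(padding_waste(lengths), imbalance(lengths, naive), imbalance(lengths, greedy))
```

In this toy batch, half of a padded dense layout would be wasted slots, and a contiguous split leaves one device with far more tokens than the other, whereas the token-aware assignment evens out the per-device load. A real system would balance on estimated compute cost (attention scales superlinearly with sequence length) rather than raw token counts.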