While Transformers have achieved remarkable success in large language models (LLMs) through superior scalability, their application to industrial-scale ranking models remains nascent, hindered by the challenges of high feature sparsity and low label density. In this paper, we propose SORT (Systematically Optimized Ranking Transformer), a scalable model designed to bridge the gap between Transformers and industrial-scale ranking models. We address the high feature sparsity and low label density challenges through a series of optimizations, including request-centric sample organization, local attention, query pruning, and generative pre-training. Furthermore, we introduce a suite of refinements to the tokenization, multi-head attention (MHA), and feed-forward network (FFN) modules, which collectively stabilize training and expand model capacity. To maximize hardware efficiency, we optimize our training system, raising model FLOPs utilization (MFU) to 22%. Extensive experiments demonstrate that SORT outperforms strong baselines and scales well with data size, model size, and sequence length, while remaining flexible in integrating diverse features. Finally, online A/B testing in large-scale e-commerce scenarios confirms that SORT delivers significant gains in key business metrics, including orders (+6.35%), buyers (+5.97%), and gross merchandise value (GMV, +5.47%), while nearly halving latency (-44.67%) and more than doubling throughput (+121.33%).
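To make the "request-centric sample organization with local attention" idea concrete, the following is a minimal sketch, not the authors' implementation: under the assumption that candidate items from one request are packed into a shared sequence and may attend only to items of the same request, attention reduces to a block-diagonal mask. The function name `request_local_attention` and the `request_ids` grouping tensor are illustrative assumptions.

```python
# Minimal sketch of request-local attention (assumed reading of SORT's
# request-centric sample organization + local attention; not official code).
import torch
import torch.nn.functional as F

def request_local_attention(q, k, v, request_ids):
    """q, k, v: (batch, seq, dim); request_ids: (batch, seq) integers
    grouping each token (candidate item) by the request it came from."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, seq, seq)
    # Block-diagonal mask: a token attends only within its own request.
    same_request = request_ids.unsqueeze(-1) == request_ids.unsqueeze(-2)
    scores = scores.masked_fill(~same_request, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Usage: two requests (3 items and 2 items) packed into one sequence.
q = k = v = torch.randn(1, 5, 8)
request_ids = torch.tensor([[0, 0, 0, 1, 1]])
out = request_local_attention(q, k, v, request_ids)       # (1, 5, 8)
```

Besides matching the ranking semantics (items compete within a request, not across requests), such a mask makes attention cost grow with the request size rather than the packed sequence length, which is one plausible source of the reported efficiency gains.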