Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommender systems remains challenging in industrial settings. This difficulty arises from differences in training paradigms, architectural patterns, and computational characteristics, which lead to distinct numerical behaviors in weights and activations. Traditional recommender models often exhibit high-magnitude, high-variance weights and activations, making them more sensitive to quantization-induced perturbations. In addition, recommendation workloads frequently suffer from low hardware utilization, which limits the practical gains of low-precision computation. In this work, we revisit low-precision inference in the context of generative recommendation. Through empirical distribution analysis, we show that the weight and activation statistics of OneRec-V2 are far better controlled than those of traditional recommendation models and much closer to those of large language models. Moreover, OneRec-V2 exhibits a more compute-intensive inference pattern with substantially higher hardware utilization, enabling greater end-to-end throughput gains from low-precision computation. Leveraging this property, we develop an FP8 post-training quantization framework and integrate it into an optimized inference infrastructure. The proposed joint optimization achieves a 49\% reduction in end-to-end inference latency and a 92\% increase in throughput. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics. These results suggest that as recommender systems evolve toward the paradigms of large language models, algorithm-level and system-level optimization techniques established in the LLM domain can be effectively adapted to large-scale recommendation workloads.
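To make the FP8 post-training quantization idea concrete, the sketch below simulates per-tensor symmetric quantization to the standard E4M3 format (4 exponent bits, 3 mantissa bits, max normal value 448) in NumPy. This is a minimal illustration of the general technique, not the paper's actual framework: the function names and the fake-quantization approach (round to the nearest representable E4M3 value, then dequantize) are assumptions made for this example.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def round_to_e4m3(x: np.ndarray) -> np.ndarray:
    """Round each element to the nearest representable FP8 E4M3 value
    (simulated in float64; handles subnormals via the exponent clip)."""
    sign = np.sign(x)
    a = np.minimum(np.abs(x), E4M3_MAX)          # saturate to max normal
    # Exponent of each magnitude; clip to the E4M3 normal/subnormal range.
    e = np.floor(np.log2(np.maximum(a, 2.0**-9)))
    e = np.clip(e, -6, 8)                        # e = -6 also covers subnormals
    quantum = 2.0 ** (e - 3)                     # spacing with 3 mantissa bits
    return sign * np.round(a / quantum) * quantum

def fp8_quant_dequant(w: np.ndarray) -> np.ndarray:
    """Per-tensor symmetric PTQ: scale so amax maps to E4M3_MAX,
    round to E4M3, then rescale back (fake quantization)."""
    amax = np.max(np.abs(w))
    if amax == 0.0:
        return w.copy()
    scale = E4M3_MAX / amax
    return round_to_e4m3(w * scale) / scale
```

A weight tensor quantized this way incurs at most ~2^-4 relative error on normal-range values, which is the kind of perturbation that well-controlled, LLM-like weight distributions tolerate and heavy-tailed ones do not; per-channel or per-block scales are the usual refinement when a single per-tensor scale is too coarse.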