Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Zhaoxia Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Satish Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh, Mikhail Smelyanskiy

The tremendous success of machine learning (ML) and the unabated growth in ML model complexity have motivated many ML-specific designs in both CPU and accelerator architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. These architectures often do exhibit impressive compute throughput on benchmark ML models. Nevertheless, production models such as the recommendation systems important to Facebook's personalization services are demanding and complex: these systems must serve billions of users per month responsively with low latency while maintaining high prediction accuracy, notwithstanding computations with many tens of billions of parameters per inference. Do these low-precision architectures work well with our production recommendation systems? They do, but not without significant effort. In this paper we share our search strategies for adapting reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the design and development of a tool chain that maintains our models' accuracy throughout their lifespan, during which topic trends and users' interests inevitably evolve. Practicing these low-precision technologies helped us save datacenter capacity while deploying models of up to 5X complexity that would otherwise not be deployed on traditional general-purpose CPUs. We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering and advance the state of the art of ML in industry.

