Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deploying such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based, decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to $40\%$ while maintaining accuracy. Additionally, we present context compression techniques that reduce the input context length by up to $10\times$ with minimal loss of accuracy. Finally, we share practical lessons from optimizing the serving infrastructure to deploy such a system on GPUs at scale, serving millions of requests per second. Taken together, these optimizations increase our system's throughput by $10\times$ in a real-world deployment while meeting our quality bar.