Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deploying such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based, decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to $40\%$ while maintaining accuracy. Additionally, we present context compression techniques that reduce the input context length by up to $10\times$ with minimal loss of accuracy. Finally, we share practical lessons from optimizing the serving infrastructure to deploy such a system on GPUs at scale, serving millions of requests per second. Taken together, these optimizations increase our system's throughput by $10\times$ in a real-world deployment while meeting our quality bar.