Many companies rely on APIs of managed AI models such as OpenAI's GPT-4 to create AI-enabled experiences in their products. Along with the benefits of ease of use and shortened time to production, this reliance on proprietary APIs has downsides in terms of model control, performance reliability, uptime predictability, and cost. At the same time, there has been a flurry of open-source small language models (SLMs) made available for commercial use. However, their readiness to replace existing capabilities remains unclear, and a systematic approach to testing these models is not readily available. In this paper, we present a systematic evaluation methodology for, and characterization of, modern open-source SLMs and their trade-offs when replacing a proprietary LLM API for a real-world product feature. We have designed SLaM, an automated analysis tool that enables the quantitative and qualitative testing of product features utilizing arbitrary SLMs. Using SLaM, we examine both the quality and the performance characteristics of modern SLMs relative to an existing customer-facing OpenAI-based implementation. Across 9 SLMs and 29 variants, we observe competitive quality of results for our use case, a significant improvement in performance consistency, and a cost reduction of 5x-29x compared to OpenAI GPT-4.