Language models have proven effective in a variety of software applications, particularly in tasks related to automated workflows. These models possess the crucial ability to call functions, which is essential for building AI agents. Although large-scale language models perform well in cloud environments, they often raise concerns over privacy and cost. Current on-device models for function calling struggle with both latency and accuracy. Our research presents a new method that enables an on-device model with 2 billion parameters to surpass GPT-4 in both accuracy and latency while decreasing the context length by 95\%. Compared with Llama-7B using a RAG-based function calling mechanism, our method improves latency 35-fold. It reduces latency to levels suitable for deployment across a variety of edge devices in production environments, meeting the performance requirements of real-world applications.