Language models have proven effective across a range of software applications, particularly in tasks related to workflow automation. These models can call functions, a capability that is essential for building AI agents. Although large-scale language models perform well in cloud environments, they raise concerns over privacy and cost. Current on-device models for function calling, meanwhile, struggle with latency and accuracy. Our research presents a new method that enables an on-device model with 2 billion parameters to surpass GPT-4 in both accuracy and latency while reducing the context length by 95\%. Compared with Llama-7B using a RAG-based function calling mechanism, our method improves latency 35-fold. This brings latency down to levels suitable for deployment across a variety of edge devices in production environments, in line with the performance requirements of real-world applications.