Large Language Models (LLMs) have shown remarkable capabilities in various domains, yet their economic impact has been limited by challenges in tool use and function calling. This paper introduces ThorV2, a novel architecture that significantly enhances LLMs' function calling abilities. We develop a comprehensive benchmark focused on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic. Our results demonstrate that ThorV2 outperforms existing models in accuracy, reliability, latency, and cost efficiency for both single and multi-API calling tasks. We also show that ThorV2 is far more reliable than traditional models and scales better to multistep tasks. Our work offers the tantalizing possibility of achieving more accurate function calling than today's best-performing models while using significantly smaller LLMs. These advancements have significant implications for the development of more capable AI assistants and the broader application of LLMs in real-world scenarios.