Large language models have demonstrated substantial value as autonomous agents when equipped with external tools and API calls. Nonetheless, effectively harnessing their potential for executing complex tasks depends crucially on improvements to their function calling capabilities. This paper identifies a critical gap in existing function calling models: performance varies significantly across benchmarks, often because models are misled by specific naming conventions. To address this issue, we introduce Hammer, a novel family of foundation models specifically engineered for on-device function calling. Hammer employs an augmented dataset that enhances models' sensitivity to irrelevant functions, and it incorporates function masking techniques to minimize the influence of misleading names. Our empirical evaluations show that Hammer not only outperforms larger models but also generalizes robustly across diverse benchmarks, achieving state-of-the-art results. Our open-source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function calling performance.