Large language models (LLMs) have achieved remarkable advances by leveraging tools to interact with external environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale datasets, often overlooks task-specific characteristics of tool use, leading to performance bottlenecks. To address this issue, we analyze three existing LLMs and uncover key insights: training data can inadvertently impede tool-use behavior, token importance is distributed unevenly, and errors in tool calls fall into a small set of distinct categories. Building on these findings, we propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data, dynamically adjusts token weights to prioritize key tokens during SFT, and incorporates a robust reward mechanism tailored to error categories, optimized through proximal policy optimization. We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four diverse open-source test sets. Our results demonstrate that the LLM trained with our method matches or surpasses both open- and closed-source LLMs in tool-use performance using only 1,217 training data points. Additionally, our method enhances robustness in noisy environments and improves general task performance, offering a scalable and efficient paradigm for tool-use training in LLMs. The code and data are available at https://github.com/Junjie-Ye/TL-Training.
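The token-weighting idea mentioned above can be illustrated with a minimal sketch of a per-token weighted cross-entropy loss: each position in the sequence carries its own weight, so "key" tokens (e.g., tool names or argument values) can contribute more to the training signal than boilerplate tokens. Note that the weighting scheme and function below are illustrative assumptions, not the exact formulation used by TL-Training.

```python
import numpy as np

def weighted_token_ce(logits, targets, weights):
    """Per-token weighted cross-entropy over a sequence.

    logits:  (seq_len, vocab_size) array of unnormalized scores
    targets: (seq_len,) array of gold token ids
    weights: (seq_len,) array of per-token importance weights
             (higher weight = token matters more; the assignment of
             weights to "key" tokens is an illustrative assumption)
    """
    # Numerically stable softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Negative log-likelihood of each gold token
    nll = -np.log(probs[np.arange(len(targets)), targets])
    # Weighted average: key tokens dominate the loss
    return float((weights * nll).sum() / weights.sum())

# Toy example: two positions over a 3-token vocabulary.
logits = np.array([[2.0, 0.0, 0.0],   # position 0 predicts token 0 well
                   [2.0, 0.0, 0.0]])  # position 1 also favors token 0
targets = np.array([0, 2])            # but the gold token at position 1 is 2

uniform = weighted_token_ce(logits, targets, np.array([1.0, 1.0]))
keyed = weighted_token_ce(logits, targets, np.array([1.0, 2.0]))
# Upweighting the mispredicted "key" token at position 1 raises the loss,
# pushing gradient updates toward fixing it.
```

In a real SFT loop the same effect is usually achieved by multiplying the unreduced token losses by a weight mask before averaging, rather than by a custom loss function.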