Recent advancements in Large Language Models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation. While these models excel at general complex reasoning tasks, they still face challenges in mathematical problem-solving and logical reasoning. To address these limitations, researchers have explored function-calling abilities, allowing LLMs to execute provided functions and use their outputs for task completion. However, deploying large-scale LLMs for such specific tasks can be highly inefficient, given the computational cost of their training and inference stages. This study introduces a novel framework for training smaller language models in function calling, focusing on specific logical and mathematical reasoning tasks. The approach aims to improve the performance of small-scale models on these tasks through function calling, while ensuring a high level of accuracy. Our framework employs an agent that, given a problem and a set of callable functions, queries the LLM by injecting descriptions and examples of the usable functions into the prompt and managing their calls within a step-by-step reasoning chain. This process is used to create a dataset of correct and incorrect reasoning-chain chat completions generated by a large-scale LLM. The dataset is then used to train a smaller LLM with Reinforcement Learning from Human Feedback (RLHF), specifically via the Direct Preference Optimization (DPO) technique. Experimental results demonstrate that the proposed approach balances the trade-off between model size and performance, improving the function-calling ability of smaller models on reasoning tasks.
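The agent loop described above, which injects function descriptions into the prompt and manages calls within a step-by-step reasoning chain, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tool names (`add`, `multiply`), the `CALL name(args)` / `ANSWER:` protocol, and the `model` callable are all hypothetical stand-ins for the actual LLM interface.

```python
import json
import re

# Hypothetical callable functions exposed to the model (illustrative only).
def add(a, b):
    """Return the sum of two numbers."""
    return a + b

def multiply(a, b):
    """Return the product of two numbers."""
    return a * b

TOOLS = {"add": add, "multiply": multiply}

def build_prompt(problem, tools):
    """Inject a description and a usage example of each tool into the prompt."""
    descriptions = "\n".join(f"- {name}: {fn.__doc__}" for name, fn in tools.items())
    return (
        "You may call the following functions, one per step, as 'CALL name(args)':\n"
        f"{descriptions}\n"
        "Example: CALL add(2, 3)\n"
        "When finished, reply 'ANSWER: <result>'.\n"
        f"Problem: {problem}"
    )

def run_agent(problem, tools, model, max_steps=5):
    """Step-by-step chain: at each step the model either calls a tool or answers.

    `model` is any callable mapping a prompt string to a reply string; the
    transcript of replies and tool results is the reasoning chain.
    """
    transcript = [build_prompt(problem, tools)]
    for _ in range(max_steps):
        reply = model("\n".join(transcript))
        transcript.append(reply)
        match = re.match(r"CALL (\w+)\((.*)\)", reply)
        if match:
            name, raw_args = match.groups()
            args = json.loads(f"[{raw_args}]")  # parse comma-separated literals
            transcript.append(f"RESULT: {tools[name](*args)}")
        elif reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip(), transcript
    return None, transcript

# Usage with a scripted "model" standing in for the LLM:
script = iter(["CALL add(2, 3)", "CALL multiply(5, 4)", "ANSWER: 20"])
answer, chain = run_agent("What is (2+3)*4?", TOOLS, lambda prompt: next(script))
# answer == "20"; chain contains "RESULT: 5" and "RESULT: 20"
```

Collecting such transcripts, labeled by whether the final answer is correct, is one way the correct/incorrect reasoning-chain dataset could be assembled.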
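As context for the DPO training step, the standard per-pair DPO loss compares the policy's log-probability margin between the chosen (correct) and rejected (incorrect) chain against a frozen reference model. The sketch below shows only the loss for a single preference pair; the log-probability values and the β setting are illustrative, not from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) completion pair.

    loss = -log sigmoid(beta * ((logpi(y_w) - logref(y_w))
                                - (logpi(y_l) - logref(y_l))))
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid)

# When the policy prefers the chosen chain more than the reference does,
# the loss drops below log(2) (the value at a zero margin).
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)
```

In practice this loss is averaged over mini-batches of preference pairs drawn from the correct/incorrect reasoning-chain dataset.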