Training large language models (LLMs) to use external tools is a rapidly expanding field, with recent research focusing on synthetic data generation to address the shortage of available data. However, the absence of systematic data quality checks complicates the proper training and testing of models. To that end, we propose two approaches for assessing the reliability of data for training LLMs to use external tools. The first approach applies intuitive, human-defined correctness criteria. The second approach uses model-driven assessment with in-context evaluation. We conduct a thorough evaluation of data quality on two popular benchmarks, followed by an extrinsic evaluation that demonstrates the impact of data quality on model performance. Our results show that models trained on high-quality data outperform those trained on unvalidated data, even when trained on a smaller quantity of data. These findings empirically support the importance of assessing and ensuring the reliability of training data for tool-using LLMs.
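To make the first approach concrete, the sketch below shows what a rule-based validator applying human-defined correctness criteria to a single synthetic tool-use example might look like. The data schema ("query", "tools", "call"), the function name validate_example, and the specific checks are all illustrative assumptions rather than the paper's actual criteria; the second, model-driven approach would instead present the same example to an LLM judge in context and ask it to assess correctness.

```python
# A minimal sketch of the first approach: human-defined correctness criteria
# applied to one synthetic tool-use example. All field names and checks here
# are hypothetical, chosen only to illustrate the idea.

def validate_example(example: dict) -> list[str]:
    """Return human-readable reasons the example fails; empty list means it passes."""
    errors = []
    tools = {t["name"]: t for t in example.get("tools", [])}
    call = example.get("call")

    # Criterion 1: the target output must contain a well-formed tool call.
    if not isinstance(call, dict) or "name" not in call:
        return ["no parseable tool call"]

    # Criterion 2: the called tool must exist in the provided tool list.
    if call["name"] not in tools:
        return [f"unknown tool: {call['name']}"]

    spec = tools[call["name"]]
    args = call.get("arguments", {})

    # Criterion 3: all required parameters of the tool must be supplied.
    for param, props in spec.get("parameters", {}).items():
        if props.get("required") and param not in args:
            errors.append(f"missing required parameter: {param}")

    # Criterion 4: no hallucinated parameters outside the tool's schema.
    for param in args:
        if param not in spec.get("parameters", {}):
            errors.append(f"hallucinated parameter: {param}")

    return errors


if __name__ == "__main__":
    example = {
        "query": "What's the weather in Paris tomorrow?",
        "tools": [{
            "name": "get_weather",
            "parameters": {
                "location": {"required": True},
                "date": {"required": False},
            },
        }],
        "call": {"name": "get_weather", "arguments": {"location": "Paris"}},
    }
    print(validate_example(example) or "passes all criteria")
```

Checks of this kind are cheap and transparent, but they only catch violations that can be stated explicitly, which is what motivates complementing them with a model-driven, in-context assessment.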