Efficient data selection is crucial for accelerating the pretraining of large language models (LLMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts among these approaches, which stand in the way of optimal data selection for LLM pretraining. To tackle this problem, we propose a novel multi-agent collaborative data selection mechanism. In this framework, each data selection method serves as an independent agent, and an agent console is designed to dynamically integrate the information from all agents throughout the LLM training process. We conduct extensive empirical studies to evaluate our multi-agent framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LLM training, and achieves an average performance gain of up to 10.5% across multiple language model benchmarks compared to state-of-the-art methods.
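To make the mechanism concrete, the following is a minimal sketch of the agent/console interaction described above, not the paper's actual implementation: the agent names (`quality`, `diversity`), the toy scoring functions, and the multiplicative weight-update rule in `AgentConsole.update` are all illustrative assumptions.

```python
"""Minimal sketch of multi-agent collaborative data selection.

Assumptions (not from the abstract): agent names, scoring functions,
and the console's weight-update rule are illustrative placeholders.
"""
import random


class Agent:
    """One data selection method, wrapped as an independent agent."""

    def __init__(self, name, score_fn):
        self.name = name
        self.score_fn = score_fn

    def score(self, example):
        # Return this agent's selection score for a candidate example.
        return self.score_fn(example)


class AgentConsole:
    """Dynamically integrates information from all agents during training."""

    def __init__(self, agents):
        self.agents = agents
        # Start with uniform trust in every agent.
        self.weights = {a.name: 1.0 / len(agents) for a in agents}

    def select(self, pool, k):
        # Rank candidates by the weighted sum of per-agent scores.
        def combined(example):
            return sum(
                self.weights[a.name] * a.score(example) for a in self.agents
            )

        return sorted(pool, key=combined, reverse=True)[:k]

    def update(self, feedback):
        # feedback: a per-agent reward signal (e.g., how well an agent's
        # scores tracked observed training gains); reweight and renormalize
        # so agent influence evolves throughout training.
        for name, reward in feedback.items():
            self.weights[name] *= max(reward, 1e-6)
        total = sum(self.weights.values())
        self.weights = {n: w / total for n, w in self.weights.items()}


# Illustrative usage with toy agents and a toy candidate pool.
agents = [
    Agent("quality", lambda ex: min(len(ex) / 100.0, 1.0)),
    Agent("diversity", lambda ex: random.random()),
]
console = AgentConsole(agents)
pool = ["short text", "a somewhat longer candidate document", "x" * 80]
batch = console.select(pool, k=2)
console.update({"quality": 1.2, "diversity": 0.8})
```

In this sketch, conflicts between selection methods are resolved by the console rather than by any single agent: each agent keeps its own scoring criterion, and the console's weights shift toward agents whose feedback signal is stronger as training proceeds.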