Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a $k$-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.
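The abstract names three ingredients: a $k$-NN graph over sample embeddings, propagation of importance scores along that graph so that only some samples need recomputation, and a selection step that balances importance against representation diversity. The following is a minimal sketch of how such pieces could fit together, not GRACE's actual algorithm. Everything in it is an assumption made for illustration: the function names (`knn_graph`, `propagate_scores`, `select_coreset`), the mixing weight `alpha`, the diversity weight `lam`, Euclidean distance, and the greedy selection rule. The importance scores are treated as given numbers standing in for a gradient-based metric.

```python
import math


def knn_graph(embeddings, k):
    """For each sample, list the indices of its k nearest neighbours (Euclidean)."""
    n = len(embeddings)
    graph = []
    for i in range(n):
        by_distance = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n) if j != i
        )
        graph.append([j for _, j in by_distance[:k]])
    return graph


def propagate_scores(scores, graph, refreshed, alpha=0.5):
    """Update stale scores from neighbours whose scores were just recomputed.

    `refreshed` is the set of indices with fresh scores; every other sample
    blends its old score with the mean fresh score among its neighbours.
    (Hypothetical rule; the source does not specify the propagation formula.)
    """
    updated = list(scores)
    for i, neighbours in enumerate(graph):
        if i in refreshed:
            continue  # this score is already up to date
        fresh = [scores[j] for j in neighbours if j in refreshed]
        if fresh:
            updated[i] = (1 - alpha) * scores[i] + alpha * sum(fresh) / len(fresh)
    return updated


def select_coreset(embeddings, scores, budget, lam=1.0):
    """Greedily pick samples with high importance that lie far from those already picked."""
    n = len(embeddings)
    budget = min(budget, n)
    selected = []
    nearest = [math.inf] * n  # distance from each sample to its closest selected sample
    while len(selected) < budget:
        if not selected:
            best = max(range(n), key=lambda i: scores[i])
        else:
            best = max(
                (i for i in range(n) if i not in selected),
                key=lambda i: scores[i] + lam * nearest[i],
            )
        selected.append(best)
        for i in range(n):
            nearest[i] = min(nearest[i], math.dist(embeddings[i], embeddings[best]))
    return selected


# Toy data: two well-separated clusters of 2-D "embeddings".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
scores = [2.0, 0.9, 0.8, 0.0, 0.1, 0.1]  # samples 0 and 3 were just re-scored
graph = knn_graph(points, k=2)
smoothed = propagate_scores(scores, graph, refreshed={0, 3})
coreset = select_coreset(points, smoothed, budget=2)
```

On this toy data the two selected samples come from different clusters, since the diversity term outweighs the remaining importance scores within the first cluster. The brute-force neighbour search is quadratic in the number of samples; at realistic dataset sizes an approximate nearest-neighbour index would be the more plausible choice.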