Bilevel optimization has proven useful across a variety of machine learning settings, yet most practical algorithms require second-order information, making them challenging to scale. Only recently has a paradigm of first-order algorithms emerged that can effectively address bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces ScaleBiO, the first scalable instantiation of this paradigm, focusing on bilevel optimization for large-scale LLM data reweighting. By combining it with a recently proposed memory-efficient training technique called LISA, our algorithm scales the paradigm to 34-billion-parameter LLMs on eight A40 GPUs, marking the first successful application of bilevel optimization to large LLMs in a practical setting. Empirically, extensive data-reweighting experiments verify the effectiveness of ScaleBiO across models of different scales, including GPT-2, LLaMA-3-8B, GPT-NeoX-20B, and Yi-34B, where bilevel optimization succeeds in filtering out irrelevant data samples and selecting informative ones. Theoretically, ScaleBiO guarantees the optimality of the learned data weights and provides a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.