Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, reParameterized training via Orthogonal Equivalence Transformation (POET) has been proposed: a spectrum-preserving framework that optimizes each weight matrix through an orthogonal equivalence transformation. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations at significantly reduced computational cost. POET-X retains the generalization and stability benefits of POET while substantially improving throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single NVIDIA H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
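To make the reparameterization concrete, the following is a minimal PyTorch sketch of the orthogonal equivalence transformation that POET builds on. The Cayley-transform parameterization of the orthogonal factors, the module name OrthogonalEquivalenceLinear, and the helper cayley are illustrative assumptions, not POET's or POET-X's actual implementation; the sketch only shows why training R and P while freezing W0 keeps the spectrum of the weight fixed.

```python
import torch
import torch.nn as nn


def cayley(s: torch.Tensor) -> torch.Tensor:
    # Cayley transform: maps a skew-symmetric matrix S to an
    # orthogonal matrix (I + S)^{-1} (I - S).
    n = s.shape[0]
    eye = torch.eye(n, device=s.device, dtype=s.dtype)
    return torch.linalg.solve(eye + s, eye - s)


class OrthogonalEquivalenceLinear(nn.Module):
    """Reparameterizes a frozen weight W0 as R @ W0 @ P with R, P orthogonal.

    Multiplying by orthogonal matrices on both sides leaves the singular
    values of W0 unchanged, which is the spectrum-preserving property the
    abstract refers to. Only the generators of R and P are trained.
    """

    def __init__(self, w0: torch.Tensor):
        super().__init__()
        m, n = w0.shape
        self.register_buffer("w0", w0)             # frozen base weight
        self.a = nn.Parameter(torch.zeros(m, m))   # generator of R
        self.b = nn.Parameter(torch.zeros(n, n))   # generator of P

    def weight(self) -> torch.Tensor:
        r = cayley(self.a - self.a.T)  # skew-symmetrize, then orthogonalize
        p = cayley(self.b - self.b.T)
        return r @ self.w0 @ p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # w0 has shape (out_features, in_features), so apply weight().T.
        return x @ self.weight().T
```

At initialization a = b = 0, so R = P = I and the layer reproduces W0 exactly. The two dense multiplications in r @ self.w0 @ p, together with the linear solves inside the Cayley transform, illustrate the per-step matrix-multiplication overhead the abstract attributes to the original POET; reducing this cost is, per the abstract, precisely what POET-X targets.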