Gefen: Optimized Stochastic Optimizer

AdamW is a default optimizer for modern deep learning, but its first- and second-moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook. This reduces AdamW's optimizer-state memory footprint by ~8x, a saving of 6.5 GiB per billion parameters, while maintaining AdamW-level performance. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, which suggests that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers the block structure from the initial squared gradients, requiring no architecture-specific metadata and no hyperparameters beyond the AdamW defaults. Gefen learns the quantization codebook with an exact, histogram-based dynamic-programming procedure and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, making Gefen a practical drop-in replacement that can also enable training larger models. We provide the complete Python implementation, including fused CUDA kernels, at https://github.com/ndvbd/Gefen.
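The memory figures above can be sanity-checked with a short calculation. The sketch below is not taken from the Gefen codebase: it assumes both AdamW moments are stored in 32-bit floats (the abstract does not state the precision) and treats the "~8x" reduction as exactly 8. The function names `adamw_state_bytes` and `saved_gib` are hypothetical helpers introduced only for this illustration.

```python
GIB = 2**30  # bytes in one gibibyte


def adamw_state_bytes(num_params, bytes_per_value=4):
    """Bytes held by AdamW's two per-parameter buffers (first and second moment).

    Assumption: each buffer stores one value per parameter at
    `bytes_per_value` bytes (4 bytes = fp32).
    """
    return 2 * num_params * bytes_per_value


def saved_gib(num_params, reduction_factor=8.0, bytes_per_value=4):
    """GiB saved if the optimizer state shrinks by `reduction_factor`.

    Assumption: the reduction applies to the whole AdamW state.
    """
    baseline = adamw_state_bytes(num_params, bytes_per_value)
    reduced = baseline / reduction_factor
    return (baseline - reduced) / GIB


# One billion parameters, fp32 states, 8x reduction (all assumed values).
baseline_gib = adamw_state_bytes(1_000_000_000) / GIB
savings_gib = saved_gib(1_000_000_000)
print(f"baseline: {baseline_gib:.2f} GiB, saved: {savings_gib:.2f} GiB")
# prints "baseline: 7.45 GiB, saved: 6.52 GiB"
```

Under these assumptions the saving comes out at about 6.5 GiB per billion parameters, which matches the figure quoted in the abstract. A different state precision or a reduction factor other than 8 would change the result.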
