In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard deep learning methods such as AdamW or SGD with momentum. After providing geometric insights, we evaluate INNAprop on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT, and on GPT-2 (OpenWebText) trained from scratch and fine-tuned with LoRA (E2E). INNAprop consistently matches or outperforms AdamW in both training speed and accuracy, with minimal hyperparameter tuning in large-scale settings. Our code is publicly available at \url{https://github.com/innaprop/innaprop}.