New Optimizer SGD-SaI Challenges Adam in Deep Learning Training

Optimizer in Focus: New Research Questions Adam

The world of deep learning is in constant motion. A new research article titled "No More Adam: Learning Rate Scaling at Initialization is All You Need" is currently sparking discussion. Its authors, Minghao Xu, Lichuan Xiang, Xu Cai, and Hongkai Wen, question whether adaptive gradient methods such as Adam are necessary for training neural networks and present an alternative approach: SGD-SaI.

SGD-SaI: Scaling at Initialization

SGD-SaI builds on the well-known Stochastic Gradient Descent with Momentum (SGDM) and extends it with one crucial ingredient: scaling the learning rate at initialization (SaI). At its core, the method looks at the gradient signal-to-noise ratio (g-SNR) of each parameter group and uses it to fix per-group learning rates once, before training starts. By adjusting learning rates without maintaining adaptive second-moment estimates, SGD-SaI aims to prevent training imbalances from the very first step while roughly halving the optimizer's memory footprint compared to AdamW.
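
The paper's exact g-SNR definition and scaling rule are not reproduced here; the following PyTorch sketch only illustrates the general idea under assumed choices. It estimates a g-SNR per parameter group from a few warm-up batches, derives fixed per-group learning rates from it, and then trains with plain SGD with momentum. The helper names (`gsnr_per_group`, `scaled_lrs`) and the specific formulas are illustrative assumptions, not the authors' reference implementation.

```python
# Illustrative sketch of "scaling at initialization" on top of SGD with momentum.
# Assumption: g-SNR is taken as the norm of the mean gradient divided by the norm
# of its element-wise standard deviation, estimated over a few warm-up batches.
import torch
import torch.nn as nn

def gsnr_per_group(grad_samples):
    """grad_samples: list of flattened gradients for one parameter group,
    collected over several warm-up batches."""
    g = torch.stack(grad_samples)              # (num_batches, num_params)
    mean = g.mean(dim=0)
    std = g.std(dim=0) + 1e-8                  # avoid division by zero
    return (mean.norm() / std.norm()).item()   # scalar signal-to-noise ratio

def scaled_lrs(base_lr, gsnrs):
    """Fix each group's learning rate at initialization, proportional to its
    g-SNR relative to the other groups (one of several plausible choices)."""
    total = sum(gsnrs.values())
    return {name: base_lr * len(gsnrs) * s / total for name, s in gsnrs.items()}

# --- toy usage ------------------------------------------------------------
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

# Collect gradients from a few warm-up batches, grouped by parameter tensor.
samples = {name: [] for name, _ in model.named_parameters()}
for _ in range(8):
    x, y = torch.randn(64, 16), torch.randn(64, 1)
    model.zero_grad()
    loss_fn(model(x), y).backward()
    for name, p in model.named_parameters():
        samples[name].append(p.grad.detach().flatten().clone())

gsnrs = {name: gsnr_per_group(g) for name, g in samples.items()}
lrs = scaled_lrs(base_lr=1e-2, gsnrs=gsnrs)

# Plain SGD with momentum; the per-group learning rates stay fixed afterwards,
# so no second-moment state is ever stored.
optimizer = torch.optim.SGD(
    [{"params": [p], "lr": lrs[name]} for name, p in model.named_parameters()],
    lr=1e-2, momentum=0.9,
)
```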

Performance Comparison

The authors present results suggesting that SGD-SaI matches or even surpasses AdamW on a range of transformer-based tasks. It performs particularly well when training Vision Transformers (ViT) for ImageNet-1K classification and when pre-training GPT-2 for large language models (LLMs). The new optimizer is also reported to outperform existing approaches when fine-tuning LLMs with LoRA and when training diffusion models. Memory efficiency is another important aspect: SGD-SaI substantially reduces the memory required for optimizer states. Compared to AdamW, it saves 5.93 GB on GPT-2 (1.5 billion parameters) and as much as 25.15 GB on Llama2-7B when training in full precision.
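
These savings are consistent with dropping one optimizer state tensor per parameter. As a rough sanity check (the parameter counts and state layout below are assumptions, not figures from the paper), assume AdamW keeps two fp32 state tensors per parameter (first and second moment) while a momentum-only method keeps one:

```python
# Back-of-the-envelope estimate of optimizer-state savings in full precision.
GIB = 2**30

for name, n_params in [("GPT-2 (~1.56B params)", 1.56e9), ("Llama2-7B (~6.74B params)", 6.74e9)]:
    adamw_state = 2 * n_params * 4          # bytes: momentum + second moment
    sgdm_state = 1 * n_params * 4           # bytes: momentum only
    saved_gib = (adamw_state - sgdm_state) / GIB
    print(f"{name}: ~{saved_gib:.2f} GiB of optimizer state saved")
# Prints roughly 5.81 GiB and 25.11 GiB, in line with the reported 5.93 GB and 25.15 GB.
```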

Implications for Practice

The results of this study could have far-reaching consequences for the training of large neural networks. In resource-intensive settings such as LLM training, memory efficiency is a decisive factor. If SGD-SaI holds up in further studies, this could lead to a shift away from adaptive optimizers like Adam, and its simplicity and efficiency make it an attractive candidate for future deep learning projects. For companies like Mindverse, which specialize in AI-powered content creation and customized AI solutions, such developments are of great interest: more efficient optimizers enable the training of more complex models and accelerate the development of innovative AI applications.

Outlook and Further Research

Although the results are promising, further research is needed to comprehensively evaluate the strengths and weaknesses of SGD-SaI. Comparisons with other optimization methods, tests on various architectures, and a more detailed analysis of the effects of g-SNR-based learning rate scaling are essential. The community eagerly awaits further studies that demonstrate the robustness and applicability of SGD-SaI in practice.