One change at a time: how to compare SGD, signSGD, spectral descent, LARS, etc.?
Posted on 2025-07-06
When comparing SGD and signSGD we are making two changes at once (possibly without realizing it):

- we are changing the geometry (the norm), and
- we are normalizing the update.
To see this, it is useful to write (normalized) steepest descent in terms of the linear minimization oracle (LMO)

\[
\operatorname{lmo}(d) \in \operatorname*{arg\,min}_{x \in \mathcal D} \langle d, x\rangle,
\]

where the constraint set is the norm ball \(\mathcal D = \{x \mid \Vert x \Vert \leq 1\}\). We have the following special cases:
| Method | Norm | Steepest descent | Normalized steepest descent |
|---|---|---|---|
| General | \(\Vert \cdot\Vert\) | \(x^{k+1}=x^k + \gamma_k\Vert d^k\Vert_* \operatorname{lmo}(d^k)\) | \(x^{k+1}=x^k + \gamma_k \operatorname{lmo}(d^k)\) |
| Gradient descent | \(\Vert \cdot\Vert_2\) | \(\color{red}x^{k+1}=x^k - \gamma_k d^k\) | \(x^{k+1}=x^k - \gamma_k \tfrac{d^k}{\Vert d^k\Vert_2}\) |
| Sign descent | \(\Vert \cdot\Vert_\infty\) | \(x^{k+1}=x^k - \gamma_k\Vert d^k\Vert_1\operatorname{sign}(d^k)\) | \(\color{red}x^{k+1}=x^k - \gamma_k\operatorname{sign}(d^k)\) |
| Spectral descent | \(\Vert \cdot\Vert_{\mathcal{S}_\infty}\) | \(x^{k+1}=x^k - \gamma_k\Vert \sigma^k\Vert_1\, U^k(V^k)^\top\) | \(x^{k+1}=x^k - \gamma_k U^k(V^k)^\top\) |
where the reduced SVD is given as \(d^k=U^k\operatorname{diag}(\sigma^k)(V^k)^\top\) and \(d^k\) is some gradient feedback.
SGD and signSGD are highlighted in red. We argue that a more natural comparison to signSGD is normalized SGD. In a similar vein, Muon should be compared with normalized GD, just as (unnormalized) spectral descent should be compared with GD.
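To make the table concrete, here is a minimal NumPy sketch (an illustration, not code from any of the cited works) of the lmo and the dual-norm scaling for the three geometries; `d` plays the role of the gradient feedback \(d^k\):

```python
import numpy as np

def lmo_l2(d):
    # argmin over the l2 ball of <d, x>  ->  -d / ||d||_2
    return -d / np.linalg.norm(d)

def lmo_sign(d):
    # argmin over the l_inf ball of <d, x>  ->  -sign(d)
    return -np.sign(d)

def lmo_spectral(D):
    # argmin over the spectral-norm ball of <D, X>  ->  -U V^T from the reduced SVD
    U, _, Vt = np.linalg.svd(D, full_matrices=False)
    return -U @ Vt

def dual_norm(d, geometry):
    if geometry == "l2":        # dual of l2 is l2
        return np.linalg.norm(d)
    if geometry == "sign":      # dual of l_inf is l1
        return np.abs(d).sum()
    if geometry == "spectral":  # dual of the spectral norm is the nuclear norm ||sigma||_1
        return np.linalg.svd(d, compute_uv=False).sum()

def step(x, d, gamma, geometry, normalized=True):
    lmo = {"l2": lmo_l2, "sign": lmo_sign, "spectral": lmo_spectral}[geometry]
    scale = 1.0 if normalized else dual_norm(d, geometry)
    return x + gamma * scale * lmo(d)   # x^{k+1} = x^k + gamma_k ||d^k||_* lmo(d^k)
```

Passing `normalized=False` recovers the steepest descent column of the table, while `normalized=True` gives the normalized variants.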
Why do we care about disentangling the changes? Normalized methods have very different properties from their non-normalized variants:

- One benefit of normalization is that the initial stepsize does not have to be taken small enough to ensure convergence (although we do need \(\gamma_k\rightarrow 0\)). In comparison, steepest descent requires a small enough stepsize \(\gamma_k < 2/L\), otherwise the method might diverge. Intuitively, the benefit comes from the normalization making the update invariant to the gradient magnitude.
- One downside is that, due to the same invariance, normalization does not adapt to strong convexity/PL. In contrast, steepest descent enjoys linear convergence; intuitively, because steps are larger when farther from the solution, due to the update's dependence on the gradient magnitude.
For instance, for a simple quadratic problem like \(\min_x \Vert x\Vert^2_2\), we should expect the steepest descent methods (GD, sign descent with dual-norm scaling) to converge much faster than their normalized counterparts (normalized GD, signSGD) when stepsizes are set appropriately.
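A quick sanity check of this intuition on the quadratic (a toy sketch; the specific stepsize choices below are mine and not tuned):

```python
import numpy as np

# f(x) = ||x||_2^2, grad f(x) = 2x, so L = 2 and GD needs gamma < 2/L = 1.
x_gd = np.ones(10)
x_ngd = np.ones(10)
gamma_gd = 0.3                               # fixed stepsize below 2/L

for k in range(50):
    x_gd = x_gd - gamma_gd * (2 * x_gd)      # steepest descent (GD)

    g = 2 * x_ngd
    gamma_k = 1.0 / (k + 1)                  # normalized GD needs gamma_k -> 0
    x_ngd = x_ngd - gamma_k * g / np.linalg.norm(g)

print(np.linalg.norm(x_gd))    # tiny (~1e-19): linear convergence
print(np.linalg.norm(x_ngd))   # ~1e-2: only slow progress without magnitude information
```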
The fact that normalization works much better in practice for deep learning suggests that we are not in the strongly convex/PL case (or that we are at least currently prevented from exploiting such structure even if present).
Layerwise/block normalization
Normalized updates alone have been observed to work very well for large batch sizes if the updates are normalized layerwise (see e.g. LARS [You et al., 2017] and other block-normalization schemes [Ginsburg et al., 2019, Yu et al., 2017]).
Consider the case where the parameter \(x\) decomposes into \(n\) blocks, i.e., \(x=(W_1,\dots,W_n)\), with gradient estimator \(d=(D_1,\dots,D_n)\). The block-normalized update implicitly chooses the norm \(\Vert x\Vert = \max_l \Vert W_l \Vert_F\). If we instead use global \(\ell_2\) normalization, we are implicitly picking \(\ell_2\) over the layers, i.e., \(\Vert x\Vert_2 = \Vert (\Vert W_l\Vert_F)_l \Vert_2\).
Some popular norm choices:
| Name | Norm | Normalized steepest descent |
|---|---|---|
| Euclidean | \(\Vert x\Vert_2 = \Vert (\Vert W_l\Vert_F)_l \Vert_2\) | \(x^{k+1}=x^k - \gamma_k \tfrac{d^k}{\Vert d^k\Vert_2}\) |
| Layerwise Frobenius | \(\max_l \Vert W_l\Vert_F\) | \(W_l^{k+1}=W_l^k - \gamma_k \tfrac{D^k_l}{\Vert D^k_l\Vert_F}\) |
| Layerwise Sign | \(\max_l \Vert W_l\Vert_{1 \rightarrow \infty}=\max_l \max_{i,j} \lvert[W_l]_{i,j}\rvert\) | \(W_l^{k+1}=W_l^k - \gamma_k\operatorname{sign}(D^k_l)\) |
| Layerwise RowNorm | \(\max_l \Vert W_l\Vert_{2 \rightarrow \infty}=\max_l \max_i \Vert [W_l]_{i,\cdot}\Vert_2\) | \([W_l^{k+1}]_{i,\cdot}=[W_l^k]_{i,\cdot} - \gamma_k \tfrac{[D^k_l]_{i,\cdot}}{\Vert [D^k_l]_{i,\cdot}\Vert_2}\) |
| Layerwise Spectral | \(\max_l \Vert W_l\Vert_{2\rightarrow 2}\) | \(W_l^{k+1}=W_l^k - \gamma_k U^k_l(V^k_l)^\top\) |
There are a couple of choices being made simultaneously here:

- There is an implicit choice of the norm over layers. The global \(\ell_2\) implicitly picks \(\ell_2\) over layers, while the remaining choices use the max-norm over layers. One could similarly consider, e.g., the spectral norm with \(\ell_2\) over layers, i.e., \(\Vert x\Vert = \Vert (\Vert W_l\Vert_{2\rightarrow 2})_l \Vert_2\). A comparison should ideally make one change at a time and, for example, compare layerwise Frobenius with layerwise spectral rather than with global \(\ell_2\) normalization.
- There is an implicit choice of the type of matrix norm. Is it an operator norm? Is it a Schatten norm? This has implications in the stochastic case, where Sign and Spectral should be expected to behave differently (more on this later).
One thing that can be gauged from the table above is that RowNorm is a closer approximation to Spectral than Sign is. This might explain the popularity of row normalization in practice, e.g., in nGPT [Loshchilov et al., 2024], optimizers for diffusion models [Karras et al., 2024] and DASGO [An et al., 2025] (see the recent Kovalev [2025] for a connection with RowNorm).
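As a rough sketch (my own helper, not a library API) of how the layerwise variants in the table differ, here is one way to implement the per-block updates for a list of weight matrices:

```python
import numpy as np

def layerwise_step(weights, grads, gamma, kind):
    """One layerwise normalized steepest-descent step (a sketch)."""
    new_weights = []
    for W, D in zip(weights, grads):
        if kind == "frobenius":      # LARS-style: normalize each block by its Frobenius norm
            update = D / np.linalg.norm(D)
        elif kind == "sign":         # elementwise sign (coincides with the global sign update)
            update = np.sign(D)
        elif kind == "rownorm":      # normalize every row by its l2 norm
            update = D / np.linalg.norm(D, axis=1, keepdims=True)
        elif kind == "spectral":     # U V^T from the reduced SVD of the block gradient
            U, _, Vt = np.linalg.svd(D, full_matrices=False)
            update = U @ Vt
        new_weights.append(W - gamma * update)
    return new_weights
```

The only difference between the variants is the per-block normalization; the max-norm over layers (a single stepsize \(\gamma_k\) shared by all blocks) is the same in every case.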
Momentum-based gradient estimator
Another element in these methods is the choice of the gradient estimator \(d^k\), which turns out to be crucial for establishing convergence in the stochastic case.
For SGD, \(d^k\) is simply picked as \(d^k=\nabla f(x^k,\xi_k)\) for some randomly sampled minibatch \(\xi_k\). Both SGD and normalized SGD converge under standard assumptions (unbiasedness and bounded variance) [Hazan et al., 2015].
In the non-Euclidean case, convergence is unfortunately not guaranteed (at least not without trivially killing the noise with an increasing batch size), since the LMO can be non-linear: the update \(\operatorname{lmo}(d^k)\) can be biased as soon as \(d^k\) has variance, regardless of whether \(d^k\) itself is unbiased.
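For intuition, here is a tiny simulation (my own) of how a non-linear lmo turns unbiased gradient noise into a biased update: the expected sign of a noisy gradient differs from the sign of the expected gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([0.1, -0.1, 2.0])                 # "true" gradient
samples = g + rng.normal(size=(100_000, 3))    # unbiased stochastic gradients

print(samples.mean(axis=0))            # ~[ 0.1, -0.1, 2.0]: the gradient estimates are unbiased
print(np.sign(samples).mean(axis=0))   # ~[ 0.08, -0.08, 0.95]: E[sign(g + noise)] != sign(g)
```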
The issue can be resolved with the following estimator, which is sometimes referred to as a momentum estimator due to its connection with Polyak momentum in the Euclidean case:

\[
d^{k} = (1-\alpha_k)\, d^{k-1} + \alpha_k \nabla f(x^k, \xi_k).
\]
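Putting the pieces together, a minimal sketch (mine, not the referenced authors' implementation) of a stochastic non-Euclidean method with this momentum estimator; `stochastic_grad` and `lmo` are placeholders supplied by the user:

```python
import numpy as np

def non_euclidean_sgd(x0, stochastic_grad, lmo, gamma, alpha, num_steps):
    """Normalized steepest descent with an EMA (momentum) gradient estimator."""
    x = x0
    d = stochastic_grad(x)                  # initialize the estimator with one sample
    for k in range(num_steps):
        x = x + gamma(k) * lmo(d)           # normalized lmo step
        g = stochastic_grad(x)              # fresh stochastic gradient
        d = (1 - alpha) * d + alpha * g     # momentum estimator (constant alpha for simplicity)
    return x
```

With `lmo_spectral` from the earlier sketch this is roughly a bare-bones Muon-style update (Muon replaces the exact SVD with Newton-Schulz iterations), while with `lmo_sign` it is roughly signSGD with momentum.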
Single sample. There is an interesting exception where convergence can be established without the momentum estimator: the gradient of a weight matrix in an MLP computed over a single sample is rank-1 (see e.g. [Yang et al., 2023]).
In that case, all Schatten norms are equal (so Frobenius = Spectral = Nuclear), which means that spectral methods become equivalent to their \(\ell_2\) counterparts (e.g., Muon = normalized SGD with momentum). So spectral descent converges in the single-sample case, since SGD does! (Strictly speaking, the equivalence holds when \(\ell_2\) is picked as the norm over layers, i.e., \(\Vert (\Vert W_1\Vert_{\mathcal S_{\infty}}, \dots, \Vert W_n\Vert_{\mathcal S_{\infty}})\Vert_2\).)
In contrast, we do not have that equivalence between signSGD and SGD under low-rank structure (the equivalence rather holds when the gradients are as sparse as possible, namely a 1-hot vector).
Why is this interesting? We should expect signSGD not to converge for batch size 1, whereas spectral descent (with or without normalization) should in fact converge, and be equivalent to SGD.
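A quick numerical check (my own) of the rank-1 claim: for a rank-1 gradient, the spectral update \(UV^\top\) coincides with the Frobenius-normalized update, while the sign update does not.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=5)
D = np.outer(a, b)                        # rank-1 "single-sample" gradient of a weight matrix

U, s, Vt = np.linalg.svd(D, full_matrices=False)
spectral_update = U[:, :1] @ Vt[:1, :]    # U V^T keeping only the nonzero singular value
frob_update = D / np.linalg.norm(D)       # l2 / Frobenius-normalized update

print(np.allclose(spectral_update, frob_update))                # True: identical for rank-1 D
print(np.allclose(np.sign(D) / np.sqrt(D.size), frob_update))   # False: sign is not equivalent
```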
The difference between averaging gradients and iterates. Coming back to the momentum estimator: for non-Euclidean methods, the momentum (i.e., averaging past gradients) is crucial for establishing convergence, since it reduces the variance of the estimator. Averaging the iterates instead would not lead to a convergent method (see e.g. the analysis of ALMOND in Pethick et al. [2025]).
In the Euclidean case, averaging the gradients (dual feedback) as in the momentum estimator and averaging the iterates (primal variables) can be equivalent. For example, Polyak momentum can be written both in terms of averaging dual feedback (\(x^{k+1} = x^k - \gamma v^{k+1}\) with \(v^{k+1} = \sum_{l=0}^{k} \beta^l \nabla f(x^{k-l})\)) and in terms of primal averaging (see e.g. Defazio [2020], Tao et al. [2018]). This leads to two additional properties:
- in the Euclidean and deterministic case, gradient descent with momentum applied to a convex quadratic achieves an accelerated convergence rate;
- in the Euclidean and stochastic case, stochastic gradient descent with momentum enjoys tight last-iterate rates [Defazio, 2020, Tao et al., 2018].
Even though the two mechanisms, primal averaging and averaging dual feedback, can sometimes be equivalent, they generally have different properties.
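As a small numerical check of the equivalence (my own parameterization, assuming constant stepsize \(\alpha\) and momentum \(\beta\)): the heavy-ball form (averaging dual feedback) and a primal-averaging form generate the same iterates when \(\eta = \alpha/(1-\beta)\) and \(c = 1-\beta\).

```python
import numpy as np

def grad(x):                              # toy quadratic f(x) = 0.5 * ||x||^2
    return x

alpha, beta = 0.1, 0.9                    # heavy-ball stepsize and momentum
eta, c = alpha / (1 - beta), 1 - beta     # matching primal-averaging parameters

x_hb, m = np.ones(3), np.zeros(3)         # heavy ball: average past gradients (dual feedback)
x_pa, z = np.ones(3), np.ones(3)          # primal averaging: average the iterates

for _ in range(100):
    m = beta * m + grad(x_hb)             # v^{k+1} = beta v^k + grad f(x^k)
    x_hb = x_hb - alpha * m               # x^{k+1} = x^k - gamma v^{k+1}

    z = z - eta * grad(x_pa)              # big step on the auxiliary iterate
    x_pa = (1 - c) * x_pa + c * z         # average the iterates

print(np.allclose(x_hb, x_pa))            # True: the two forms generate the same iterates
```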
Overall, to develop a better empirical understanding we should probably be more careful about changing one thing at a time.
- ALP+25
Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: adaptive structured gradient optimization. arXiv preprint arXiv:2503.20762, 2025.
- Def20
Aaron Defazio. Momentum via primal averaging: theoretical insights and learning rate schedules for non-convex optimization. arXiv preprint arXiv:2010.00406, 2020.
- GCH+19
Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.
- HLSS15
Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: stochastic quasi-convex optimization. Advances in neural information processing systems, 2015.
- KAL+24
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24174–24184. 2024.
- Kov25
Dmitry Kovalev. SGD with adaptive preconditioning: unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803, 2025. URL: https://arxiv.org/abs/2506.23803.
- LHSG24
Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. nGPT: normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131, 2024.
- PXA+25
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529, 2025.
- TPWT18
Wei Tao, Zhisong Pan, Gaowei Wu, and Qing Tao. Primal averaging: a new gradient evaluation step to attain the optimal individual convergence. IEEE transactions on cybernetics, 50(2):835–845, 2018.
- YSB23
Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023.
- YGG17
Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
- YHL+17
Adams Wei Yu, Lei Huang, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Block-normalized gradient method: an empirical study for training deep neural network. arXiv preprint arXiv:1707.04822, 2017.