Why Frobenius weight normalization is ok
Posted on 2026-05-13
It has been proposed to normalize weights by the Frobenius norm even when the underlying optimizer is a non-Euclidean method such as Muon. Is this justified? Do we preserve convergence guarantees?
TL;DR: If a matrix weight is immediately followed by RMSNorm, then globally rescaling that matrix does not change the function value. Frobenius normalization only performs such a global rescaling, so the descent inequality is preserved.
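To make this precise, consider a matrix weight \(W\) followed by RMSNorm, so that the layer output depends on \(W\) only through \(\operatorname{RMSNorm}(Wx)\). For simplicity, take the idealized \(\epsilon=0\) version of RMSNorm and ignore any learned gain, i.e. \(\operatorname{RMSNorm}(z)=z/\sqrt{\tfrac1d\sum_{i=1}^d z_i^2}\) for \(z\in\mathbb R^d\).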
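Now consider a steepest descent update for \(W\) based on a norm \(\|\cdot\|\). For a matrix \(G\), define the sharp operator by
\[
G^\# \in \operatorname*{arg\,max}_{S}\left\{\langle G,S\rangle-\tfrac12\|S\|^2\right\},
\]
and define the projection onto the Frobenius sphere \(\mathcal S_F(\rho)=\{S:\|S\|_F=\rho\}\) by
\[
\begin{aligned}
\Pi_{\mathcal S_F(\rho)}(W) &= \operatorname*{arg\,min}_{S\in\mathcal S_F(\rho)}\|S-W\|_F\\
&= \rho\,\frac{W}{\|W\|_F},
\end{aligned}
\]
where the last line holds for \(W\neq0\).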
Then we can write Frobenius normalization followed by steepest descent in the possibly non-Euclidean norm \(\|\cdot\|\) as [1]
\[
\begin{aligned}
\widetilde W^k &= \Pi_{\mathcal S_F(\rho)}(W^k) = \rho\,\frac{W^k}{\|W^k\|_F},\\
W^{k+1} &= \widetilde W^k-\eta_k\,[\nabla f(\widetilde W^k)]^\#.
\end{aligned}
\]
[1] We could also write the update in terms of the linear minimization oracle (LMO). Let \(G_k=\nabla f(\widetilde W^k)\) and let \(\operatorname{lmo}(G_k)\in\operatorname*{arg\,min}_{\|S\|\le1}\langle G_k,S\rangle\). Then \(G_k^\#=-\|G_k\|_*\operatorname{lmo}(G_k)\).
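The normalize-then-step update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the norm is taken to be the spectral norm (an idealized, momentum-free Muon-style step computed with an exact SVD rather than Newton-Schulz iterations), the loss is a hypothetical toy objective \(f(W)=\tfrac12\|\operatorname{RMSNorm}(Wx)-t\|^2\), and the names `project_frobenius`, `sharp_spectral`, and `loss_and_grad` are made up for this example.

```python
import numpy as np

def loss_and_grad(W, x, t):
    """Toy loss f(W) = 0.5 * ||RMSNorm(W x) - t||^2 (eps = 0, no gain) and its gradient."""
    z = W @ x
    r = np.sqrt(np.mean(z ** 2))
    y = z / r
    e = y - t
    # Backpropagate through y = z / rms(z): dy_i/dz_j = (delta_ij - y_i y_j / n) / r.
    grad_z = (e - y * (y @ e) / z.size) / r
    return 0.5 * (e @ e), np.outer(grad_z, x)

def project_frobenius(W, rho):
    """Projection onto the Frobenius sphere of radius rho (requires W != 0)."""
    return rho * W / np.linalg.norm(W)  # np.linalg.norm is Frobenius for matrices

def sharp_spectral(G):
    """Sharp operator for the spectral norm: nuclear norm of G times U V^T."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return s.sum() * (U @ Vt)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)        # hypothetical input
t = rng.standard_normal(4)        # hypothetical target
W = rng.standard_normal((4, 3))   # hypothetical initial weight
rho, eta = 1.0, 1e-2              # illustrative radius and stepsize

history = []
for k in range(5):
    W_tilde = project_frobenius(W, rho)        # Frobenius normalization
    f_tilde, G = loss_and_grad(W_tilde, x, t)  # gradient at the normalized iterate
    W = W_tilde - eta * sharp_spectral(G)      # steepest descent step
    f_next, _ = loss_and_grad(W, x, t)
    f_next_proj, _ = loss_and_grad(project_frobenius(W, rho), x, t)
    history.append((np.linalg.norm(W_tilde), f_tilde, f_next, f_next_proj))
    print(k, f_tilde, f_next, f_next_proj)
```

In each iteration the last two printed values should agree up to floating-point error, since projecting \(W^{k+1}\) back onto the sphere is a positive rescaling and the toy loss is scale invariant. The stepsize here is not tuned against any smoothness constant, so the sketch makes no claim about monotone decrease.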
For any positive scalar \(c>0\), \(\operatorname{RMSNorm}(c\,Wx)=\operatorname{RMSNorm}(Wx)\). Thus Frobenius normalization leaves the normalized layer output unchanged. [2]
[2] The same invariance can also hold for weights that are not immediately followed by RMSNorm. For example, in an MLP block of the form \(\operatorname{RMSNorm}(W_2 \sigma(W_1x))\), scaling \(W_1\) by a positive constant leaves the mapping unchanged as long as the activation function \(\sigma\) is positively homogeneous, as is the case for ReLU and ReLU\(^2\).
This property propagates to the loss \(f\):
Scale invariance. For every \(a>0\),
\[
f(aW)=f(W).
\]
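The invariance is easy to check numerically. The snippet below is a minimal sketch assuming the idealized RMSNorm (\(\epsilon=0\), no gain) and a hypothetical toy loss `toy_loss` that sees \(W\) only through \(\operatorname{RMSNorm}(Wx)\); the shapes, the radius `rho`, and the scale factors are arbitrary choices for illustration.

```python
import numpy as np

def rms_norm(z):
    """Idealized RMSNorm: epsilon = 0 and no learned gain."""
    return z / np.sqrt(np.mean(z ** 2))

def toy_loss(W, x, target):
    """Hypothetical loss that depends on W only through RMSNorm(W x)."""
    return 0.5 * np.sum((rms_norm(W @ x) - target) ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # hypothetical weight matrix
x = rng.standard_normal(3)           # hypothetical input
target = rng.standard_normal(4)      # hypothetical target

rho = 2.0
# Frobenius normalization is a global positive rescaling of W
# (np.linalg.norm defaults to the Frobenius norm for matrices).
W_tilde = rho * W / np.linalg.norm(W)

# The normalized layer output is unchanged by the rescaling ...
layer_invariant = np.allclose(rms_norm(W @ x), rms_norm(W_tilde @ x))

# ... and so is the loss, for any positive scale a.
loss_invariant = all(
    np.isclose(toy_loss(a * W, x, target), toy_loss(W, x, target))
    for a in (1e-3, 0.5, 7.0, 1e3)
)
print(layer_invariant, loss_invariant)
```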
Also assume smoothness in the norm used to define the sharp operator:
Smoothness. The objective \(f\) is \(L\)-smooth with respect to \(\|\cdot\|\), meaning, for all \(X,Y\),
\[
f(Y)\le f(X)+\langle\nabla f(X),Y-X\rangle+\frac{L}{2}\|Y-X\|^2.
\]
Theorem 4
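Suppose scale invariance and \(L\)-smoothness hold. If \(0<\eta_k<2/L\), then the sequence \((\widetilde W^{k})_{k\in \mathbb N}\) satisfies
\[
f(\widetilde W^{k+1})\le f(\widetilde W^k)-\eta_k\left(1-\frac{L\eta_k}{2}\right)\|\nabla f(\widetilde W^k)\|_*^2,
\]
where \(\|\cdot\|_*\) denotes the dual norm of \(\|\cdot\|\).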
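Proof. Apply \(L\)-smoothness with \(X=\widetilde W^k\) and \(Y=W^{k+1}=\widetilde W^k-\eta_k[\nabla f(\widetilde W^k)]^\#\). Writing \(G_k=\nabla f(\widetilde W^k)\), we get
\[
\begin{aligned}
f(W^{k+1}) &\le f(\widetilde W^k)-\eta_k\langle G_k,G_k^\#\rangle+\frac{L\eta_k^2}{2}\|G_k^\#\|^2\\
&= f(\widetilde W^k)-\eta_k\left(1-\frac{L\eta_k}{2}\right)\|G_k\|_*^2,
\end{aligned}
\]
where we used \(\langle G_k,G_k^\#\rangle=\|G_k^\#\|^2=\|G_k\|_*^2\), which follows from the optimality condition defining the sharp operator. Finally, since \(\widetilde W^{k+1}\) is a positive rescaling of \(W^{k+1}\), scale invariance gives \(f(\widetilde W^{k+1})=f(W^{k+1})\), and the result follows.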
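The identity \(\langle G,G^\#\rangle=\|G^\#\|^2=\|G\|_*^2\) used in the proof can be checked numerically for concrete norms. The sketch below assumes two standard choices that are not singled out in the post: the spectral norm (dual: nuclear norm, sharp operator \(\|G\|_*UV^\top\) from the reduced SVD) and the max-abs-entry norm (dual: entrywise \(\ell_1\) norm, sharp operator \(\|G\|_1\operatorname{sign}(G)\)). The function names are hypothetical.

```python
import numpy as np

def sharp_spectral(G):
    """Sharp operator for the spectral norm; the dual norm is the nuclear norm."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return s.sum() * (U @ Vt)

def sharp_max_abs(G):
    """Sharp operator for the max-abs-entry norm; the dual norm is the entrywise l1 norm."""
    return np.abs(G).sum() * np.sign(G)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))  # stand-in for a gradient matrix

# Each case: (name, sharp operator, primal norm, dual norm).
cases = [
    ("spectral", sharp_spectral,
     lambda S: np.linalg.norm(S, 2), lambda M: np.linalg.norm(M, "nuc")),
    ("max_abs", sharp_max_abs,
     lambda S: np.abs(S).max(), lambda M: np.abs(M).sum()),
]

checks = {}
for name, sharp, primal, dual in cases:
    S = sharp(G)
    inner = np.sum(G * S)  # trace inner product <G, G^#>
    checks[name] = (inner, primal(S) ** 2, dual(G) ** 2)
    print(name, checks[name])
```

For each norm the three printed numbers should coincide up to floating-point error, which is exactly what turns the smoothness bound into the stated descent inequality.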
The theorem gives a descent inequality in the normalized iterates, which can then be telescoped in the usual way, whenever the stepsize satisfies \(0<\eta_k<2/L\).
So, at least in the scale-invariant setting due to layer normalization, Frobenius normalization does not break the descent lemma. It just chooses a representative of the same function with controlled Frobenius norm. The argument is not specific to the Frobenius norm, but it does rely on the normalization being a global positive rescaling, so that replacing \(W^k\) by \(\widetilde W^k\) does not change the function value.
Cite this post
@misc{pethick2026frobeniusnormalization,
author = {Thomas Pethick},
title = {Why Frobenius weight normalization is ok},
year = {2026},
month = {05},
day = {13},
url = {https://pethick.dk/posts/2026-05-13-frobenius-normalization/},
note = {Blog post}
}