Why Frobenius weight normalization is ok
Posted on 2026-05-13
It has been proposed to normalize weights by the Frobenius norm even when the underlying optimizer is a non-Euclidean method such as Muon. Is this justified? Do we preserve convergence guarantees?
TL;DR: If a matrix weight is immediately followed by RMSNorm, then globally rescaling that matrix does not change the function value. Frobenius normalization only performs such a global rescaling, so the descent inequality is preserved.
To make this precise, consider a matrix weight \(W\) followed by RMSNorm, so the layer output depends on \( \operatorname{RMSNorm}(Wx). \) For simplicity, take the idealized \(\epsilon=0\) version of RMSNorm and ignore any learned gain, so that \( \operatorname{RMSNorm}(z) = z/\|z\|_{\mathrm{RMS}} \) with \( \|z\|_{\mathrm{RMS}} = \|z\|_2/\sqrt{d} \) for \(z \in \mathbb{R}^d\).
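To see the scale invariance concretely, here is a minimal numerical check, with the idealized \(\epsilon=0\), gain-free RMSNorm written out in NumPy (the function names and test values are mine, chosen for illustration):

```python
import numpy as np

def rmsnorm(z):
    # Idealized RMSNorm: z / rms(z) with rms(z) = ||z||_2 / sqrt(d),
    # eps = 0, and no learned gain, matching the setup in the text.
    return z / (np.linalg.norm(z) / np.sqrt(z.size))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# Globally rescaling W leaves the normalized layer output unchanged.
for c in (0.1, 1.0, 7.3):
    assert np.allclose(rmsnorm(c * W @ x), rmsnorm(W @ x))
```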
Now consider a steepest descent update for \(W\) based on a norm \(\|\cdot\|\). For a matrix \(G\), define the sharp operator by
\[ G^\sharp \in \arg\max_{\|T\| \le 1} \langle G, T \rangle \]
(which satisfies \(\langle G, G^\sharp \rangle = \|G\|_\dagger\), where \(\|\cdot\|_\dagger\) denotes the dual norm),
and define the projection onto the Frobenius sphere \(\mathcal S_F(\rho)=\{S:\|S\|_F=\rho\}\) by
\[
\begin{aligned}
\Pi_{\mathcal S_F(\rho)}(W) &\in \arg\min_{S \in \mathcal S_F(\rho)} \|S - W\|_F \\
&= \rho\, \frac{W}{\|W\|_F},
\end{aligned}
\]
where the last line holds for \(W\neq0\).
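As a concrete instance of these two maps (my choice of norm, motivated by the Muon example above, not something the argument depends on): under the spectral norm, the sharp operator returns the orthogonal polar factor \(UV^\top\) of the reduced SVD, and the Frobenius projection is a pure rescaling. A sketch:

```python
import numpy as np

def sharp_spectral(G):
    # Sharp operator for the spectral norm: argmax_{||T|| <= 1} <G, T>
    # is U V^T from the reduced SVD of G (the map Muon approximates).
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def project_frobenius_sphere(W, rho):
    # Pi_{S_F(rho)}(W) = rho * W / ||W||_F for W != 0: the nearest
    # point on the Frobenius sphere is a global rescaling of W.
    return rho * W / np.linalg.norm(W)
```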
Then we can write steepest descent (in the possibly non-Euclidean norm \(\|\cdot\|\)) followed by a Euclidean projection as:
\[
\begin{aligned}
\widetilde W^{k+1} &= W^k - \eta\, \big(\nabla f(W^k)\big)^\sharp, \\
W^{k+1} &= \Pi_{\mathcal S_F(\rho)}\big(\widetilde W^{k+1}\big).
\end{aligned}
\]
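In code, one iterate is just the composition of the two helpers sketched above (reusing sharp_spectral and project_frobenius_sphere; eta and rho are hyperparameters of this sketch):

```python
def projected_steepest_descent_step(W, G, eta, rho):
    # Steepest descent step in the chosen norm, then Euclidean
    # projection onto the Frobenius sphere of radius rho.
    W_tilde = W - eta * sharp_spectral(G)          # tilde W^{k+1}
    return project_frobenius_sphere(W_tilde, rho)  # W^{k+1}
```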
For any positive scalar \(c>0\), \( \operatorname{RMSNorm}(c\,Wx) = \operatorname{RMSNorm}(Wx). \) Taking \(c=\rho/\|\widetilde W^{k+1}\|_F\), the normalized layer output is thus unchanged:¹
\[ \operatorname{RMSNorm}\big(W^{k+1}x\big) = \operatorname{RMSNorm}\left(\frac{\rho}{\|\widetilde W^{k+1}\|_F}\,\widetilde W^{k+1}x\right) = \operatorname{RMSNorm}\big(\widetilde W^{k+1}x\big). \]

¹ The same invariance can also hold for weights that are not immediately followed by RMSNorm. For example, in an MLP block of the form \(\operatorname{RMSNorm}(W_2 \sigma(W_1x))\), scaling \(W_1\) by a positive constant leaves the mapping unchanged as long as the activation function \(\sigma\) is positively homogeneous, as is the case for ReLU and ReLU\(^2\).
So the Frobenius projection preserves the function value of a normalized layer:
\[ f\big(\Pi_{\mathcal S_F(\rho)}(\widetilde W^{k+1})\big) = f\big(\widetilde W^{k+1}\big). \]
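This is easy to verify numerically with any loss that reads \(W\) only through the normalized layer; the squared-error loss below is an illustrative choice of mine, reusing rmsnorm, project_frobenius_sphere, rng, and x from the sketches above:

```python
def f(W, x, y):
    # Any loss that consumes only RMSNorm(W x) is invariant to
    # positive rescaling of W.
    return 0.5 * np.sum((rmsnorm(W @ x) - y) ** 2)

y = rng.standard_normal(4)
W_tilde = rng.standard_normal((4, 3))

rho = 2.0
assert np.isclose(f(project_frobenius_sphere(W_tilde, rho), x, y),
                  f(W_tilde, x, y))
```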
The descent analyses of steepest descent reason through function values by showing that
\[ f\big(\widetilde W^{k+1}\big) \le f\big(W^k\big) - \eta\, \big\|\nabla f(W^k)\big\|_\dagger + \frac{L\eta^2}{2}, \]
where \(L\) is the smoothness constant of \(f\) with respect to \(\|\cdot\|\).
Since the subsequent Frobenius projection does not change the function value, the projected iterate \(W^{k+1}=\Pi_{\mathcal S_F(\rho)}(\widetilde W^{k+1})\) satisfies
\[ f\big(W^{k+1}\big) = f\big(\widetilde W^{k+1}\big) \le f\big(W^k\big) - \eta\, \big\|\nabla f(W^k)\big\|_\dagger + \frac{L\eta^2}{2}. \]
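Putting the pieces together on the toy loss above (none of this code is from the original post; finite-difference gradients keep the sketch self-contained, and f, rho, x, y, sharp_spectral, and project_frobenius_sphere are reused from the earlier sketches):

```python
def grad_f(W, x, y, h=1e-6):
    # Central finite-difference gradient of the toy loss; crude but
    # sufficient for a sanity check.
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = h
            G[i, j] = (f(W + E, x, y) - f(W - E, x, y)) / (2 * h)
    return G

eta = 0.05
W = rng.standard_normal((4, 3))
for k in range(5):
    W_tilde = W - eta * sharp_spectral(grad_f(W, x, y))
    W_next = project_frobenius_sphere(W_tilde, rho)
    # The projection never changes the function value ...
    assert np.isclose(f(W_next, x, y), f(W_tilde, x, y))
    # ... so whatever descent the unprojected step achieved is kept.
    print(k, f(W, x, y), "->", f(W_next, x, y))
    W = W_next
```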
So, at least in the scale-invariant setting induced by layer normalization, the Frobenius projection does not break the descent lemma. It just chooses a representative of the same function with controlled Frobenius norm. In fact, there is nothing special about the Frobenius norm here: projecting onto the sphere of any norm is likewise a global rescaling of \(W\), so the same argument goes through.
Cite this post
@misc{pethick2026frobeniusnormalization,
author = {Thomas Pethick},
title = {Why Frobenius weight normalization is ok},
year = {2026},
month = {05},
day = {13},
url = {https://pethick.dk/posts/2026-05-13-frobenius-normalization/},
note = {Blog post}
}