Various ways of writing Nesterov’s acceleration
Posted on 2020-06-04
Nesterov’s acceleration is usually presented in the following form (AGM1), where \(\beta\) is the gradient Lipschitz parameter of \(f\).
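In one common indexing (which I will keep using below), the scheme reads
\[
\begin{aligned}
y_{t+1} &= x_{t}-\frac{1}{\beta}\nabla f\left(x_{t}\right),\\
x_{t+1} &= y_{t+1}+\frac{\lambda_{t}-1}{\lambda_{t+1}}\left(y_{t+1}-y_{t}\right),
\end{aligned}
\]
with \(x_{0}=y_{0}\), and where one standard choice of the momentum sequence is \(\lambda_{0}=1\), \(\lambda_{t+1}=\frac{1+\sqrt{1+4\lambda_{t}^{2}}}{2}\).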
The update is notoriously opaque. There are, however, alternative ways of writing this first-order scheme, two of which we will cover here.
Acceleration from primal dual perspective
We want to rewrite AGM1 as the following scheme (called AGM2), where \(f\) is \(\beta\)-smooth, \(\eta_{t}=\frac{t+1}{2 \beta}\) and \(\tau_{t}=\frac{2}{t+2}\). This is a linear coupling of a small step size update (\(y_{t+1}\)) and a large step size update (\(z_{t+1}\)).
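Written out (in one consistent indexing, which I will use below), AGM2 is
\[
\begin{aligned}
x_{t} &= \tau_{t} z_{t}+\left(1-\tau_{t}\right) y_{t},\\
y_{t+1} &= x_{t}-\frac{1}{\beta}\nabla f\left(x_{t}\right),\\
z_{t+1} &= z_{t}-\eta_{t}\nabla f\left(x_{t}\right),
\end{aligned}
\]
with \(y_{0}=z_{0}\).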
Derivation. Notice that the definition of \(y_{t+1}\) is already the same in both schemes, which leaves us with verifying the updates for \(x_{t+1}\) and \(z_{t+1}\). Let’s start by assuming that the correct update for \(x_{t}\) exists, i.e. that for some sequence \(z_{t}\) the AGM1 iterates satisfy the coupling
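\[
x_{t}=\tau_{t} z_{t}+\left(1-\tau_{t}\right) y_{t}.
\]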
This leaves us with checking whether the \(z_{t}\) defined this way follows the correct update. That is, we need to show that \(z_{t+1}-z_{t}=-\eta_{t} \nabla f\left(x_{t}\right)\), as in AGM2. First, we make a guess about the relationship between the parameters,
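namely (this is the guess that turns out to be consistent with the algebra below)
\[
\tau_{t}=\frac{1}{\lambda_{t}}, \qquad \text{equivalently} \qquad \frac{1-\tau_{t}}{\tau_{t}}=\lambda_{t}-1.
\]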
Using this guess after isolating \(z_{t}\) in the coupling equation gives us,
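\[
z_{t}=y_{t}+\frac{1}{\tau_{t}}\left(x_{t}-y_{t}\right)=y_{t}+\lambda_{t}\left(x_{t}-y_{t}\right).
\]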
Expanding \(z_{t+1}-z_{t}\)
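\[
z_{t+1}-z_{t}=\big(y_{t+1}+\lambda_{t+1}\left(x_{t+1}-y_{t+1}\right)\big)-\big(y_{t}+\lambda_{t}\left(x_{t}-y_{t}\right)\big).
\]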
Using the definition of \(x_{t+1}\) from AGM1,
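\[
z_{t+1}-z_{t}=y_{t+1}+\left(\lambda_{t}-1\right)\left(y_{t+1}-y_{t}\right)-y_{t}-\lambda_{t}\left(x_{t}-y_{t}\right),
\]
since \(\lambda_{t+1}\left(x_{t+1}-y_{t+1}\right)=\left(\lambda_{t}-1\right)\left(y_{t+1}-y_{t}\right)\).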
Collecting terms, this simplifies to
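\[
z_{t+1}-z_{t}=\lambda_{t}\left(y_{t+1}-y_{t}\right)-\lambda_{t}\left(x_{t}-y_{t}\right)=\lambda_{t}\left(y_{t+1}-x_{t}\right).
\]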
We can simplify further by using the update for \(y_{t+1}\),
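\[
z_{t+1}-z_{t}=\lambda_{t}\left(x_{t}-\tfrac{1}{\beta}\nabla f\left(x_{t}\right)-x_{t}\right)=-\frac{\lambda_{t}}{\beta}\nabla f\left(x_{t}\right).
\]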
What is \(\frac{\lambda_{t}}{\beta}\)? If \(\frac{\lambda_{t}}{\beta}=\eta_{t}\), then we are done.
So we are, weirdly enough, off by one: we would need \(\frac{\lambda_{t}}{\beta}=\eta_{t}=\frac{t+1}{2\beta}\), but with the choice of \(\lambda_{t}\) above, \(\frac{\lambda_{t}}{\beta}\) behaves roughly like \(\frac{t+2}{2\beta}\)…
Source: Bansal and Gupta [2017] and p. 468 of Drori and Teboulle [2014].
As momentum
NAG can be written as a momentum scheme where the gradient is queried at “a future point”.
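Concretely, with \(\alpha_{t}\) defined from the \(\lambda_{t}\) of AGM1 as below, the scheme is
\[
\begin{aligned}
v_{t+1} &= \alpha_{t} v_{t}-\frac{1}{\beta}\nabla f\left(y_{t}+\alpha_{t} v_{t}\right),\\
y_{t+1} &= y_{t}+v_{t+1},
\end{aligned}
\]
so the gradient is evaluated at the look-ahead point \(y_{t}+\alpha_{t} v_{t}\) rather than at \(y_{t}\).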
Derivation. We start with the \(x_{t+1}\) update of AGM1,
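\[
x_{t+1}=y_{t+1}+\frac{\lambda_{t}-1}{\lambda_{t+1}}\left(y_{t+1}-y_{t}\right).
\]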
Now, naturally define the momentum as \(v_{t+1} = y_{t+1} - y_t\) and the parameter \(\alpha_{t+1}=\frac{\lambda_{t}-1}{\lambda_{t+1}}\) for convenience. This lets us write the above as,
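\[
x_{t+1}=y_{t+1}+\alpha_{t+1} v_{t+1}.
\]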
This expression for \(x_{t+1}\) can be used in the gradient update for \(y_{t+2}\),
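\[
y_{t+2}=x_{t+1}-\frac{1}{\beta}\nabla f\left(x_{t+1}\right)=y_{t+1}+\alpha_{t+1} v_{t+1}-\frac{1}{\beta}\nabla f\left(y_{t+1}+\alpha_{t+1} v_{t+1}\right).
\]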
Finally, using this form of \(y_{t+2}\) in the definition of \(v_{t+2}\), we obtain the desired momentum update
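\[
v_{t+2}=y_{t+2}-y_{t+1}=\alpha_{t+1} v_{t+1}-\frac{1}{\beta}\nabla f\left(y_{t+1}+\alpha_{t+1} v_{t+1}\right),
\]
which is the update stated above, up to an index shift.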
Simply writing \(y_{t+2}\) using the definition of \(v_{t+2}\) completes the scheme
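\[
y_{t+2}=y_{t+1}+v_{t+2}.
\]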
Source: Sutskever et al. [2013].
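As a quick numerical sanity check, here is a minimal sketch of my own (assuming the \(\lambda_{0}=1\) convention from above): on a small quadratic, AGM1 and the momentum form produce the same \(y\)-iterates up to floating point error.

```python
import numpy as np

# Sanity-check sketch (assumes lambda_0 = 1, lambda_{t+1} = (1 + sqrt(1 + 4 lambda_t^2)) / 2):
# on f(x) = 0.5 x^T A x, the y-iterates of AGM1 and of the momentum form coincide.
A = np.diag([1.0, 10.0])
beta = 10.0                      # gradient Lipschitz constant = largest eigenvalue of A
grad = lambda x: A @ x

T = 20
lam = np.zeros(T + 2)
lam[0] = 1.0
for t in range(T + 1):
    lam[t + 1] = (1.0 + np.sqrt(1.0 + 4.0 * lam[t] ** 2)) / 2.0

x0 = np.array([5.0, 3.0])

# AGM1: y_{t+1} = x_t - grad(x_t)/beta,  x_{t+1} = y_{t+1} + (lam_t - 1)/lam_{t+1} (y_{t+1} - y_t)
x, y, ys_agm1 = x0.copy(), x0.copy(), []
for t in range(T):
    y_new = x - grad(x) / beta
    x = y_new + (lam[t] - 1.0) / lam[t + 1] * (y_new - y)
    y = y_new
    ys_agm1.append(y.copy())

# Momentum form: v_{t+1} = alpha_t v_t - grad(y_t + alpha_t v_t)/beta,  y_{t+1} = y_t + v_{t+1},
# with alpha_t = (lam_{t-1} - 1)/lam_t (alpha_0 is irrelevant since v_0 = 0).
y, v, ys_mom = x0.copy(), np.zeros(2), []
for t in range(T):
    alpha = (lam[t - 1] - 1.0) / lam[t] if t > 0 else 0.0
    v = alpha * v - grad(y + alpha * v) / beta
    y = y + v
    ys_mom.append(y.copy())

print(max(np.abs(a - b).max() for a, b in zip(ys_agm1, ys_mom)))  # ~0 up to float error
```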
- BG17
Nikhil Bansal and Anupam Gupta. Potential-function proofs for first-order methods. arXiv preprint arXiv:1712.04581, 2017. URL: https://arxiv.org/pdf/1712.04581.pdf.
- DT14
Yoel Drori and Marc Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1-2):451–482, 2014. URL: https://link.springer.com/content/pdf/10.1007/s10107-013-0653-0.pdf.
- SMDH13
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 1139–1147. PMLR, 2013. URL: http://proceedings.mlr.press/v28/sutskever13.pdf.