Polyak stepsize through a hyperplane projection interpretation#
Posted on 2024-06-10
The gradient method achieves a \(\mathcal O(1/\sqrt{T})\) rate for nonsmooth problems and a \(\mathcal O(1/T)\) rate for smooth problems. However, the stepsize needs to be chosen differently in the two regimes (decreasing and constant, respectively) and requires knowledge of the problem constants. In this blog post we will look at a very simple adaptive stepsize choice that circumvents these issues:
(Gradient method with Polyak stepsize)
Compute \(g^k \in \partial f(x^k)\)
Update \(x^{k+1} = x^k - \gamma_k g^k \quad \text{with}\quad \gamma_k = \tfrac{f(x^k) - f(x^\star)}{\| g^k\|^2}\)
The scheme only requires knowledge of the optimal function value \(f(x^\star)\). The adaptive stepsize choice \(\gamma_k\) might look mysterious at first, but as we will see shortly, it appears very naturally when constructing a halfspace containing the solution set.
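To make the scheme concrete, here is a minimal NumPy sketch of the gradient method with Polyak stepsize. The oracle names (`f`, `subgrad`, `f_star`) and the \(\ell_1\) example are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

def polyak_subgradient_method(f, subgrad, f_star, x0, num_iters=1000):
    """Subgradient method with the Polyak stepsize
    gamma_k = (f(x^k) - f(x*)) / ||g^k||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = subgrad(x)
        gap = f(x) - f_star
        if gap <= 0 or not np.any(g):        # already optimal (or zero subgradient)
            break
        x = x - (gap / np.dot(g, g)) * g     # Polyak step
    return x

# Illustrative use: minimize the nonsmooth f(x) = ||x||_1, whose optimal value is 0.
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)               # a valid subgradient of the l1-norm
x_hat = polyak_subgradient_method(f, subgrad, f_star=0.0, x0=np.ones(5))
```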
Problem formulation#
We will work under the following assumption, which includes all convex functions.
(Star-convex)
A function \(f: \mathbb R^d \rightarrow \mathbb R\) is star-convex with respect to all \(x^\star \in \mathcal X^\star \subseteq \argmin_x f(x)\) if, for all \(x \in \mathbb R^d\) and all \(g \in \partial f(x)\),
\[
f(x^\star) \geq f(x) + \braket{g, x^\star - x}.
\]
Method#
Say we are given an iterate \(x^k \in \mathbb R^d\) and an associated subgradient \(g^k \in \partial f(x^k)\). We can construct a halfspace that contains the solution set \(\mathcal X^\star\):
\[
\mathcal D(x^k) := \{w \in \mathbb R^d \mid f(x^k) - f(x^\star) + \braket{g^k, w - x^k} \leq 0\}.
\]
It is easy to verify that the solution set is contained by observing that (3) reduces to the inequality in Definition 3 when taking \(w=x^\star\).
If we can project onto \(\mathcal D(x^k)\), we know that we make progress towards a solution (as long as \(\operatorname{fix} \boldsymbol \Pi_{\mathcal{D}(x)}(x) \subseteq \argmin_x f(x)\)). Fortunately, the projection onto a halfspace is simple.
The projection \(\boldsymbol \Pi_{\mathcal{C}}(x):=\argmin_{z \in \mathcal{C}} \|z-x\|^2\) of a point \(x \not \in \mathcal{C}\) onto the set \(\mathcal{C} = \{z \mid \braket{a,z} \geq b\}\) is given as
\[
\boldsymbol \Pi_{\mathcal{C}}(x) = x + \tfrac{b - \braket{a,x}}{\|a\|^2}\, a.
\]
Applying the above lemma to our particular case, by taking \(\mathcal C= \mathcal D(x^k)\) such that \(a=- g^k\) and \(b=-\braket{ g^k,x^k} + f(x^k) - f(x^\star)\), leads to
\[
P(x^k) := \boldsymbol \Pi_{\mathcal D(x^k)}(x^k) = x^k - \tfrac{f(x^k) - f(x^\star)}{\|g^k\|^2}\, g^k
\]
when \(x^k \notin \mathcal D(x^k)\), and otherwise \(P(x^k)=x^k\). We can apply the projection iteratively, i.e. \(x^{k+1}=P(x^k)\). This is exactly the (sub)gradient method with the Polyak stepsize from Algorithm 8. In other words, we do not need to specify a stepsize!
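To see the equivalence numerically, the sketch below (with an arbitrary quadratic standing in for \(f\) and a hypothetical helper `project_halfspace`) implements the projection from the lemma and checks that projecting \(x^k\) onto \(\mathcal D(x^k)\) reproduces the Polyak update.

```python
import numpy as np

def project_halfspace(x, a, b):
    """Project x onto {z : <a, z> >= b}; returns x if it already lies inside."""
    slack = b - np.dot(a, x)
    if slack <= 0:                            # x already satisfies <a, x> >= b
        return x
    return x + (slack / np.dot(a, a)) * a

# Illustrative smooth problem: f(x) = 0.5 ||x||^2 with f(x*) = 0 at x* = 0.
f = lambda x: 0.5 * np.dot(x, x)
grad = lambda x: x
f_star = 0.0

x_k = np.array([3.0, -1.0])
g_k = grad(x_k)

# D(x^k) written as <a, z> >= b with a = -g^k and b = -<g^k, x^k> + f(x^k) - f(x*).
a = -g_k
b = -np.dot(g_k, x_k) + f(x_k) - f_star

proj = project_halfspace(x_k, a, b)
polyak = x_k - (f(x_k) - f_star) / np.dot(g_k, g_k) * g_k
assert np.allclose(proj, polyak)              # the projection is exactly the Polyak step
```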
Analysis#
Convergence analysis is straightforward, since we can rely on the firm quasi-nonexpansiveness of the projection to ensure convergence, i.e.
\[
\|P(x^k) - w^\star\|^2 \leq \|x^k - w^\star\|^2 - \|P(x^k) - x^k\|^2,
\]
where \(w^\star\in \operatorname{fix} P\). We just need to convince ourselves that \(w^\star\) also provides a solution to our original problem, i.e. \(\operatorname{fix} P \subseteq \argmin_x f(x)\). If \(x^k \notin \mathcal D(x^k)\), then \(f(x^k) > f(x^\star)\), so \(\gamma_k > 0\) and a fixed point requires \(g^k=0\), which implies that \(x^k\) is a minimizer. On the other hand, if \(x^k \in \mathcal D(x^k)\), then we know \(0 \geq f(x^k) - f(x^\star)\), which also implies \(x^k \in \argmin_x f(x)\). Thus, under star-convexity alone, convergence is established.
Rates follow directly from lower bounding the stepsize \(\gamma_k\), as made precise in the following theorem.
Suppose Definition 3 holds for \(f: \mathbb R^d \rightarrow \mathbb R\). Then \((x^k)_{k\in \mathbb N}\) generated by Algorithm 8 satisfies the following for all \(x^\star \in \mathcal X^\star\)
If \(f\) has bounded subgradients, i.e. \(\|g^k\| \leq G\) for all \(k\in \mathbb N\), then
\[ \min_{k=0,...,K-1} f(x^k) - f(x^\star) \leq \tfrac{G \| x^0 - x^\star \|}{\sqrt{K}} \]
If \(f\) has \(L\)-Lipschitz continuous gradients, then
\[ \min_{k=0,...,K-1} f(x^k) - f(x^\star) \leq \tfrac{2 L \| x^0 - x^\star \|^2}{K} \]
Proof. By simply expanding the update rule we have
\[
\begin{aligned}
\|x^{k+1} - x^\star\|^2 &= \|x^k - \gamma_k g^k - x^\star\|^2 \\
&= \|x^k - x^\star\|^2 - 2\gamma_k \braket{g^k, x^k - x^\star} + \gamma_k^2 \|g^k\|^2 \\
&\leq \|x^k - x^\star\|^2 - 2\gamma_k \left(f(x^k) - f(x^\star)\right) + \gamma_k^2 \|g^k\|^2,
\end{aligned}
\]
where the last line uses star-convexity (Definition 3). Using the definition of \(\gamma_k\) we establish the following descent inequality
\[
\|x^{k+1} - x^\star\|^2 \leq \|x^k - x^\star\|^2 - \tfrac{\left(f(x^k) - f(x^\star)\right)^2}{\|g^k\|^2}.
\]
What remains is to argue that the denominator \(\| g^k\|^2\) remains bounded. We proceed case by case.
For the nonsmooth case: By assumption, \(\|g^k\|\leq G\), so by telescoping (4) we immediately establish
\[
\sum_{k=0}^{K-1} \left(f(x^k) - f(x^\star)\right)^2 \leq G^2 \| x^0 - x^\star \|^2.
\]
The first claim follows by noting that the minimum is smaller than the average and that the square root can be brought inside the minimum due to monotonicity.
For the smooth case: By assumption \(\nabla f\) is \(L\)-Lipschitz continuous, which provides a better upper bound on the gradient norm, namely
\[
\|\nabla f(x^k)\|^2 \leq 2L \left(f(x^k) - f(x^\star)\right).
\]
Consequently, proceeding from (4) we have
\[
\tfrac{1}{2L}\sum_{k=0}^{K-1} \left(f(x^k) - f(x^\star)\right) \leq \| x^0 - x^\star \|^2.
\]
Observing that the minimum is smaller than the average completes the proof.
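As a quick numerical sanity check of the smooth-case bound above, one can run the method on a least-squares problem where \(f(x^\star)=0\) holds by construction; the data, dimensions, and iteration budget below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_star = rng.standard_normal(20)
b = A @ x_star                                 # consistent system, so f(x*) = 0

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient

x, K, best_gap = np.zeros(20), 200, np.inf
for k in range(K):
    g = grad(x)
    gap = f(x)                                 # f(x^k) - f(x*) since f(x*) = 0
    best_gap = min(best_gap, gap)
    if gap == 0.0:
        break
    x = x - (gap / np.dot(g, g)) * g           # Polyak step

print(best_gap, "<=", 2 * L * np.linalg.norm(x_star) ** 2 / K)   # smooth-case bound
```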
Conclusion#
There are a couple of things we did not cover:
The Polyak stepsize can also adapt to strong convexity, which is treated in Hazan and Kakade [2019].
The method requires knowledge of the optimal value, \(f(x^\star)\). Hazan and Kakade [2019] remove this requirement at the cost of a logarithmic factor in the complexity (provided we have a lower bound on \(f(x^\star)\)).
The stochastic case [Garrigos et al., 2023, Gower et al., 2021], which also uses a hyperplane projection approach.
See Ch. 5 in Boyd et al. [2003] for an example of the subgradient method with Polyak stepsize (Algorithm 8) applied to finding a point in the intersection of convex sets.
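As a teaser for that example, here is a small sketch (with two arbitrarily chosen balls as the sets) applying Algorithm 8 to \(f(x)=\max_j \operatorname{dist}(x,\mathcal C_j)\), whose optimal value is \(0\) whenever the intersection is nonempty. With this choice the Polyak step reduces to projecting onto the currently furthest set.

```python
import numpy as np

# Two Euclidean balls with nonempty intersection (centers and radii chosen arbitrarily).
balls = [(np.array([0.0, 0.0]), 2.0), (np.array([3.0, 0.0]), 2.0)]

def dist_and_proj(x, center, radius):
    """Distance from x to a ball and the projection of x onto it."""
    d = np.linalg.norm(x - center)
    if d <= radius:
        return 0.0, x.copy()
    return d - radius, center + radius * (x - center) / d

x = np.array([10.0, 5.0])
for _ in range(50):
    pairs = [dist_and_proj(x, c, r) for c, r in balls]
    j = int(np.argmax([d for d, _ in pairs]))
    f_x, proj = pairs[j]                  # f(x) = max_j dist(x, C_j), with f(x*) = 0
    if f_x < 1e-12:                       # x (numerically) lies in every ball
        break
    g = (x - proj) / f_x                  # unit-norm subgradient of the active distance
    x = x - (f_x / np.dot(g, g)) * g      # Polyak step; here it equals proj itself
```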
- BXM03
Stephen Boyd, Lin Xiao, and Almir Mutapcic. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter, 2003. URL: https://stanford.edu/class/ee364b/lectures/subgrad_method_notes.pdf.
- GGS23
Guillaume Garrigos, Robert M. Gower, and Fabian Schaipp. Function value learning: adaptive learning rates based on the Polyak stepsize and function splitting in ERM. arXiv preprint arXiv:2307.14528, 2023.
- GDR21
Robert M. Gower, Aaron Defazio, and Michael Rabbat. Stochastic Polyak stepsize with a moving target. arXiv preprint arXiv:2106.11851, 2021.
- HK19
Elad Hazan and Sham Kakade. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313, 2019.