Online convex optimization#
Posted on 2019-11-02
Disclaimer
We will require some familiarity with convex optimization.
What game are we playing?#
The player is given a fixed compact convex decision set \(\mathcal X\). At each round \(t=1,...,T\):
The player picks \(\mathbf x_t \in \mathcal X\).
The environment picks a convex loss function \(f_t: \mathcal X \rightarrow [0,1]\).
The player then gets access to \(f_t\) and suffers loss \(f_t(\mathbf x_t)\).
We want to have some notion of whether we are doing well. In the above setting, judging ourselves by \(\sum_{t=1}^T f_t(\mathbf x_t)\) alone we could never do well, since the environment can pick a new adversarial function \(f_t\) every round. Instead we compare ourselves with the best fixed strategy \(\mathbf x\):
(Regret)
The regret of the player is defined as
\[\mathcal R_T = \sum_{t=1}^{T} f_t(\mathbf x_t) - \min_{\mathbf x \in \mathcal X} \sum_{t=1}^{T} f_t(\mathbf x).\]
This is called the regret as it captures how badly we did in hindsight. Ideally we want \(\mathcal R_T = o(T)\), since then \(\lim_{T\rightarrow \infty} \mathcal R_T/T = 0\).
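To make the protocol and the regret concrete, here is a tiny simulation sketch (my own illustration, not taken from the references below; `play_oco`, its arguments, and the brute-force grid minimization are all simplifying assumptions):

```python
def play_oco(player, make_loss, T, xs_grid):
    """Run the OCO protocol for T rounds and return the player's regret,
    measured against the best fixed decision on the finite grid xs_grid
    (a crude stand-in for minimizing over all of X)."""
    revealed = []                     # the loss functions f_1, ..., f_t seen so far
    player_loss = 0.0
    for t in range(1, T + 1):
        x_t = player(revealed)        # player picks x_t knowing only f_1, ..., f_{t-1}
        f_t = make_loss(t)            # environment picks a convex loss f_t
        player_loss += f_t(x_t)       # player suffers f_t(x_t) and only then sees f_t
        revealed.append(f_t)
    best_fixed = min(sum(f(x) for f in revealed) for x in xs_grid)
    return player_loss - best_fixed   # the regret R_T


# Toy usage: a player that stubbornly plays 0.0 against the loss f_t(x) = (x - 1)^2 / 4
# on X = [-1, 1]; its regret grows linearly in T.
grid = [i / 10.0 for i in range(-10, 11)]
print(play_oco(lambda revealed: 0.0, lambda t: (lambda x: (x - 1.0) ** 2 / 4.0), 100, grid))
```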
Follow the Leader (FTL)#
To get a feel for the problem, let us first try a naive approach. A straightforward idea is what goes by the name of Follow The Leader (FTL) in the machine learning community and fictitious play in economics. We pick whatever would have been best so far:
\[\mathbf x_{t+1} = \argmin_{\mathbf x \in \mathcal X} \sum_{\tau=1}^{t} f_\tau(\mathbf x).\]
However, there is a simple adversarial strategy the environment can use against this algorithm. Consider \(\mathcal X = \{-1,1\}\); the environment then picks
So when minimizing the following at every \(t\) we have
We will end up always picking the \(\mathbf x_{t+1}\) that incurs the maximum loss \(f_{t+1}(\mathbf x_{t+1})\), so FTL suffers regret linear in \(T\), which is as bad as it gets. Can we do better?
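To see this failure numerically, here is a small standalone sketch (my own illustrative construction: \(\mathcal X = [-1,1]\) with linear losses of alternating sign, a continuous variant of the example above; the losses here take values in \([-1,1]\) rather than \([0,1]\), which changes neither FTL's choices nor the regret):

```python
def ftl_linear_regret(T):
    """FTL on X = [-1, 1] with linear losses f_t(x) = z_t * x, where z_1 = -0.5
    and z_t alternates +1 / -1 afterwards.  FTL always jumps to the endpoint that
    minimizes the cumulative loss so far, gets fooled every round, and ends up
    with regret linear in T."""
    zs = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)]
    cum_z, ftl_loss = 0.0, 0.0
    for z in zs:
        # argmin over [-1, 1] of cum_z * x is an endpoint (anything if cum_z == 0)
        x = 0.0 if cum_z == 0 else (-1.0 if cum_z > 0 else 1.0)
        ftl_loss += z * x          # loss suffered this round
        cum_z += z                 # f_t is revealed; update the leader statistics
    best_fixed = -abs(cum_z)       # best fixed x for the total loss x * sum_t z_t
    return ftl_loss - best_fixed


print([ftl_linear_regret(T) for T in (10, 100, 1000)])   # grows roughly like T
```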
Be The Leader (BTL)#
Let’s say that we could cheat and play \(\mathbf x_{t+1}\) at time \(t\) – thus avoiding the adversary. For notational convenience we still keep the same definition:
However, now in the regret analysis we incur \(f_t(\mathbf x_{t+1})\) instead of \(f_t(\mathbf x_t)\). We can now redo the regret analysis:
Not surprisingly, BTL's regret is at most zero. But it provides an interesting insight:
\[\mathcal R_T^{\mathrm{FTL}} \leq \sum_{t=1}^{T} \left[ f_t(\mathbf x_t) - f_t(\mathbf x_{t+1}) \right].\]
The above is just a simplification of \(\mathcal R_T^{\mathrm{FTL}} - 0 \leq \mathcal R_T^{\mathrm{FTL}} - \mathcal R_T^{\mathrm{BTL}}\), which is immediately obtained by applying the BTL bound \(\mathcal R_T^{\mathrm{BTL}} \leq 0\).
So \(\mathcal R_T^{\mathrm{FTL}}\) is no worse than the sum of the differences \(f_t(\mathbf x_t) - f_t(\mathbf x_{t+1})\). Intuitively, a fix for FTL is then to make sure the strategy \(\mathbf x_t\) does not change too much between rounds – in other words: we want to regularize / stabilize!
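As an aside, in case the BTL bound \(\mathcal R_T^{\mathrm{BTL}} \leq 0\) itself is not obvious: it amounts to \(\sum_{t=1}^{T} f_t(\mathbf x_{t+1}) \leq \min_{\mathbf x \in \mathcal X}\sum_{t=1}^{T} f_t(\mathbf x)\), which follows by a short induction on \(T\) (my sketch of the standard argument). The case \(T=1\) holds by the definition of \(\mathbf x_2\), and for the inductive step
\[
\sum_{t=1}^{T} f_t(\mathbf x_{t+1}) \leq \sum_{t=1}^{T-1} f_t(\mathbf x_{T+1}) + f_T(\mathbf x_{T+1}) = \sum_{t=1}^{T} f_t(\mathbf x_{T+1}) = \min_{\mathbf x \in \mathcal X}\sum_{t=1}^{T} f_t(\mathbf x),
\]
where the inequality applies the induction hypothesis with the fixed comparator \(\mathbf x_{T+1}\) and the last equality uses the definition of \(\mathbf x_{T+1}\).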
Follow The Regularized Leader (FTRL)#
We modify our selection to
\[\mathbf x_{t+1} = \argmin_{\mathbf x \in \mathcal X} \sum_{\tau=1}^{t} f_\tau(\mathbf x) + \frac{1}{\eta}\psi(\mathbf x),\]
where \(\eta > 0\) is the learning rate and \(\psi: \mathcal X \rightarrow \mathbb R\) is a regularizer.
The regularizer is there to ensure that our choice does not change too much between iterations. To capture this we will rely on \(\psi\) being strongly convex. That is,
\[\psi(\mathbf x) \geq \psi(\mathbf y) + \nabla \psi(\mathbf y)^\top (\mathbf x - \mathbf y) + \frac{1}{2}\|\mathbf x - \mathbf y\|^2 \quad \text{for all } \mathbf x, \mathbf y \in \mathcal X,\]
for some norm \(\|\cdot\|\).
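For concreteness (my example, not spelled out in the notes): the squared Euclidean norm \(\psi(\mathbf x) = \frac 12 \|\mathbf x\|_2^2\) satisfies this with \(\|\cdot\| = \|\cdot\|_2\), since \(\nabla \psi(\mathbf y) = \mathbf y\) and
\[
\tfrac 12\|\mathbf x\|_2^2 - \tfrac 12\|\mathbf y\|_2^2 - \mathbf y^\top(\mathbf x - \mathbf y) = \tfrac 12\|\mathbf x - \mathbf y\|_2^2,
\]
so the inequality even holds with equality. Other classic choices, such as the negative entropy over the probability simplex, are strongly convex with respect to \(\|\cdot\|_1\).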
Stability#
This allows us to prove the following stability property.
(Stability)
Stability says that
\[f_t(\mathbf x_t) - f_t(\mathbf x_{t+1}) \leq \eta \, \|\nabla f_t(\mathbf x_t)\|_*^2,\]
where \(\|\cdot\|_*\) is the dual norm of \(\|\cdot\|\), defined as \(\|\mathbf x\|_{*}=\sup _{\|\mathbf y\| \leq 1} \mathbf x^{\top} \mathbf y\).
Proof. Let's define the function we minimize in the decision selection,
\[F_t(\mathbf x) = \sum_{\tau=1}^{t} f_\tau(\mathbf x) + \frac{1}{\eta}\psi(\mathbf x).\]
By strong convexity of \(\psi\) and convexity of \(f_\tau\) we have the following bound, which will ultimately let us bound \(f_t(\mathbf x_t) - f_t(\mathbf x_{t+1})\):
FTRL defines \(\mathbf x_{t}=\argmin_{\mathbf{x}} F_{t-1}(\mathbf x)\), so \(\mathbf x_t\) is the minimizer of \(F_{t-1}\), and the first-order optimality condition (for a convex \(f\) minimized at \(\mathbf y\) over a convex set, \(\nabla f(\mathbf y)^\top (\mathbf x - \mathbf y) \geq 0\) for every \(\mathbf x\) in the set) becomes
So a weaker version of (1) is
The second line simply applies the same reasoning to \(F_t\) and its minimizer \(\mathbf x_{t+1}\). We can now sum the two lines and expand the definitions of \(F_{t-1}\) and \(F_t\). All terms cancel except the last term of \(F_t\), namely \(f_t\), evaluated at \(\mathbf x_{t+1}\) and \(\mathbf x_{t}\), so we get
Now we have the tools to bound the change.
Solving for \(f_t(\mathbf x_t) -f_t(\mathbf x_{t+1})\) we get
Equipped with stability and the BTL lemma we almost directly obtain the regret bound for FTRL.
Regret bound for FTRL#
With learning rate \(\eta\) we have
\[\mathcal R_T^{\mathrm{FTRL}} \leq \frac{D}{\eta} + \eta \sum_{t=1}^{T} \|\nabla f_t(\mathbf x_t)\|_*^2,\]
where \(D=\max_{\mathbf x} \psi(\mathbf x)-\min_{\mathbf x} \psi(\mathbf x)\).
Proof. By defining \(f_0(\mathbf x) = \frac 1\eta \psi(\mathbf x)\) we can write the regularized selection as
\[\mathbf x_{t+1} = \argmin_{\mathbf x \in \mathcal X} \sum_{\tau=0}^{t} f_\tau(\mathbf x).\]
First note that by the BTL lemma we have (additionally using \(f_0(\mathbf x^*) \geq f_0(\mathbf x_1)\), since BTL only applies to \(t=1,...,T\))
with
Now we can bound the FTRL regret as follows (we want to use stability, so we need to move from \(\mathbf x^*\) to \(\mathbf x_{t+1}\); we do this by putting the regret in a form to which we can apply the BTL lemma):
Further, if \(f_t(\mathbf x)\) is \(G\)-Lipschitz (so that the gradient is bounded, \(\| \nabla f_t(\mathbf x_t)\|_* \leq G\)) and the optimal learning rate \(\eta = \sqrt{\frac {D}{TG^2}}\) is chosen, then we have
\[\mathcal R_T^{\mathrm{FTRL}} \leq 2\sqrt{DTG^2} = 2G\sqrt{DT}.\]
The optimal \(\eta\) is found by simply optimizing the above regret bound, which is of the form \(\frac{a}{\eta} + \eta b\) with \(a = D\) and \(b = TG^2\).
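To spell out that optimization (a quick calculation of my own, under the assumption that the bound has this \(\frac a\eta + \eta b\) shape): setting the derivative to zero,
\[
\frac{\mathrm d}{\mathrm d\eta}\Big(\frac{a}{\eta} + \eta b\Big) = -\frac{a}{\eta^2} + b = 0 \quad\Longrightarrow\quad \eta = \sqrt{\frac ab}, \qquad \frac a\eta + \eta b = 2\sqrt{ab}.
\]
With \(a = D\) and \(b = TG^2\) this gives \(\eta = \sqrt{\frac{D}{TG^2}}\) and the \(2\sqrt{DTG^2} = 2G\sqrt{DT}\) bound above.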
FTRL with linear losses#
At this point we have good regret bounds, but at every iteration \(t\) we have to minimize a sum of \(t\) convex functions. This simplifies significantly if we assume that the loss is linear, \(f_{t}(\mathbf x)=\left\langle \mathbf g_{t}, \mathbf x\right\rangle\), since we can move the sum inside the inner product:
\[\sum_{\tau=1}^{t} f_\tau(\mathbf x) = \sum_{\tau=1}^{t}\left\langle \mathbf g_{\tau}, \mathbf x\right\rangle = \Big\langle \sum_{\tau=1}^{t} \mathbf g_{\tau}, \mathbf x\Big\rangle.\]
Now we only have to keep track of a sum of gradients \(\mathbf u_t = \sum_{\tau=1}^{t} \mathbf g_\tau\), and the minimization problem reduces to the sum of a linear function and the strongly convex regularizer. So we can write the program as:
(FTRL with linear loss)
\(\mathbf x_{t+1}=\argmin_{\mathbf x}\left\langle \mathbf u_{t}, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x)\)
\(\mathbf u_{t+1}=\mathbf u_{t}+\mathbf g_{t+1}\)
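As a concrete instantiation, here is a short sketch in code (my own example, not from the notes: it assumes the Euclidean regularizer \(\psi(\mathbf x)=\frac12\|\mathbf x\|_2^2\) and the ball \(\mathcal X = \{\mathbf x : \|\mathbf x\|_2 \leq R\}\), for which the \(\argmin\) has a closed form: project \(-\eta\,\mathbf u_t\) back onto the ball):

```python
import numpy as np

def ftrl_linear(gradients, eta, radius):
    """FTRL with linear losses f_t(x) = <g_t, x>, regularizer psi(x) = ||x||_2^2 / 2,
    and decision set X = {x : ||x||_2 <= radius}.  The update
        x_{t+1} = argmin_x <u_t, x> + ||x||_2^2 / (2 * eta)   over the ball
    is the Euclidean projection of -eta * u_t onto the ball."""
    u = np.zeros(len(gradients[0]))   # running sum of gradients, u_0 = 0
    xs = []
    for g in gradients:
        x = -eta * u                  # unconstrained minimizer of the FTRL objective
        norm = np.linalg.norm(x)
        if norm > radius:             # project back onto the ball if needed
            x *= radius / norm
        xs.append(x)                  # play x_t (computed from u_{t-1}) ...
        u = u + np.asarray(g)         # ... then observe g_t and update u_t = u_{t-1} + g_t
    return xs
```

With this regularizer the updates coincide with "lazy" projected online gradient descent, often called dual averaging.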
Linear loss is sufficient#
It turns out that we can assume linear losses without loss of generality! That is, we can get the same regret bound for convex loss functions even if we run the above algorithm using only the gradient \(\mathbf g_t = \nabla f_t(\mathbf x_t)\) at each \(\mathbf x_t\). To see this, consider the convexity bound
\[f_t(\mathbf x_t) - f_t(\mathbf x) \leq \left\langle \nabla f_t(\mathbf x_t), \mathbf x_t - \mathbf x \right\rangle.\]
So we can bound the regret for \(f_t(\mathbf x)\) by bounding the regret for the new problem which only considers its gradient, \(f'_t(\mathbf x)=\langle\nabla f_t(\mathbf x_t), \mathbf x\rangle\). The bound stays just as tight, since we already used convexity in this way in our argument for stability.
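Continuing the earlier sketch (same assumed Euclidean setup; the quadratic losses and drifting targets below are just my toy example), running the linear-loss algorithm on general convex losses only requires feeding it the gradients evaluated at the points actually played:

```python
import numpy as np

def ftrl_linearized(grad_fns, eta, radius, dim):
    """Linear-loss FTRL driven by gradient oracles of convex losses:
    at round t we play x_t and then feed g_t = grad f_t(x_t) into the running sum."""
    u = np.zeros(dim)
    xs = []
    for grad in grad_fns:
        x = -eta * u
        norm = np.linalg.norm(x)
        if norm > radius:
            x *= radius / norm           # keep x_t inside the ball
        xs.append(x)
        u = u + grad(x)                  # linearize f_t at the played point x_t
    return xs


# Toy run: f_t(x) = ||x - theta_t||_2^2 with slowly drifting targets theta_t.
thetas = [np.array([np.cos(t / 50.0), np.sin(t / 50.0)]) for t in range(200)]
grad_fns = [(lambda x, th=th: 2.0 * (x - th)) for th in thetas]
print(ftrl_linearized(grad_fns, eta=0.05, radius=2.0, dim=2)[-1])   # final iterate
```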
These notes are primarily based on Prof. Luo's lecture notes [CSCI 699: Introduction to Online Learning, 2019] with minor modifications borrowed from Hazan's book [Hazan, 2019] and the lecture notes of Prof. Orabona [Introduction to Online Learning, 2019].
- Int19
Francesco Orabona. Introduction to Online Learning. 2019. URL: https://parameterfree.com/lecture-notes-on-online-learning/ (visited on 2019-11-02).
- CSC19
Haipeng Luo. CSCI 699: Introduction to Online Learning. 2019. URL: https://haipeng-luo.net/courses/CSCI699/ (visited on 2019-11-02).
- Haz19
Elad Hazan. Introduction to Online Convex Optimization. 2019. URL: http://arxiv.org/abs/1909.05207 (visited on 2019-11-02), arXiv:1909.05207.