\[\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min}\]

Online convex optimization

Posted on 2019-11-02

Disclaimer

We will require some familiarity with convex optimization.

What game are we playing?

The player is given a fixed compact convex decision set \(\mathcal X\). At each round \(t=1,...,T\):

  1. The player picks \(\mathbf x_t \in \mathcal X\).

  2. The environment picks a convex loss function \(f_t: \mathcal X \rightarrow [0,1]\).

  3. The player then gets access to \(f_t\) and suffers loss \(f_t(\mathbf x_t)\).

We want to have some notion of whether we are doing well. In the above setting, simply using the cumulative loss \(\sum_{t=1}^T f_t(\mathbf x_t)\) we could never do well, since the environment can pick a new adversarial function \(f_t\) every round. Instead we compare ourselves with the best fixed strategy \(\mathbf x\):

Definition 1 (Regret)

The regret of the player is defined as

\[\mathcal R_T = \sum_{t=1}^T f_t(\mathbf x_t) - \min_{\mathbf x} \sum_{t=1}^T f_t(\mathbf x).\]

This is called the regret as it captures how badly we did in hindsight. Ideally we want \(\mathcal R_T = o(T)\), since then \(\lim_{T\rightarrow \infty} \mathcal R_T/T = 0\) and we perform as well as the best fixed strategy on average.
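To make the protocol concrete, here is a minimal sketch of the interaction loop and of the regret computation. The `player` and `environment` callables and the candidate grid are hypothetical stand-ins for illustration, not part of any library discussed here.

```python
import numpy as np

def play_game(player, environment, T):
    """Run the online convex optimization protocol for T rounds.

    `player(history)` returns a point x_t in the decision set and
    `environment(t)` returns the convex loss f_t; both are hypothetical
    callables used only to illustrate the order of moves.
    """
    losses, xs, fs = [], [], []
    for t in range(1, T + 1):
        x_t = player(list(zip(fs, xs)))   # 1. player moves first
        f_t = environment(t)              # 2. environment picks f_t
        losses.append(f_t(x_t))           # 3. player suffers f_t(x_t)
        xs.append(x_t)
        fs.append(f_t)
    return np.array(losses), fs

def regret(losses, fs, candidates):
    """Regret against the best fixed decision from a grid of candidates."""
    best_fixed = min(sum(f(x) for f in fs) for x in candidates)
    return losses.sum() - best_fixed
```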

Follow the Leader (FTL)

To get a feel, let us first try a naive approach. A straightforward idea is what goes by the name of Follow The Leader (FTL) in the machine learning community and fictitious play in economics. We pick

\[\mathbf x_t = \argmin_{\mathbf x} \sum_{\tau=1}^{t-1} f_\tau (\mathbf x) \quad \text{(best strategy so far)}\]

However, there is a simple adversarial strategy against this algorithm for the environment. Consider \(\mathcal X = [-1,1]\); the environment picks

\[\begin{split} \begin{aligned} f_1(\mathbf x) &= \tfrac 12 \mathbf x \quad \text{and} \\ f_t(\mathbf x) &\quad \text{alternates between $-\mathbf x$ and $\mathbf x$ for $t=2,...,T$}. \end{aligned} \end{split}\]

So the cumulative loss after round \(t\), which FTL minimizes to pick \(\mathbf x_{t+1}\), is

\[\begin{split} \sum_{\tau=1}^{t} f_\tau (\mathbf x) = \begin{cases} \frac 12 \mathbf x & t\ \text{is odd} \\ - \frac 12 \mathbf x & \text{otherwise} \end{cases} \end{split}\]
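To see the damage concretely, here is a small simulation of FTL against this sequence on \(\mathcal X = [-1,1]\) (a sketch: the grid discretization of \(\mathcal X\) and the arbitrary first-round play are choices of mine, not from the original notes).

```python
import numpy as np

T = 1000
grid = np.linspace(-1.0, 1.0, 2001)                           # discretized decision set [-1, 1]
signs = [0.5] + [(-1.0) ** (t - 1) for t in range(2, T + 1)]  # f_1(x) = x/2, then -x, x, -x, ...

cum = np.zeros_like(grid)      # cumulative loss of every fixed x so far
ftl_loss = 0.0
for t, s in enumerate(signs, start=1):
    x_t = grid[np.argmin(cum)] if t > 1 else 0.0  # FTL plays the current leader (x_1 arbitrary)
    ftl_loss += s * x_t                           # suffer f_t(x_t) = s_t * x_t
    cum += s * grid                               # update the leaderboard

print("FTL regret:", ftl_loss - cum.min())        # grows linearly in T
```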

FTL therefore always picks the \(\mathbf x_{t+1}\) on which the next loss is maximal, suffering \(f_{t+1}(\mathbf x_{t+1}) = 1\) at every round \(t \geq 1\), while the fixed strategy \(\mathbf x = 0\) suffers zero loss. So FTL achieves the worst possible, linear-in-\(T\), regret. Can we do better?

Be The Leader (BTL)

Let’s say that we could cheat and play \(\mathbf x_{t+1}\) at time \(t\) – thus avoiding the adversary. For notational convenience we still keep the same definition:

\[\mathbf x_t = \argmin_{\mathbf x} \sum_{\tau=1}^{t-1} f_\tau (\mathbf x).\]

However, now in the regret analysis we incur \(f_t(\mathbf x_{t+1})\) instead of \(f_t(\mathbf x_t)\). We can now redo the regret analysis:

\[\begin{split} \begin{aligned} \mathcal R_{T}^{\mathrm{BTL}} &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t+1}\right)-\min _{\mathbf x} \sum_{t=1}^T f_{t}(\mathbf x) \\ &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t+1}\right)-\sum_{t=1}^{T} f_{t}\left(\mathbf x_{T+1}\right) && \text{(by def of $\mathbf x_{T+1}$)}\\ &=\sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{t+1}\right)-\sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{T+1}\right) && \text{(last terms are equal $(*)$)}\\ & \leqslant \sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{t+1}\right)-\sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{T}\right) && \text{(since $\mathbf x_T$ minimizes $\sum_{t=1}^{T-1} f_t$)}\\ & \leqslant f_{1}\left(\mathbf x_{2}\right)-f_{1}\left(\mathbf x_{3}\right) && \text{(recurse the last two steps down to $t=1$)} \\ & \leqslant 0 && \text{(since $\mathbf x_2$ minimizes $f_1$)} \end{aligned} \end{split}\]

Not surprisingly, BTL’s regret is non-positive. But it provides an interesting insight:

\[ \sum_{t=1}^T f_t(\mathbf x_t) - \min_{\mathbf{x}} \sum_{t=1}^T f_t(\mathbf x) \leq \sum_{t=1}^T (f_t(\mathbf x_t) - f_t(\mathbf x_{t+1})). \]

The above is just a rearrangement of \(\mathcal R_T^{\mathrm{FTL}} - 0 \leq \mathcal R_T^{\mathrm{FTL}} - \mathcal R_T^{\mathrm{BTL}}\), which follows immediately from the BTL bound \(\mathcal R_T^{\mathrm{BTL}} \leq 0\).

So \(\mathcal R_T^{\mathrm{FTL}}\) is no worse than the sum of the differences between \(f_t(\mathbf x_t)\) and \(f_t(\mathbf x_{t+1})\). Intuitively, a fix for FTL is then to make sure the strategy \(\mathbf x_t\) does not change too much between rounds – in other words: we want to regularize / stabilize!

Follow The Regularized Leader (FTRL)

We modify our selection

\[ \mathbf x_{t}=\argmin_{\mathbf x} \sum_{\tau=1}^{t-1} f_{\tau}(\mathbf x)+\frac{1}{\eta} \psi(\mathbf x) \]

where \(\eta > 0\) is the learning rate and \(\psi: \mathcal X \rightarrow \mathbb R\) is a regularizer.

The regularizer is there to ensure that our choice does not change too much between iterations. To capture this we will rely on \(\psi\) being strongly convex. That is,

\[\psi(\mathbf x)-\psi(\mathbf y) \leqslant \nabla \psi(\mathbf x)^{T}(\mathbf x-\mathbf y)-\frac{1}{2}\|\mathbf x-\mathbf y\|^{2} \quad \forall \mathbf y, \mathbf x \in \mathcal X\]

for some norm \(||\cdot||\).
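To make the selection rule concrete, here is a sketch of a single FTRL step with the squared Euclidean regularizer \(\psi(\mathbf x) = \frac 12 \|\mathbf x\|_2^2\), which is 1-strongly convex w.r.t. \(\|\cdot\|_2\); SciPy's numerical minimizer stands in for the exact argmin and a box plays the role of \(\mathcal X\). The losses in the usage example are made up.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_step(past_losses, eta, bounds, x0):
    """Return x_t = argmin_x sum_{tau < t} f_tau(x) + psi(x) / eta over a box,
    with psi(x) = ||x||^2 / 2 as the regularizer."""
    def objective(x):
        return sum(f(x) for f in past_losses) + 0.5 * np.dot(x, x) / eta

    return minimize(objective, x0=x0, bounds=bounds).x   # numerical stand-in for the argmin

# usage: two past quadratic losses on the box [-1, 1]^2
fs = [lambda x: (x[0] - 0.3) ** 2, lambda x: (x[1] + 0.5) ** 2]
x_t = ftrl_step(fs, eta=0.5, bounds=[(-1, 1), (-1, 1)], x0=np.zeros(2))
print(x_t)
```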

Stability

This will allow us to prove the following notion of stability.

Lemma 1 (Stability)

Stability says that

\[f_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right) \leqslant \eta\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}\]

where \(\|\cdot\|_*\) is the dual norm of \(\|\cdot\|\), defined as \(\|\mathbf x\|_{*}=\sup _{\|\mathbf y\| \leqslant 1} \mathbf x^{\top} \mathbf y\) (for example, the Euclidean norm is its own dual).


Proof. Let’s define the function we minimize when selecting the decision:

\[ F_{t}(\mathbf x)=\sum_{\tau=1}^{t} f_{\tau}(\mathbf x)+\frac{1}{\eta} \psi(\mathbf x) \]

By strong convexity of \(\psi\) and convexity of the \(f_\tau\), the function \(F_{t-1}\) is \(\frac 1\eta\)-strongly convex, so we have (this will ultimately let us bound \(f_t(\mathbf x_t) - f_t(\mathbf x_{t+1})\))

(1)#\[F_{t-1}\left(\mathbf x_{t}\right)-F_{t-1}\left(\mathbf x_{t+1}\right) \leqslant\left\langle\nabla F_{t-1}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x_{t+1}\right\rangle-\frac{1}{2 \eta}\left\|\mathbf x_{t}-\mathbf x_{t+1}\right\|^{2} \quad \text{for any } \mathbf{x}_t, \mathbf{x}_{t+1} \in \mathcal X.\]

FTRL defines \(\mathbf x_{t}=\argmin_{\mathbf{x}} F_{t-1}(\mathbf x)\), so \(\mathbf x_t\) minimizes \(F_{t-1}\) over \(\mathcal X\) and the first-order optimality condition, \(\left\langle\nabla F_{t-1}\left(\mathbf x_{t}\right), \mathbf x-\mathbf x_{t}\right\rangle \geqslant 0\) for all \(\mathbf x \in \mathcal X\), becomes

\[ \left\langle\nabla F_{t-1}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x_{t+1}\right\rangle \leqslant 0. \]

Combining this with (1) gives

\[\begin{split}\begin{aligned} F_{t-1}(\mathbf x_{t})-F_{t-1}(\mathbf x_{t+1}) & \leqslant-\frac{1}{2 \eta}\|\mathbf x_{t}-\mathbf x_{t+1}\|^{2} \quad \text { and } \\ F_{t}(\mathbf x_{t+1})-F_{t}(\mathbf x_{t}) & \leqslant-\frac{1}{2 \eta}\|\mathbf x_{t}-\mathbf x_{t+1}\|^{2}.\end{aligned} \end{split}\]

The second line applies the same reasoning to \(F_{t}\), whose minimizer is \(\mathbf x_{t+1}\). We can now sum the two lines and expand the definition of \(F_i\). All terms cancel except the last loss term \(f_t\) in \(F_t(\mathbf x_{t+1})\) and \(F_t(\mathbf x_{t})\), so we get

(2)#\[\begin{split}\begin{aligned} f_{t}(\mathbf x_{t+1})-f_{t}(\mathbf x_{t}) \leqslant-\frac{1}{\eta}\|\mathbf x_{t}-\mathbf x_{t+1}\|^{2} &\Leftrightarrow \\ \|\mathbf x_{t}-\mathbf x_{t+1}\|^{2} \leqslant \eta(f_{t}(\mathbf x_{t})-f_{t}(\mathbf x_{t+1}))&.\end{aligned} \end{split}\]

Now we have the tools to bound the change.

\[\begin{split}\begin{aligned} f_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right) & \leqslant\left\langle\nabla f_{t}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x_{t+1}\right\rangle && \text {(Convexity)} \\ & \leqslant\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}\left\|\mathbf x_{t}-\mathbf x_{t+1}\right\| && \text {(Hölder's ineq.)} \\ & \leqslant\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*} \sqrt{\eta} \sqrt{f_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right)} && \text{(using (2))} \end{aligned}\end{split}\]

Solving for \(f_t(\mathbf x_t) -f_t(\mathbf x_{t+1})\) we get

\[f_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right) \leqslant \eta\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}\]
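As a quick numerical sanity check of the lemma (a sketch under assumptions of my own choosing: \(\psi(x) = \frac 12 x^2\), \(\mathcal X = [-1,1]\) and linear losses \(f_t(x) = g_t x\), for which the FTRL argmin has a simple clipping closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, T = 0.1, 10_000
g = rng.uniform(-1.0, 1.0, size=T)          # gradients of the linear losses f_t(x) = g_t * x
u = np.concatenate([[0.0], np.cumsum(g)])   # u[k] = g_1 + ... + g_k  (u[0] = 0)
x = np.clip(-eta * u, -1.0, 1.0)            # x[k] = argmin_{x in [-1,1]} u[k]*x + x^2/(2*eta), i.e. x_{k+1}

lhs = g * (x[:-1] - x[1:])                  # f_t(x_t) - f_t(x_{t+1})
assert np.all(lhs <= eta * g ** 2 + 1e-12)  # stability: <= eta * ||grad f_t(x_t)||^2
print("stability bound holds on all", T, "rounds")
```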

Equipped with stability and the BTL lemma we almost directly obtain the regret bound for FTRL.

Regret bound for FTRL

Theorem 1

With learning rate \(\eta\) we have

\[\mathcal R_{T}^{\mathrm{FTRL}} \leqslant \frac{D}{\eta}+\eta \sum_{t=1}^{T}\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}\]

where \(D=\max_{\mathbf x} \psi(\mathbf x)-\min_{\mathbf x} \psi(\mathbf x)\).


Proof. By defining \(f_0(\mathbf x) = \frac 1\eta \psi(\mathbf x)\) we can write the regularized selection as

\[\mathbf x_{t}=\argmin_{\mathbf x} \sum_{\tau=1}^{t-1} f_{\tau}(\mathbf x)+\frac{1}{\eta} \psi(\mathbf x)=\argmin_{\mathbf x} \sum_{\tau=0}^{t-1} f_{\tau}(\mathbf x)\]

First note that by the BTL lemma we have (we additionally use \(f_0(\mathbf x^*) \geq f_0(\mathbf x_1)\), which holds because \(\mathbf x_1\) minimizes \(f_0\); the BTL lemma itself only covers \(t=1,\dots,T\))

\[\sum_{t=0}^{T} f_{t}\left(\mathbf x^{*}\right) \geqslant \sum_{t=0}^{T} f_{t}\left(\mathbf x_{t+1}\right)\]

with

\[ \mathbf x^{*}=\argmin_{\mathbf x} \sum_{t=1}^{T} f_{t}(\mathbf x) \quad \text{(best fixed strategy)} \]

Now we can bound the FTRL regret as follows. To use stability we need to move from \(\mathbf x^*\) to \(\mathbf x_{t+1}\), which we do by rewriting the sum in a form to which the BTL lemma applies.

\[\begin{split} \begin{aligned} \mathcal R_T^{\mathrm{FTRL}} &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t}\right)-\sum_{t=1}^{T} f_{t}\left(\mathbf x^{*}\right) \\ &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t}\right)-\sum_{t=0}^{T} f_{t}\left(\mathbf x^{*}\right)+f_{0}\left(\mathbf x^{*}\right) \\ & \leqslant \sum_{t=1}^{T} f_{t}\left(\mathbf x_{t}\right)-\sum_{t=0}^{T} f_{t}\left(\mathbf x_{t+1}\right)+f_{0}\left(\mathbf x^{*}\right) && \text{(using BTL lemma)}\\ & \leqslant \eta \sum_{t=1}^{T} \left\| \nabla f_{t}\left(\mathbf x_{t}\right) \right\|_{*}^{2}+f_{0}\left(\mathbf x^{*}\right)-f_{0}\left(\mathbf x_{1}\right) && (\text {using stability}) \\ & \leqslant \eta \sum_{t=1}^{T}\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}+\frac{1}{\eta}(\underbrace{\max _{\mathbf x} \psi(\mathbf x) - \min_{\mathbf x} \psi(\mathbf x)}_{D})\end{aligned}\end{split}\]

Corollary 1

Further, if \(f_t(\mathbf x)\) is \(G\)-Lipschitz (so that the gradient is bounded, \(\| \nabla f_t(\mathbf x_t)\|_* \leq G\)) and the optimal learning rate \(\eta = \sqrt{\frac {D}{TG^2}}\) is chosen, then we have

\[\mathcal R_T^{\mathrm{FTRL}} = \mathcal O (G\sqrt{TD}).\]

Remark 1

The optimal \(\eta\) is found by simply optimizing the above regret bound which is of the form:

\[\argmin_{x > 0}\ a x+b x^{-1}=\frac{\sqrt{b}}{\sqrt{a}}.\]
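Concretely, writing the bound as \(a\eta + b\eta^{-1}\) with \(a = \sum_{t=1}^T \|\nabla f_t(\mathbf x_t)\|_*^2 \leqslant TG^2\) and \(b = D\), setting the derivative to zero gives

\[\frac{\mathrm d}{\mathrm d \eta}\left(a \eta+b \eta^{-1}\right)=a-\frac{b}{\eta^{2}}=0 \quad \Longrightarrow \quad \eta=\sqrt{\frac{b}{a}},\]

so with \(a = TG^2\) and \(b = D\) this is the corollary’s \(\eta = \sqrt{\frac{D}{TG^2}}\), and substituting back yields \(a\eta + b\eta^{-1} = 2\sqrt{ab} = 2G\sqrt{TD} = \mathcal O(G\sqrt{TD})\).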

FTRL with linear losses

At this point we have good regret bounds, but at every iteration \(t\) we have to minimize a sum of \(t\) convex functions. This simplifies significantly if we assume that the loss is linear, \(f_{t}(\mathbf x)=\left\langle \mathbf g_{t}, \mathbf x\right\rangle\), since we can push the summation inside the inner product:

\[\begin{split}\begin{aligned} \mathbf x_{t} &=\argmin_{\mathbf x} \sum_{\tau=1}^{t-1}\left\langle \mathbf g_{\tau}, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x) \\ &=\argmin_{\mathbf x} \left\langle\sum_{\tau=1}^{t-1} \mathbf g_\tau, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x). \end{aligned} \end{split}\]

Now we only have to keep track of the sum of gradients \(\mathbf u_t = \sum_{\tau=1}^{t} \mathbf g_\tau\), and the minimization problem reduces to a sum of a linear function and the strongly convex regularizer. So we can write the update as:

Algorithm 1 (FTRL with linear loss)

  1. Observe \(\mathbf g_t\) and update \(\mathbf u_{t}=\mathbf u_{t-1}+\mathbf g_{t}\)

  2. \(\mathbf x_{t+1}=\argmin_{\mathbf x}\left\langle \mathbf u_{t}, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x)\)
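Below is a minimal sketch of Algorithm 1 under the assumed choices \(\psi(\mathbf x) = \frac 12\|\mathbf x\|_2^2\) and \(\mathcal X\) a Euclidean ball, for which the argmin has a closed form: scale the negative gradient sum by \(\eta\) and project back onto the ball. The names and the toy gradients are mine, not from the notes.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ftrl_linear(gs, eta, radius=1.0):
    """FTRL with linear losses f_t(x) = <g_t, x> and psi(x) = ||x||_2^2 / 2.

    With this regularizer the argmin over the ball is the Euclidean
    projection of -eta * u_t, where u_t is the running sum of gradients."""
    u = np.zeros_like(gs[0])
    xs = [np.zeros_like(gs[0])]                    # x_1 minimizes the regularizer alone
    for g in gs:                                   # g_t is revealed after x_t is played
        u = u + g                                  # u_t = u_{t-1} + g_t
        xs.append(project_ball(-eta * u, radius))  # x_{t+1}
    return xs

# toy usage: alternating gradients in R^2
gs = [((-1.0) ** t) * np.ones(2) for t in range(1, 51)]
print(ftrl_linear(gs, eta=0.1)[-1])
```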

Linear loss is sufficient

It turns out that we can assume linear loss without loss of generality! That is, we can get the same regret bound for convex loss functions even if we run the above algorithm only using the gradient \(\mathbf g_t = \nabla f_t(\mathbf x_t)\) at each \(\mathbf x_t\). To see this, consider

\[\begin{split}\begin{aligned} \mathcal R_{T}^{\mathrm{FTRL}} & \leqslant \max _{\mathbf x} \sum_{t=1}^{T}\left(f_{t}\left(\mathbf x_{t}\right)-f_{t}(\mathbf x)\right) \\ & \leqslant \max _{\mathbf x} \sum_{t=1}^{T}\left\langle\nabla f_{t}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x\right\rangle. && \text{(Convexity applied to each term)}\end{aligned}\end{split}\]

So we can bound the regret for \(f_t(\mathbf x)\) by bounding the regret for the new problem which only considers its gradient, \(f'_t(\mathbf x)=\left\langle\nabla f_t(\mathbf x_t), \mathbf x\right\rangle\). The bound stays just as tight, since we already used convexity in our argument for stability.
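A short sketch of this linearization in action, with a made-up quadratic loss and the same unit-ball / squared-norm choices as in the sketch above: only the gradient at the played point enters the update.

```python
import numpy as np

eta, T, target = 0.1, 200, np.array([0.3, -0.5])   # hypothetical quadratic "tracking" losses
u, x, total = np.zeros(2), np.zeros(2), 0.0        # u: gradient sum, x: current iterate x_t
for t in range(1, T + 1):
    total += 0.5 * np.sum((x - target) ** 2)       # suffer the true convex loss f_t(x_t)
    g = x - target                                 # g_t = grad f_t(x_t): the linearized loss
    u += g                                         # same bookkeeping as Algorithm 1
    x = -eta * u                                   # argmin of <u_t, x> + ||x||^2 / (2 * eta)
    if np.linalg.norm(x) > 1.0:                    # project back onto the unit ball (our X)
        x /= np.linalg.norm(x)
print("average loss:", total / T)                  # approaches the best fixed loss, 0 at `target`
```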

These notes are primarily based on Prof. Luo’s lecture notes [CSCI 699: Introduction to Online Learning, 2019] with minor modifications borrowed from Hazan’s book [Hazan, 2019] and the lecture notes of Prof. Orabona [Introduction to Online Learning, 2019].


Int19

Francesco Orabona. Introduction to Online Learning. 2019. URL: https://parameterfree.com/lecture-notes-on-online-learning/ (visited on 2019-11-02).

CSC19

Haipeng Luo. CSCI 699: Introduction to Online Learning. 2019. URL: https://haipeng-luo.net/courses/CSCI699/ (visited on 2019-11-02).

Haz19

Elad Hazan. Introduction to Online Convex Optimization. 2019. URL: http://arxiv.org/abs/1909.05207 (visited on 2019-11-02), arXiv:1909.05207.