Online Convex Optimization

Disclaimer: we will require some familiarity with convex optimization.

1 What game are we playing?

The player is given a fixed compact convex decision set \mathcal X. At each round t=1,...,T:

  1. Player pick \mathbf x_t \in \mathcal X.
  2. The environment picks a convex loss function f_t: \mathcal X \rightarrow [0,1].
  3. The player then gets access to f_t and suffers loss f_t(\mathbf x_t).

We want to have some notion of whether we are doing well. In the above setting simply using \sum_{t=1}^T f_t(\mathbf x_t) we could never do well since the environment can pick a new adversarial function f_t every time. Instead we compare ourselves with the best fixed strategy \mathbf x:

Regret is defined as

\mathcal R_t = \sum_{t=1}^T f_t(\mathbf x_t) - \min_{\mathbf x} \sum_{t=1}^T f_t(\mathbf x).

This is called the regret as it captures how badly we did in hindsight Ideally we want \mathcal R_T = o(T) since then lim_{T\rightarrow \infty} \mathcal R_T/T = 0.

.

2 Follow the Leader (FTL)

To get a feel let’s first try a naive approach. A straight forward idea is what goes by the name of Follow The Leader (FTL) in the machine learning community and ficticious play in economics. We picks

\mathbf x_t = \operatorname*{arg\,min}_{\mathbf x} = \sum_{\tau=1}^{t-1} f_\tau (\mathbf x) \quad \text{(best strategy so far)}

However, there is a simple adversarial strategy against this algorithm for the environment. Consider \mathcal X = \set{-1,1} then the environment picks

 \begin{split} f_1(\mathbf x) &= \frac 12 \quad \text{and} \\ f_t(\mathbf x) &\quad \text{alternates between $-\mathbf x$ and $\mathbf x$ for $t=2,...,T$}. \end{split}

So when minimizing the following at every t we have

 \sum_{\tau=1}^{t} f_\tau (\mathbf x) = \begin{cases} \frac 12 \mathbf x & t\ \text{is odd} \\ - \frac 12 \mathbf x & \text{otherwise} \end{cases}

We will end up always picking the \mathbf x_{t+1} that leads to the maximum regret f_{t+1}(\mathbf x_{t+1}). So we achieve the lower bound of the regret. Can we do better?

2.1 Be The Leader (BLT)

Let’s say that we could cheat and play \mathbf x_{t+1} at time t — thus avoiding the adversary. For notational convenience we still keep the same definition:

\mathbf x_t = \operatorname*{arg\,min}_{\mathbf x} = \sum_{\tau=1}^{t-1} f_\tau (\mathbf x).

However, now in the regret analysis we incur f_t(\mathbf x_{t+1}) instead of f_t(\mathbf x_t). We can now redo the regret analysis:

 \begin{aligned} \mathcal R_{T}^{\mathrm{BTL}}    &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t+1}\right)-\min _{\mathbf x} \sum_{t=1}^T f_{t}(\mathbf x) \\    &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t+1}\right)-\sum_{t=1}^{T} f_{t}\left(\mathbf x_{T+1}\right) && \text{(by def of $\mathbf x_t$)}\\    &=\sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{t+1}\right)-\sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{T+1}\right) && \text{(last terms are equal $(*)$)}\\    & \leqslant \sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{t+1}\right)-\sum_{t=1}^{T-1} f_{t}\left(\mathbf x_{T}\right) && \text{(since $\mathbf x_T$ is the actual minimum)}\\   & \leqslant f_{1}\left(\mathbf x_{2}\right)-f_{1}\left(\mathbf x_{3}\right) && \text{(recurse until $T-i=1$ in $(*)$)} \\    & \leqslant 0 && \text{(since $f_1(\mathbf x_2)$ is the minimum.)} \end{aligned}

Not surprisingly BTL’s regret is negative. But it provides an interesting insight:

 \sum_{t=1}^T f_t(\mathbf x_t) - \min_{\mathbf{x}} \sum_{t=1}^T f_t(\mathbf x) \leq \sum_{t=1}^T (f_t(\mathbf x_t) - f_t(\mathbf x_{t+1})).

The above is just a simplification of \mathcal R_T^{\mathrm{FTL}} - 0 \leq \mathcal R_T^{\mathrm{FTL}} - \mathcal R_T^{\mathrm{BTL}} which is immediately obtained by application of the BTL bound.

So \mathcal R_T^{\mathrm{FTL}} is no worse than the difference between f_t(\mathbf x_t) and f_t(\mathbf x_{t+1}). Intuitively a solution to our problem with FTL is then to make sure the strategy \mathbf x_t does not change too much — in other words: we want to regularize / stabilize!

3 Follow The Regularized Leader (FTRL)

We modify our selection

 \mathbf x_{t}=\operatorname{argmin}_{\mathbf x} \sum_{t=1}^{T} f_{t}(\mathbf x)+\frac{1}{\eta} \psi(\mathbf x)

where \eta > 0 is the learning rate and \psi: \mathcal X \rightarrow \mathbb R is a regularizer.

The regularizer is there to ensure that our choice does not change too much between iterations. To capture this we will rely on \psi being strongly convex. That is, \psi(\mathbf x)-\psi(\mathbf y) \leqslant \nabla \psi(\mathbf x)^{T}(\mathbf x-\mathbf y)-\frac{1}{2}\|\mathbf x-\mathbf y\|^{2} \quad \forall \mathbf y, \mathbf x \in \mathcal X

for some norm ||\cdot||.

3.1 Stability

This will allow us to prove the following notion of stability.

Stability says that

f_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right) \leqslant \eta\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}

with \|\cdot\|_* is the dual norm of \|\cdot\| defined as \|\mathbf x\|_{*}=\sup _{\|\mathbf y\| \leqslant 1} \mathbf x^{\top} \mathbf y.

Let’s define the function we minimizes in the decision selection

 F_{f}(\mathbf x)=\sum_{t7tau=1}^{t} f_{r}(\mathbf x)+\frac{1}{\eta} \psi(\mathbf x)

By strong convexity of \psi and convexity of f_\tau we have that This will ultimately let us bound f_t(\textbf x_t) - f_t(\textbf x_{t+1}).

F_{t-1}\left(\mathbf x_{t}\right)-F_{t-1}\left(\mathbf x_{t+1}\right) \leqslant\left\langle\nabla F_{t-1}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x_{t+1}\right\rangle-\frac{1}{2 \eta}\left\|\mathbf x_{t}-\mathbf x_{t+1}\right\|^{2} \quad \text{for any } \mathbf{x}_t, \mathbf{x}_{t+1} \in \mathcal X.\qquad(1)

RFTL defines \mathbf x_{t}=\operatorname*{arg\,min}_{\mathbf{x}} F_{t-1}(\mathbf x) so F_{t-1}(\mathbf x_t) is the optimum and first order characterizationFirst order characterization of convexity says f(\mathbf y) - f(\mathbf x) \leq \nabla f(\mathbf y)^\top (\mathbf y - \mathbf x). So when f(\mathbf y) is minimum we have \nabla f(\mathbf y)^\top (\mathbf y - \mathbf x) \geq 0.

becomes

 \left\langle\nabla F_{t-1}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x_{t+1}\right\rangle \leqslant 0.

So a weaker bound of eq. 1 is

\begin{aligned} F_{t-1}(\mathbf x_{t})-F_{t-1}(\mathbf x_{t+1}) & \leqslant-\frac{1}{2 \eta}\|\mathbf x_{t}-\mathbf x_{t+1}\|^{2} \quad \text { and } \\  F_{t}(\mathbf x_{t+1})-F_{t}(\mathbf x_{t}) & \leqslant-\frac{1}{2 \eta}\|\mathbf x_{t}-\mathbf x_{t+1}\|^{2}.\end{aligned}

The second line simply applies the same reasoning. We can now sum the two lines and expand the definition of F_i. All terms will cancel out except the last term in F_t(\mathbf x_{t+1}) and F_t(\mathbf x_{t}) so we get

\begin{aligned} f_{t}(\mathbf x_{t+1})-f_{t}(\mathbf x_{t}) \leqslant-\frac{1}{y}\|\mathbf x_{t}-\mathbf x_{t+1}\|^{2} &\Leftrightarrow \\ \|\mathbf x_{t}-\mathbf x_{t+1}\|^{2} \leqslant \eta(f_{t}(\mathbf x_{t})-f_{t}(\mathbf x_{t+1}))&.\end{aligned} \qquad(2)

Now we have the tools to bound the change Hölder’s inequality:
\langle x, y\rangle \leqslant\|x\|\|y\|_{*}.

.

\begin{aligned} f_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right) & \leqslant\left\langle\nabla f_{t}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x_{t+1}\right\rangle && \text {(Convexity)} \\ & \leqslant\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|\left\|\mathbf x_{t}-\mathbf x_{t+1}\right\| && \text {(Hölder&39;s ineq.)} \\ & \leqslant\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*} \sqrt{\eta} \sqrt{F_{t}\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right)} && \text{(using prev. eq.)} \end{aligned}

Solving for f_t(\mathbf x_t) -f_t(\mathbf x_{t+1}) we get

f\left(\mathbf x_{t}\right)-f_{t}\left(\mathbf x_{t+1}\right) \leqslant \eta\left\|\nabla_{t} f\left(\mathbf x_{t}\right)\right\|_{*}^{2}

Equipped with stability and the BTL lemma we almost directly obtain the regret bound for FTRL.

3.2 Regret bound for FTRL

With learning rate \eta we have

\mathcal R_{T}^{\mathrm{FTRL}} \leqslant \frac{D}{\eta}+\eta \sum_{t=1}^{T}\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}

where D=\max_{\mathbf x} \psi(\mathbf x)-m_{\mathbf x} n \varphi(\mathbf x).

By defining f_0(\mathbf x) = \frac 1\eta \psi(\mathbf x) we can write the regularized selection as

\mathbf x_{t}=\operatorname*{arg\,min}_{\mathbf x} \sum_{r=1}^{t-1} f_{\tau}(\mathbf x)+\frac{1}{\eta} \psi(\mathbf x)=\operatorname*{arg\,min}_{\mathbf x} \sum_{\tau=0}^{t-1} f_{\tau}(\mathbf x)

First note that by the BTL lemma we haveNote that we additionally use f_0(\mathbf x^*) \geq f_0(\mathbf x_1) since BTL only applies to t=1,...,T.

\sum_{t=0}^{T} f_{t}\left(\mathbf x^{*}\right) \geqslant \sum_{t=0}^{T} f_{t}\left(\mathbf x_{t+1}\right)

with

 w^{*}=\operatorname*{arg\,min}_{\mathbf x} \sum_{t=1}^{T} f_{t}(\mathbf x) \quad \text{(best fixed expert)}

Now we can bound the FTRL regret as followsWe want to use stability so we need to move from \mathbf x^* to \mathbf x_{t+1}. We do this by getting it on a form on which we can apply the BTL lemma.

 \begin{aligned} \mathcal R_T^{\mathrm{FTRL}} &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t}\right)-\sum_{t=1}^{T} f_{t}\left(\mathbf x^{*}\right) \\ &=\sum_{t=1}^{T} f_{t}\left(\mathbf x_{t}\right)-\sum_{t=1}^{t} f_{t}\left(\mathbf x^{*}\right)+f_{0}\left(\mathbf x^{*}\right) \\ & \leqslant \sum_{t=1}^{T} f_{t}\left(\mathbf x_{t}\right)-\sum_{t=0}^{T} f_{t}\left(\mathbf x_{t+1})+f_{0}\left(\mathbf x^{*}\right) \right. && \text{(using BTL lemma)}\\ & \leqslant \eta \sum_{t=1}^{T} \| \nabla_{t} f\left(\mathbf x_{t}\right) \|_{*}^{2}+f_{0}\left(\mathbf x^{*}\right)-f_{0}\left(\mathbf x_{1}\right) && (\text {using stability}) \\ & \leqslant \eta \sum_{t=1}^{T}\left\|\nabla_{t} f\left(\mathbf x_{t}\right)\right\|_{*}^{2}+\frac{1}{\eta}(\underbrace{\max _{\mathbf x} \psi(\mathbf x) - \min_{\mathbf x} \psi(\mathbf x)}_{D})\end{aligned}

Further, if f_t(\mathbf x) is G-Lipschitz constraint (so that the gradient is bounded \| \nabla_t f_t(\mathbf x_t)\|_* \leq G) and the optimal learning rateThe optimal \eta is found by simply optimizing the above regret bound which is of the form: \operatorname*{arg\,max}_{n} a x+b x^{-1}=\frac{\sqrt{b}}{\sqrt{a}}.

\eta = \sqrt{\frac {D}{TG^2}} is chosen, then we have

\mathcal R_T^{\mathrm{FTRL}} = \mathcal O (G\sqrt{TD}).

4 FTRL with linear losses

At this point we have good regret bounds but at every iteration t we have to minimize a sum of t convex functions. This simplifies significantly if we assume that the loss is linear f_{t}(\mathbf x)=\left\langle \mathbf g_{t}, \mathbf x\right\rangle since we can push in the summation:

\begin{aligned} \mathbf x_{t} &=\operatorname*{arg\,min}_{\mathbf x} \sum_{\tau=1}^{t-1}\left\langle \mathbf g_{\tau}, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x) \\ &=\operatorname*{arg\,min}_{\mathbf x} \left\langle\sum_{\tau=1}^{t-1} \mathbf g_\tau, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x). \end{aligned}

Now we only have to keep track of a sum of gradients \mathbf u_t = \sum_{\tau=1}^T \mathbf g_\tau and the minimization problem reduces to a sum of a linear function and the strongly convex regularizer. So we can write the program as:

FTRL with linear loss

  1. \mathbf x_{t+1}=\operatorname*{arg\,min}_{\mathbf x}\left\langle \mathbf u_{t}, \mathbf x\right\rangle+\frac{1}{\eta} \psi(\mathbf x)
  2. \mathbf u_{t+1}=\mathbf u_{t}+\mathbf g_{t}

4.1 Linear Loss is Sufficient

It turns out that we can assume linear loss without loss of generality! That is, we can get the same regret bound for convex loss functions even if we run the above algorithm only using the gradient \mathbf g_t = \nabla f_t(\mathbf x_t) at each \mathbf x_t. To see this, consider

\begin{aligned} \mathcal R_{T}^{\mathrm{FTRL}} & \leqslant \max _{\mathbf x} \sum_{t=1}^{T} f_{t}\left(u_{t}\right)-f_{t}(\mathbf x) \\ & \leqslant \max _{\mathbf x} \sum_{t=1}^{T}\left\langle\nabla f_{t}\left(\mathbf x_{t}\right), \mathbf x_{t}-\mathbf x\right\rangle. && \text{(Convexity applied to each term)}\end{aligned}

So we can bound the regret for f_t(\mathbf x) by bounding the regret for the new problem which only considers its gradient f&39;_t(\mathbf x)=\braket{\nabla f_t(\mathbf x_t), \mathbf x}. The bound stays as tight since we already used convexity in our argument for stability.

These notes are primarily based on Prof. Luo’s lectures notes [1] with minor modifications borrowed from Hazan’s book [2] and the lecture notes of Prof. Orabona [3].

[1] “CSCI 699: Introduction to Online Learning.” [Online]. Available: https://haipeng-luo.net/courses/CSCI699/. [Accessed: 02-Nov-2019]

[2] E. Hazan, “Introduction to Online Convex Optimization,” Sep. 2019 [Online]. Available: http://arxiv.org/abs/1909.05207. [Accessed: 02-Nov-2019]

[3] “Introduction to Online Learning,” Parameter-free Learning and Optimization Algorithms, 20-Sep-2019. [Online]. Available: https://parameterfree.com/lecture-notes-on-online-learning/. [Accessed: 02-Nov-2019]