Hedge and bandits#

Posted on 2020-01-06

We will prepare ourselves for a non-convex setting that uses results from OCO.

  • First derive Hedge as a particular case of FTRL.

  • Introduce the Bandit setting, to which Hedge can almost immediately be applied.

Expert problem and Hedge#

Problem formulation In the expert problem we are to select a weighting of experts which minimizes a loss that is simply a weighted sum of the losses associated with the individual experts. An example of this setting could be portfolio selection where each stock is an expert and the loss is the negative return. Formalized in our OCO framework this corresponds to the decision space being the finite-dimensional simplex \(\Omega = \Delta_N\) and the loss being linear, \(f_t(p) = \braket{p,\ell_t}\), for which we will further assume boundedness, \(\ell_t \in [0,1]^N\).

  1. Player picks \(p_t \in \Delta_N\).

  2. Player observes \(\ell_t\) and suffers \(\braket{p_t, \ell_t}\).

We are thus interested in minimizing the following regret,

\[\mathcal{R}_{T}=\sum_{t=1}^{T}\left\langle p_{t}, \ell_{t}\right\rangle-\min _{p \in \Delta_N} \sum_{t=1}^{T}\left\langle p, \ell_{t}\right\rangle=\sum_{t=1}^{T}\left\langle p_{t}, \ell_{t}\right\rangle-\sum_{t=1}^{T} \ell_{t}\left(i^{\star}\right)\]

with \(i^{\star} \in \operatorname{argmin}_{i} \sum_{t=1}^{T} \ell_{t}(i)\).
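As a quick sanity check, here is a minimal numpy sketch (the helper and its interface are my own, not part of the original setup) that computes this regret for a given sequence of plays and losses:

```python
import numpy as np

def regret(P, L):
    """Regret of the plays P against the best fixed expert in hindsight.

    P : (T, N) array, row t is the distribution p_t over the N experts.
    L : (T, N) array, losses ell_t(i) in [0, 1].
    """
    player_loss = np.sum(P * L)          # sum_t <p_t, ell_t>
    best_expert = L.sum(axis=0).min()    # min_i sum_t ell_t(i)
    return player_loss - best_expert
```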

FTRL#

If we want to apply FTRL to this problem we need to choose the regularizer. A natural choice when the decision set is the simplex is the (negative) entropy

\[\psi(p) = \sum_{i=1}^{N} p(i) \ln p(i).\]

Our proposed solution is thus,

Algorithm 2 (FTRL with negative entropic regularization)

\[p_{t}=\underset{p \in \Delta_N}{\operatorname{argmin}}\left\langle p, \sum_{\tau=1}^{t-1} \ell_{\tau}\right\rangle+\frac{1}{\eta} \sum_{i=1}^{N} p(i) \ln p(i).\]

Regret Bounds#

For the regret bound to go through we need to prove that the negative entropy function is strongly convex w.r.t. the \(\ell_1\)-norm,

\[\psi(p)-\psi(q) \leq\langle\nabla \psi(p), p-q\rangle-\frac{1}{2}\|p-q\|_{1}^{2}.\]

Under the \(\ell_1\)-norm this reduces to the well-known Pinsker’s inequality, which relates the total variation distance (i.e. half the \(\ell_1\)-distance) with the KL-divergence (we need the derivatives of the negative entropy here: \(\partial \psi(p)/\partial p(i) = \ln p(i) + 1\)),

\[\begin{split}\begin{aligned} \frac{1}{2}\|p-q\|_{1}^{2} & \leq \langle\nabla \psi(p), p-q\rangle - \psi(p) + \psi(q) && \text{(rearranging)} \\ & = \sum_i q(i) \ln \frac{q(i)}{p(i)} && \text{(expanding; the linear terms cancel)} \\ & = \operatorname{KL}(q,p). \end{aligned}\end{split}\]
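Spelling out the cancellation in the middle step:

\[\begin{split}\begin{aligned} \langle\nabla \psi(p), p-q\rangle - \psi(p) + \psi(q) &= \sum_i \left(\ln p(i) + 1\right)\left(p(i)-q(i)\right) - \sum_i p(i) \ln p(i) + \sum_i q(i) \ln q(i)\\ &= - \sum_i q(i) \ln p(i) + \sum_i q(i) \ln q(i) && \text{($\sum_i (p(i)-q(i)) = 0$ and the $\sum_i p(i)\ln p(i)$ terms cancel)}\\ &= \sum_i q(i) \ln \frac{q(i)}{p(i)}. \end{aligned}\end{split}\]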

Knowing that our assumption holds we can move on to specializing the general regret bound obtained for FTRL to our particular case. Recall the bound,

\[\begin{aligned} \mathcal R_T^{\mathrm{FTRL}} \leqslant \eta \sum_{t=1}^{T}\left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*}^{2}+\frac{1}{\eta}\left(\max _{\mathbf x} \psi(\mathbf x) - \min_{\mathbf x} \psi(\mathbf x)\right). \end{aligned}\]

We need to bound the dual norm of the gradient and the term depending on the negative entropy. The former we can bound through the loss,

\[\begin{split}\begin{aligned} \left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{*} & = \left\|\nabla f_{t}\left(\mathbf x_{t}\right)\right\|_{\infty} && \text{(dual norm of $\ell_1$)} \\ &= \left\|\ell_t \right\|_{\infty} && \text{(gradient of the linear loss)} \\ &\leq 1. && \text{(assumed boundedness of the loss)}\end{aligned}\end{split}\]

The sum of the squares of these terms is thus trivially bounded by \(T\).

Concerning the negative entropy: over discrete distributions it is maximized by a point mass and minimized by the uniform distribution, giving \(\max_p \psi(p) = 0\) and \(\min_p \psi(p) = -\ln N\) respectively. Thus we obtain the following bound (be aware that this can be tightened to \(\mathcal R_T^{\mathrm{FTRL}} \leq \sqrt{(T \ln N) / 2}\); see e.g. Prediction, Learning, and Games),

\[\begin{split}\begin{aligned} \mathcal R_T^{\mathrm{FTRL}} &\leq T \eta + \frac{\ln N}{\eta}\\ &\leq 2 \sqrt{T \ln N}. && \text{(optimizing $\eta$)} \end{aligned} \end{split}\]
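For completeness, optimizing over \(\eta\) just balances the two terms:

\[\frac{\mathrm d}{\mathrm d \eta}\left(T\eta + \frac{\ln N}{\eta}\right) = T - \frac{\ln N}{\eta^{2}} = 0 \quad\Longrightarrow\quad \eta = \sqrt{\frac{\ln N}{T}}, \qquad T\eta + \frac{\ln N}{\eta} = 2\sqrt{T \ln N}.\]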

Regret bound for samples Say we incur the loss \(\ell_t(i_t)\) where \(i_t \sim p_t\) instead of incurring the loss of the entire mixed strategy, \(\braket{\ell_t, p_t}\). In expectation these are obviously the same, \(\mathbb E_{i_t \sim p_t}[\ell_t(i_t)]=\braket{\ell_t, p_t}\), so the following two regrets

\[\begin{split}\begin{aligned} \tilde{\mathcal R}_T &= \sum_{t=1}^{T}\left\langle p_{t}, \ell_{t}\right\rangle - \sum_{t=1}^{T} \ell_{t}\left(i^{\star}\right) \\ \mathcal R_T &= \sum_{t=1}^T \ell_t(i_t) - \sum_{t=1}^{T} \ell_{t}\left(i^{\star}\right), \end{aligned}\end{split}\]

are equivalent in expectation,

\[\mathbb E[\mathcal R_T] = \tilde{\mathcal R}_T. \]

However, we might be interested in bounding the regret with high probability instead (there is a good discussion of high-probability regret bounds, mainly in the context of Bandits, which we will cover promptly; the motivation is simple: even though we might be doing well in expectation the variance might be very high, so a concrete instantiation of the algorithm could still suffer high regret). This can naturally be reduced to a concentration argument where we make a statement about how far the random variables \(\ell_{t}\left(i_{t}\right)\) typically are from their expectations \(\left\langle p_{t}, \ell_{t}\right\rangle\).

Oblivious adversary Subtlety: the history, which the environment may depend on, is now a random variable. We call the adversary oblivious if it chooses the stream of losses prior to the game, i.e. the randomness of our algorithm does not affect the environment we are compared against. This is the setting we will be considering.

Even in this simplified setting we can still ask for a bound on the regret that holds with high probability rather than in expectation. We will use a general concentration inequality which relies on martingale difference sequences.

Definition 2 (Martingale difference sequence)

The random variables \((Z_t)_{t\in \mathbb N}\) form a martingale difference sequence if

\[\mathbb E[Z_t | \mathcal H_{t-1}] = 0 \]

where \((\mathcal H_t)_{t \in \mathbb N}\) is the filtration (here, the history of the game up to time \(t\)).

Lemma 2 (Azuma-Hoeffding inequality)

If \((Z_t)\) is a martingale difference sequence then

\[P\left[\sum_{t=1}^T Z_t \leq \sqrt{2 \ln (1/\delta) \sum_{t=1}^T \alpha_t^2}\right] \geq 1-\delta \]

with \(|Z_t| \leq \alpha_t\).
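As a quick numerical illustration of the inequality (a self-contained sketch, not part of the analysis), take the simplest martingale difference sequence of i.i.d. \(\pm 1\) signs with \(\alpha_t = 1\):

```python
import numpy as np

rng = np.random.default_rng(0)
T, delta, n_runs = 1_000, 0.05, 10_000

# Z_t = +/-1 with equal probability: E[Z_t | past] = 0 and |Z_t| <= 1.
Z = rng.choice([-1.0, 1.0], size=(n_runs, T))
sums = Z.sum(axis=1)

# sqrt(2 ln(1/delta) * sum_t alpha_t^2) with alpha_t = 1
bound = np.sqrt(2 * np.log(1 / delta) * T)
print(np.mean(sums <= bound))  # empirical probability, should be >= 0.95
```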

The Azuma-Hoeffding inequality readily lets us bound the regret. Pick \(Z_t = \ell_{t}\left(i_{t}\right) - \braket{p_{t}, \ell_{t}}\). We then know that

  • \(Z_t\) form a martingale difference sequence since \(\mathbb E[\ell_{t}\left(i_{t}\right) - \braket{p_{t}, \ell_{t}} | \mathcal H_{t-1}] = 0\).

  • \(|Z_t| \leq 2\) by the boundedness assumption on the loss.

So we can apply the Azuma-Hoeffding inequality. With probability \(1-\delta\) we have

\[\begin{split}\begin{aligned} \sum_{t=1}^{T} Z_{t} &= \mathcal R_T - \tilde{\mathcal R}_T \leq \sqrt{8 \ln (1 / \delta) T} \Leftrightarrow\\ \mathcal R_T &\leq \tilde{\mathcal R}_T + \sqrt{8 \ln (1 / \delta) T}\\ &\leq 2 \sqrt{T \ln N} + \sqrt{8 \ln (1 / \delta) T} && \text{(Using Hedge bound)} \end{aligned}\end{split}\]

This gives us a high probability regret bound.

Corollary 2 (High probability regret bound)

With probability at least \(1-\delta\) and \(\eta = \mathcal O \left(\sqrt{\ln N / T}\right)\),

\[\mathcal R_T = \mathcal{O} \left(\sqrt{T \ln N}+\sqrt{T \ln \frac{1}{\delta}}\right). \]

Reducing the algorithm to softmax#

At each step we need to solve the constrained minimization problem,

\[p_{t}=\underset{p \in \Delta_N}{\operatorname{argmin}}\left\langle p, \sum_{\tau=1}^{t-1} \ell_{\tau}\right\rangle+\frac{1}{\eta} \sum_{i=1}^{N} p(i) \ln p(i).\]

This turns out to have a closed form solution for our choice of regularization.

To obtain this, notice that the simplex constraint can be treated as the single equality constraint \(\sum_i p(i) = 1\) (the nonnegativity constraints turn out to be inactive, since the solution we derive below has strictly positive entries),

\[\begin{split}\begin{aligned} \min_p& \overbrace{\left\langle p, \sum_{\tau=1}^{t-1} \ell_{\tau}\right\rangle+\frac{1}{\eta} \sum_{i=1}^{N} p(i) \ln p(i)}^{=f(p)},\\ \text{s.t.}& \underbrace{\sum_{i=1}^N p(i) - 1}_{=g(p)} = 0. \end{aligned} \end{split}\]

So we can obtain the exact solution by instead considering the Lagrangian,

\[\begin{split}\begin{aligned} \mathcal L(p, \lambda) &= f(p) - \lambda g(p)\\ &= \left\langle p, \sum_{\tau=1}^{t-1} \ell_{\tau}\right\rangle+\frac{1}{\eta} \sum_{i=1}^{N} p(i) \ln p(i) + \lambda\left(1 - \sum_i p(i)\right). \end{aligned} \end{split}\]

Setting the derivative with respect to each \(p(i)\) to zero,

\[\begin{split}\begin{aligned} \frac{\partial \mathcal L(p, \lambda)}{\partial p(i)} &= \sum_{\tau=1}^{t-1} \ell_{\tau}(i) + \frac 1\eta (1 + \ln p(i)) - \lambda = 0\\ \Leftrightarrow\ p(i) &= \exp{\left(-\eta \sum_{\tau=1}^{t-1} \ell_{\tau}(i) - (1-\eta \lambda)\right)}\\ & = \exp{\left(-\eta \sum_{\tau=1}^{t-1} \ell_{\tau}(i)\right)} \Big/ \exp{\left(1-\eta \lambda\right)}. \end{aligned}\end{split}\]

To determine the remaining unknown, \(\lambda\), we use its associated first-order condition and plug in our derived expressions for the \(p(i)\)’s,

\[\begin{split}\begin{aligned} \frac{\partial \mathcal L(p, \lambda)}{\partial \lambda}=1-\sum_{i=1}^{N} p(i) = 0 \\ \Leftrightarrow 1 - \sum_{i=1}^N \exp{\left(-\eta \sum_{\tau=1}^{t-1} \ell_{\tau}(i)\right)} \Big/ \exp{\left(1-\eta \lambda\right)} = 0 \\ \Leftrightarrow \exp{\left(1-\eta \lambda\right)} = \sum_{i=1}^N \exp{\left(-\eta \sum_{\tau=1}^{t-1} \ell_{\tau}(i)\right)}. \end{aligned} \end{split}\]

Plugging this back into the expression for \(p(i)\) normalizes it into a softmax, which gives us the following update rule:

Algorithm 3 (Hedge)

It updates \(p\) element-wise as follows,

\[\begin{split}\begin{aligned} &w_t(i) = -\eta \sum_{\tau=1}^{t-1} \ell_{\tau}(i) && \text{(Compute weights)}\\ &p_t(i) = \exp{\left(w_t(i)\right)} \Big/ \sum_{j=1}^N \exp{\left(w_t(j)\right)}. && \text{(Turn into prob. dist.)} \end{aligned}\end{split}\]

Re-reformulation: multiplicative weight update Alternatively, we can rewrite this to avoid recomputing the sums from scratch by reusing the previously computed weights.

Algorithm 4 (Multiplicative Weight Update variant of Hedge)

Initializing \(w_1(i) = 0\) for all \(i\), it updates \(p\) element-wise as follows,

\[\begin{split}\begin{aligned} &w_t(i) =w_{t-1}(i) -\eta \ell_{t-1}(i) && \text{(Compute weights)}\\ &p_t(i) = \exp{\left(w_t(i)\right)} \Big/ \sum_{j=1}^N \exp{\left(w_t(j)\right)}. && \text{(Turn into prob. dist.)} \end{aligned}\end{split}\]
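A minimal numpy sketch of this update (the `hedge` helper and its interface are my own; `losses[t]` is assumed to be the full loss vector revealed at round \(t\)):

```python
import numpy as np

def hedge(losses, eta):
    """Multiplicative Weight Update variant of Hedge.

    losses : (T, N) array with entries in [0, 1].
    eta    : learning rate, e.g. sqrt(ln(N) / T).
    Returns the (T, N) array of played distributions p_1, ..., p_T.
    """
    T, N = losses.shape
    w = np.zeros(N)                      # w_1 = 0, so p_1 is uniform
    P = np.empty((T, N))
    for t in range(T):
        shifted = w - w.max()            # shift for numerical stability
        P[t] = np.exp(shifted) / np.exp(shifted).sum()   # softmax of the weights
        w = w - eta * losses[t]          # multiplicative weights, in log-space
    return P
```

Together with the regret helper sketched earlier one can check empirically that the regret grows like \(\sqrt{T \ln N}\).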

Final Remarks#

So to summarize: for the simplex introducing the entropy as regularizer simply reduces to replacing the max (if we were to use Follow The Leader) with a softmax.

This method goes by many other names in the literature apart from Hedge: Exponentiated Gradient Descent, Weighted Majority, and Multiplicative Weight Update, to mention the most common.

Exercises

  • What happens if we choose a different norm than \(\ell_1\)?

  • What about a different regularizer than negative entropy?

Multi-Armed Bandit#

In the multi-armed bandit problem we consider a slight modification of the setting to which we applied Hedge. The difference lies in the limited feedback: we only observe the loss \(\ell_t(i_t)\) that we incur, instead of the entire loss vector \(\ell_t\).

Bandit setting The player is given a fixed decision set \(\mathcal K\). At each round \(t=1,...,T\):

  1. Player picks \(i_t \in \mathcal K\).

  2. The environment picks a loss vector \(\ell_t\).

  3. The player then observes and suffers loss \(\ell_t(i_t)\).

Exploration/Exploitation Central to this setting is the exploration-exploitation tradeoff: we want to learn the best arm to pull (explore) but also pull the right arm (exploit). It is a tradeoff since learning something about a particular arm now requires us to choose it and incur the associated loss.

Finite arms Much of the notation allows a continuous \(\mathcal K\) (i.e. infinitely many arms) but we will only consider finite-armed bandits in this post, i.e. \(|\mathcal K| = K\).

Applying Hedge We cannot naively apply Hedge as it would require full information about the previous loss vectors \(\ell_t\). What we can do is construct an unbiased estimator of the loss vector (the gradient of our linear loss) to which we then apply Hedge.

Lemma 3 (Importance-Weighted Estimator)

Say you sample \(I_t \sim p_t\) and construct

\[\begin{split}\tilde{\ell}_{t}(i)=\begin{cases} \ell_{t}(i) / p_t(i) & i = I_t \\ 0 & \mathrm{otherwise.} \end{cases}\end{split}\]

then this estimator is unbiased

\[\mathbb E_{I_t \sim p_t}\left[\tilde{\ell}_{t}(i)\right]=\left(1-p_{t}(i)\right) \cdot 0+p_{t}(i) \cdot \frac{\ell_{t}(i)}{p_{t}(i)}=\ell_{t}(i).\]
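A quick numerical check of this unbiasedness (a sketch with made-up values for \(p_t\) and \(\ell_t\)):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])      # p_t
ell = np.array([0.4, 0.9, 0.1])    # ell_t (only the sampled entry is observed)

n = 100_000
estimates = np.zeros((n, 3))
for s in range(n):
    I = rng.choice(3, p=p)           # I_t ~ p_t
    estimates[s, I] = ell[I] / p[I]  # importance-weighted estimator
print(estimates.mean(axis=0))        # approximately [0.4, 0.9, 0.1]
```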

Combining importance-weighting with Hedge leads to what has been called Exp3.

Algorithm 5 (Exp3)

Initialize uniformly: \(w_0(i) = 0\ \forall i\)

Then for \(t = 1, \dots, T\):

\[\begin{split}\begin{aligned} &p_t(i) = \exp{\left(w_{t-1}(i)\right)} \Big/ \sum_{j=1}^K \exp{\left(w_{t-1}(j)\right)} && \text{(Turn into prob. dist.)}\\ &I_t \sim p_t && \text{(Sample arm)}\\ &\tilde{\ell}_{t}(i)=\begin{cases} \ell_{t}(i) / p_t(i) & i = I_t \\ 0 & \mathrm{otherwise} \end{cases} && \text{(Construct estimator)}\\ &w_t(i) =w_{t-1}(i) -\eta \tilde{\ell}_{t}(i). && \text{(Compute weights)} \end{aligned}\end{split}\]
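A minimal numpy sketch of Exp3 under the oblivious-adversary assumption (the loss matrix is generated up front for the simulation, but the learner only ever reads the single entry it sampled; the `exp3` helper and its interface are my own):

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """Exp3: Hedge run on importance-weighted loss estimates.

    losses : (T, K) array with entries in [0, 1].
    Returns the sequence of pulled arms I_1, ..., I_T.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    w = np.zeros(K)
    arms = np.empty(T, dtype=int)
    for t in range(T):
        shifted = w - w.max()
        p = np.exp(shifted) / np.exp(shifted).sum()  # p_t from w_{t-1}
        I = rng.choice(K, p=p)                       # sample arm I_t ~ p_t
        arms[t] = I
        ell_hat = np.zeros(K)
        ell_hat[I] = losses[t, I] / p[I]             # importance-weighted estimate
        w = w - eta * ell_hat                        # Hedge update on the estimate
    return arms
```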

Regret analysis The standard FTRL analysis alone is not enough. To see why, let’s look at the FTRL bound on the regret with respect to our estimated losses,

\[\tilde{\mathcal R}_T \leq \frac{\ln K}{\eta}+\eta \sum_{t=1}^{T}\left\|\tilde{\ell}_{t}\right\|_{\infty}^{2}. \]

In order to say anything about the actual regret we need to take the expectation,

\[\begin{split}\begin{aligned} \mathbb E[\mathcal R_T] &= \mathbb E[\tilde{\mathcal R}_T] \\ &\leq \frac{\ln K}{\eta}+\eta \sum_{t=1}^{T}\mathbb E\left[\left\|\tilde{\ell}_{t}\right\|_{\infty}^{2}\right] \\ &\leq \frac{\ln K}{\eta}+\eta \sum_{t=1}^{T}\sum_{i\in \mathcal K}\frac{\ell_{t}(i)^2}{p_t(i)}. \end{aligned}\end{split}\]
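The last step uses that \(\tilde\ell_t\) has at most one non-zero entry, so its sup-norm is just that entry:

\[\mathbb E_{I_t \sim p_t}\left[\left\|\tilde{\ell}_{t}\right\|_{\infty}^{2}\right] = \sum_{i \in \mathcal K} p_{t}(i)\left(\frac{\ell_{t}(i)}{p_{t}(i)}\right)^{2} = \sum_{i \in \mathcal K} \frac{\ell_{t}(i)^{2}}{p_{t}(i)}.\]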

Notice how it will blow up if \(p_t(i)\) is very small for some \(i\) (the initial analysis handled this by explicitly exploring uniformly based on a coin flip; this way \(p_t(i)\) was ensured to be sufficiently big for all \(i\)).

We can tighten the analysis, however, using some tricks almost exclusively reserved for the particular case of Exp3.

Theorem 2 (Exp3 Regret bound)

\[\mathbb E[\mathcal R_T] = \mathcal O \left(\sqrt{T K \log K}\right). \]

Proof can be found here.

Remarks:

  • The lower bound for adversarial regret is \(\Omega(\sqrt{TK})\) (see e.g. this proof), so we achieve minimax regret up to a \(\sqrt{\ln K}\) factor.

  • Compared with the full information setting we are competitive up to a factor of \(\sqrt{K}\).

  • Francesco Orabona has a very different take on the proof coming from an Online Convex Optimization perspective that might be of interest.

Final Remarks#

This presentation primarily pulls from the following sources:

  • The Hedge part heavily relies on Haipeng’s lecture note 3 except for the (standard) derivation with Lagrange multipliers, which weirdly enough I didn’t come across anywhere.

  • The Bandit part is loosely based on Haipeng’s lecture note 12 and lecture note 8 but fundamentally altered to prepare us for Gaussian Processes with Multiplicative Weights which we will cover next.

I’ve skipped some standard results in the Bandit literature, Explore-then-Exploit and Upper Confidence Bound (UCB), so as not to stray too far from the OCO-style analysis we have seen so far. A good starting point for the curious reader is either Haipeng’s notes or one of the many books on Bandits (the literature is vast!).