\[\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min}\]

Gaussian processes and Hedge for infinite armed bandits#

Posted on 2020-01-07

Warning

This post is under construction.

  • Today we will cover a “non-convex” setting.

  • The setting is that of adversarial BO.

  • We will get results for infinite armed bandits. Crucially, we will introduce known adversarial actions.

Bayesian optimization#

Bayesian optimization (BO) is a variant of the Multi-Armed Bandit setting from the previous post. In particular, it considers:

  • Infinite arms: That is, the decision set is now some subset of Euclidean space, \(\mathcal K \subset \mathbb R^N\).

  • Stochastic arms: The function does not change between iterations but is instead observed through noisy evaluations, \(f_t(a) = f(a) + \epsilon(a) \ \forall t\). The noise \(\epsilon(\cdot)\) is usually assumed to be homoscedastic Gaussian, i.e. \(\epsilon(\cdot) \sim \mathcal N(0, \sigma^2)\).

It tightens the assumptions in MAB slightly by also assuming that the function is expensive to evaluate. This justifies spending a large computational budget on point selection, which motivates methods such as Gaussian processes.

A popular example of BO is hyperparameter optimization, e.g.:

  • \(\mathcal K\) is the hyperparameters of a neural network such as depth, width, activation function etc.

  • \(f(a) + \epsilon(a)\) is the measured test accuracy of the neural network after training, with added noise due to the randomness of the initialization. Notice that the evaluation is indeed very expensive.

Regret A small notational complication is that BO usually considers maximization of a function instead of minimization. We are thus usually interested in the cumulative regret defined as

\[\mathcal R_T = \max_a \sum_{t=1}^T f(a) - \sum_{t=1}^T f(a_t)\]

where our algorithm selects \(a_t\).

This setting is well-studied but we will now complicate the setup by considering an adversarial setting instead of stochastic arms.

GP-UCB#

Algorithm 6 (GP-UCB)

Select the next point as

\[a_t = \argmax_a \operatorname{UCB}_t(a)\]

where \(\operatorname{UCB}_t(a) = \mu_t(a) + \beta_t\sigma_t(a).\)
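To make the selection rule concrete, here is a minimal GP-UCB sketch in Python over a discretized candidate set, using scikit-learn's GP regressor. The objective `f`, kernel, noise level and the constant `beta_t` are all hand-picked for illustration, not the values prescribed by the theory.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda a: -(a - 0.3) ** 2                        # stand-in unknown reward function
candidates = np.linspace(0, 1, 200).reshape(-1, 1)   # discretized decision set K
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-2)

X, y = np.empty((0, 1)), np.empty(0)
for t in range(1, 31):
    if len(y) > 0:
        mu, sigma = gp.predict(candidates, return_std=True)
    else:
        mu, sigma = np.zeros(len(candidates)), np.ones(len(candidates))  # GP prior
    beta_t = 2.0                                     # hand-picked exploration parameter
    a_t = candidates[np.argmax(mu + beta_t * sigma)] # a_t = argmax_a UCB_t(a)
    y_t = f(a_t[0]) + 0.1 * rng.standard_normal()    # noisy observation of f(a_t)
    X, y = np.vstack([X, a_t.reshape(1, -1)]), np.append(y, y_t)
    gp.fit(X, y)                                     # update the posterior

print("best observed arm:", X[np.argmax(y), 0])
```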

GP-MW#

Setting In the context of OCO we are thus considering an infinite armed, limited information setting with adversarially picked functions. Note that we have made no convexity assumption on the functions. Concretely, following Sessa et al. [2019], think of a repeated game in which we are player \(i\) with arm set \(\mathcal A^i\): at each round we play \(a^i_t\), the remaining players (the adversary) play \(a^{-i}_t\), and we receive the payoff \(r^i(a^i_t, a^{-i}_t)\). Our regret as player \(i\) is

\[\mathcal R^{i}_T=\max _{a \in \mathcal{A}^{i}} \sum_{t=1}^{T} r^{i}\left(a, a_{t}^{-i}\right)-\sum_{t=1}^{T} r^{i}\left(a_{t}^{i}, a_{t}^{-i}\right) \]

Sessa et al. [2019] crucially assume known adversarial actions, i.e. that \(a^{-i}_t\) is observed after each round. This is intuitively reasonable: in the infinite arm setting we need to exploit correlation between the payoffs of our actions, and without the adversarial actions this correlation would be unknown.

This assumption allows us to formulate a rather simple algorithm:

  1. At each time \(t\) construct the acquisition function \(\operatorname{UCB}_t(a, a^{-i}_t)\) using all previous observations.

  2. Notice that \(\set{\operatorname{UCB}_t(a, a^{-i}_t)}_t\) is known on the full space in contrast to the loss \(\ell(a, a^{-i}_t)\). Thus the Hedge algorithm from the previous post can be applied directly. This is made precise below.

Algorithm 7 (GP-MW)

\[\begin{split}\begin{aligned} & [\hat \ell_t]_a = \min\set{1, \operatorname{UCB}_t(a, a_t^{-i})} && \text{(Compute optimistic estimate)} \\ & [w_{t+1}]_a = [w_{t}]_a \exp{(- \eta [\hat \ell_t]_a)} / Z. && \text{(Update mixed strategy)} \end{aligned}\end{split}\]
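To make the update concrete, here is a minimal numpy sketch of one GP-MW round, assuming a GP model `gp` fitted on the joint actions observed so far; the function name and the parameters `arms`, `beta_t`, `eta` are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def gp_mw_round(w, gp, arms, a_minus_i, beta_t, eta):
    """One GP-MW update (sketch of Algorithm 7) over our finite arm set.

    w         : current weights over our arms, shape (K_i,)
    gp        : GP over joint actions exposing .predict(X, return_std=True)
    arms      : our candidate arms, shape (K_i, d_i)
    a_minus_i : the observed adversarial actions of this round, shape (d,)
    """
    # Evaluate the acquisition UCB_t(a, a_t^{-i}) for every one of our arms.
    joint = np.hstack([arms, np.tile(a_minus_i, (len(arms), 1))])
    mu, sigma = gp.predict(joint, return_std=True)
    ucb = mu + beta_t * sigma

    ell_hat = np.minimum(1.0, ucb)        # optimistic loss estimate, clipped at 1
    w = w * np.exp(-eta * ell_hat)        # multiplicative-weights (Hedge) update
    p = w / w.sum()                       # normalized mixed strategy (the "/ Z")
    a_next = arms[np.random.choice(len(arms), p=p)]   # sample our next arm
    return w, p, a_next
```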

To understand exactly why this works, let us dig into the analysis.

Regret analysis#

The main ingredient will be characterizing the function by the posterior GP.

Lemma 4 (Confidence lemma)

Let

  • unknown \(f\) have bounded RKHS norm \(||f||_k \leq B\).

  • the noise \(\epsilon_t\) be \(\sigma\)-sub-Gaussian in our observation \(y_t=f(a_t) + \epsilon_t\).

  • \(\beta_{t}=B+\sqrt{2\left(\gamma_{t-1}+\log (1 / \delta)\right)}\).

Then with probability at least \(1-\delta\) the following holds

\[ \left|\mu_{t-1}(\mathbf{a})-f(\mathbf{a})\right| \leq \beta_{t} \sigma_{t-1}(\mathbf{a}), \quad \forall \mathbf{a} \in \mathcal{K}, \quad \forall t \geq 1. \]

where \(\mu_t\) and \(\sigma_t\) are the posterior mean and standard deviation of our GP model conditioned on all observations up to time \(t\).

In other words, with high probability the true function lies within a confidence band computed by our Gaussian process model.
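As an informal sanity check (not the formal statement), we can fit a GP to noisy evaluations of a known function and verify that the band \(\mu \pm \beta\sigma\) covers it on a test grid. The function, kernel and the hand-picked \(\beta\) below are illustrative, not the \(\beta_t\) of the lemma.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                          # "unknown" test function

X = rng.uniform(0, 1, size=(20, 1))                  # noisy observations y = f + eps
y = f(X[:, 0]) + 0.1 * rng.standard_normal(20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-2)
gp.fit(X, y)

grid = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(grid, return_std=True)        # posterior mean and std

beta = 3.0                                           # hand-picked, not the lemma's beta_t
inside = np.abs(mu - f(grid[:, 0])) <= beta * sigma
print("fraction of the grid inside the confidence band:", inside.mean())
```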

Split into two terms#

Our main problem in the analysis is treating the limited information setting. Using the Confidence Lemma we will see how we can instead treat a full information setting.

Using the confidence lemma With this we can upper and lower bound the loss with high probability (this section is almost entirely lifted from the original paper):

\[\operatorname{UCB}_{t}(\mathbf{a})-2 \beta_{t} \sigma_{t-1}(\mathbf{a}) \leq \ell^{i}(\mathbf{a}) \leq \min \left\{1, \operatorname{UCB}_{t}(\mathbf{a})\right\}.\]

This particular way of writing the bounds allows us to split the regret into two parts,

\[\begin{split}\begin{aligned} \mathcal R^{i}_T &=\sum_{t=1}^{T} \ell^{i}\left(\bar{a}, a_{t}^{-i}\right)-\sum_{t=1}^{T} \ell^{i}\left(a_{t}^{i}, a_{t}^{-i}\right)\\ &\leq \sum_{t=1}^{T} \min \left\{1, \operatorname{UCB}_{t}\left(\bar{a}, a_{t}^{-i}\right)\right\}-\sum_{t=1}^{T}\left[\operatorname{UCB}_{t}\left(a_{t}^{i}, a_{t}^{-i}\right)-2 \beta_{t} \sigma_{t-1}\left(a_{t}^{i}, a_{t}^{-i}\right)\right]\\ &\leq \underbrace{\sum_{t=1}^{T} \min \left\{1, \operatorname{UCB}_{t}\left(\bar{a}, a_{t}^{-i}\right)\right\}-\sum_{t=1}^{T} \min \left\{1, \operatorname{UCB}_{t}\left(a_{t}^{i}, a_{t}^{-i}\right)\right\}}_{{\text{($*$)}}} + \underbrace{2 \beta_{T} \sum_{t=1}^{T} \sigma_{t-1}\left(a_{t}^{i}, a_{t}^{-i}\right)}_{{\text{($**$)}}}. \end{aligned}\end{split}\]

The first term (\(*\)) can now be treated as a full information setting where the function is \(\hat \ell_t(a) = \min \set{1, \operatorname{UCB}_t(a, a^{-i}_t)}\), while the second part (\(**\)) can be analysed independently of the algorithm (this trick is in fact more general and has also been applied to reinforcement learning [Cheng et al., 2019]).

Bounding with Hedge#

We cannot apply our tools from OCO directly to \(\hat \ell_t(a) = \min \set{1, \operatorname{UCB}_t(a, a^{-i}_t)}\) since it is non-convex in general. However, we can linearize the problem by optimizing over a probability distribution instead.

Lemma 5

Given an Online Optimization problem with decisions \(a_t \in \mathcal K\) and losses \(\ell_t(\cdot)\), we can lift it into a linear problem over distributions \(p_t \in \Delta(\mathcal K)\),

\[\begin{split}\begin{aligned} \mathbb E[\mathcal R_T] & = \sum_{t=1}^T \mathbb E[\ell_t(a_t)] - \min_a \sum_{t=1}^T \ell_t(a) \\ & = \sum_{t=1}^T \braket{\ell_t, p_t} - \min_p \sum_{t=1}^T \braket{\ell_t, p}. \end{aligned}\end{split}\]

where expectation is taken w.r.t. the randomness of the algorithm.

Applying this to \(\hat \ell_t(a)\) allows us to bound (\(*\)) in the regret using Hedge from the previous post even though \(\ell_t(\cdot)\) is non-convex. (Notice that even though we recast it in terms of a mixed strategy \(p_t\), the loss we incur is still w.r.t. a single arm selection \(a_t\). This can seem slightly confusing. However, this slight modification of Hedge has no effect on the regret, since \(p_t\) is still constructed based on full information. Thus, the expectation of the loss over the randomness of the algorithm is equivalent to the linear loss on the mixed strategy, \(\mathbb E_{a_t \sim p_t}[\ell_t(a_t)]=\braket{\ell_t, p_t}\).)
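A tiny numerical illustration of this point (arbitrary numbers): sampling the arm from the mixed strategy gives, in expectation, exactly the linear loss \(\braket{\ell_t, p_t}\).

```python
import numpy as np

rng = np.random.default_rng(0)
ell = np.array([0.2, 0.9, 0.5, 0.1])      # some loss vector over four arms
p = np.array([0.4, 0.1, 0.2, 0.3])        # current mixed strategy

samples = rng.choice(len(ell), size=200_000, p=p)    # play a_t ~ p_t repeatedly
print(ell[samples].mean())                # Monte Carlo estimate of E_{a ~ p}[ell(a)]
print(ell @ p)                            # linear loss <ell, p>; the two agree
```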

Using the high probability bound relying on the Azuma-Hoeffding inequality from the previous post, we get with probability \(1-\delta\) that

\[(*) = \mathcal{O}(\sqrt{T \log K_{i}}+\sqrt{T \log (2 / \delta)}). \]

Bounding the variance term#

What remains is to bound (\(**\)). This part is what requires known adversarial actions: otherwise we intuitively cannot ensure that the variance at the played points will decrease. Since the GP is conditioned on the joint actions that were actually played, standard results bound the sum of posterior standard deviations in terms of the maximum information gain, \(\sum_{t=1}^{T} \sigma_{t-1}(a_t^i, a_t^{-i}) = \mathcal O(\sqrt{T \gamma_T})\), so that \((**) = \mathcal O(\beta_T \sqrt{T \gamma_T})\).
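As a rough empirical illustration of this intuition (illustrative function, kernel and noise, not the formal bound): conditioning the GP on the joint actions that were actually played shrinks the posterior standard deviation at those points, so its running sum grows sublinearly.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x[:, 0])                    # stand-in payoff over joint actions
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)

X, y, cum_sigma = np.empty((0, 1)), np.empty(0), 0.0
for t in range(1, 51):
    a_t = rng.uniform(0, 1, size=(1, 1))             # joint action played at round t
    if len(y) > 0:
        _, sigma = gp.predict(a_t, return_std=True)
        sigma_t = float(sigma[0])
    else:
        sigma_t = 1.0                                # prior std of the RBF kernel
    cum_sigma += sigma_t                             # running sum of sigma_{t-1}(a_t)
    X, y = np.vstack([X, a_t]), np.append(y, f(a_t) + 0.1 * rng.standard_normal())
    gp.fit(X, y)                                     # condition on the observed action
    if t % 10 == 0:
        print(t, round(cum_sigma, 2), round(cum_sigma / t, 3))  # average shrinks
```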

Final regret bound#

Theorem 3 (GP-MW finite arms regret bounds)

Let

  • \(\beta_{t}=B+\sqrt{2\left(\gamma_{t-1}+\log (2 / \delta)\right)}\)

  • \(\eta=\sqrt{\left(8 \log K_{i}\right) / T}\)

then

\[ R^{i}(T)=\mathcal{O}(\sqrt{T \log K_{i}}+\sqrt{T \log (2 / \delta)}+B \sqrt{T \gamma_{T}}+\sqrt{T \gamma_{T}\left(\gamma_{T}+\log (2 / \delta)\right)})\]

Discretization#

So far the algorithm has been defined in terms of a finite number of arms (even though the \(\operatorname{UCB}_t(\cdot)\) on which the mixed strategy is computed is in fact continuous). This does not yet solve the non-convex infinite arm setting we originally set out for. One simple solution, taken in Sessa et al. [2019], is to discretize the space.

For this we just have to be careful to cover our space densely enough so that the regret does not suffer. The Lipschitz assumption on \(\ell_t(\cdot)\) ensures this: if a point is close to its discretized counterpart, their losses are also close.

To make this precise:

Corollary 3 (GP-MW infinite arms regret bounds)

Discretize \(\mathcal K_i \subset \mathbb R^{d_i}\) with \([\mathcal K_i]_T\) where \(\left|[\mathcal K_i]_{T}\right|=(L b \sqrt{d_{i} T})^{d_i}\) so that \(\left\|a-[a]_{T}\right\|_{1} \leq \sqrt{d_{i} / T} / L\) for all \(a \in \mathcal K_i\). The per-round discretization error is then at most \(L\left\|a-[a]_{T}\right\|_{1} \leq \sqrt{d_{i}/T}\), i.e. \(\sqrt{d_i T}\) over \(T\) rounds, and the bound of Theorem 3 applies with \(\log K_i = d_i \log (L b \sqrt{d_i T})\).
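A minimal sketch of such a discretization, assuming \(\mathcal K_i = [0, b]^{d_i}\) and purely illustrative values for \(b\), \(d_i\), \(T\) and \(L\):

```python
import numpy as np

def discretize(b=1.0, d_i=2, T=100, L=1.0):
    """Uniform grid over [0, b]^{d_i} with L1 resolution sqrt(d_i / T) / L."""
    target = np.sqrt(d_i / T) / L          # required L1 distance to the nearest grid point
    # Per-axis spacing h gives an L1 error of at most d_i * h / 2.
    m = int(np.ceil(d_i * b / (2 * target))) + 1     # points per axis
    axes = [np.linspace(0.0, b, m) for _ in range(d_i)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d_i)
    return grid                            # |grid| = m**d_i, of order (L b sqrt(d_i T))**d_i

arms = discretize()
print(arms.shape)                          # (81, 2) for the defaults above
```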

Final remarks#

In the OCO framing we have thus treated non-convex functions in the bandit setting when the decision set is a finite (or discretized infinite) set of arms, lifted to the simplex of mixed strategies.

For more general non-convex problems in online optimization we face slightly different questions. First, how do we even define regret? After all, we can do arbitrarily badly, since in the non-convex setting we are only promised convergence to a local minimum. Two recent papers deal with these questions.

I have intentionally left out some details of the canonical BO/GP part to keep distractions to a minimum – but with the obvious danger of being imprecise. For precisely specified assumptions of each lemma and statement please see Sessa et al. [2019].


CdCBG19

Ching-An Cheng, Remi Tachet des Combes, Byron Boots, and Geoff Gordon. A reduction from reinforcement learning to no-regret online learning. 2019. arXiv:1911.05873.

SBKK19

Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. No-regret learning in unknown games with correlated payoffs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13602–13611. Curran Associates, Inc., 2019. URL: http://papers.nips.cc/paper/9514-no-regret-learning-in-unknown-games-with-correlated-payoffs.pdf.