Gaussian Processes and Hedge for infinite armed bandits

Warning: this post is under construction.

1 Introduction

  • Today we will cover a “non-convex” setting.
  • The setting is that of adversarial BO.
  • We will get results for infinite arm Bandits. Crucially we will introduce known adversarial actions.

2 Bayesian Optimization

Bayesian Optimization (BO) is a variant of the Multi-Armed Bandit setting from the previous post. In particular, it considers:

  • Infinite arms: That is, the decision set is now some subset of the euclidean space \mathcal K \subset \mathbb R^N.
  • Stochastic arms: The function is not changing between iterations but instead observed through some noisy observation, f_t(a) = f(a) + \epsilon(a) \ \forall t. The noise \epsilon(\cdot) is usually assumed to be homogenously Gaussian, i.e. \epsilon(\cdot) \sim \mathcal N(0, \sigma).

It tightens the assumption in MAB slightly by also assuming that the function is expensive to evaluate. This motivates use of methods for the point selection which has high computational budget such as Gaussian Processes.

A popular example of BO is hyperparameter optimization, e.g.:

  • \mathcal K is the hyperparameters of a neural network such as depth, width, activation function etc.
  • f(a) + \epsilon(a) is the measured test accuracy of the neural network after training with the added noisy due to the randomness of the initialization. Notice that the evaluation is indeed very expensive.

A small complication notation-wise is that BO usually considers maximization of a function instead of minimization. We are thus usually interested in the cumulative regret defined as

\mathcal R_T = \max_a \sum_{t=1}^T f(a) - \sum_{t=1}^T f(a_t)

where our algorithm selects a_t.

This setting is well-studied but we will now complicate the setup by considering an adverserial setting instead of stochastic arms.


(GP-UCB) Select the next point as

a_t = \operatorname*{arg\,max}_a \operatorname{UCB}_t(a)

where \operatorname{UCB}_t(a) = \mu_t(a) + \beta_t\sigma_t(a).



In the context of OCO we are thus considering an infinite armed limited information setting with adversarially picked functions. Note that we have mentioned no assumption of convexity of the functions.

\mathcal R^{i}_T=\max _{a \in \mathcal{A}^{i}} \sum_{t=1}^{T} r^{i}\left(a, a_{t}^{-i}\right)-\sum_{t=1}^{T} r^{i}\left(a_{t}^{i}, a_{t}^{-i}\right)

In the work of [1] they crucially assume known adversarial actions. This is intuitively reasonable, since we need to exploit correlation between our actions in the infinite arm setting. Without known adversarial actions this correlation would be unknown.

This assumption allows us to formulate a rather simple algorithm: 1) At each time t construct the acquisition function \operatorname{UCB}_t(a, a^{-i}_t) using all previous observations. 2) Notice that \set{\operatorname{UCB}_t(a, a^{-i}_t)}_t is known on the full space in contrast to the loss \ell(a, a^{-i}_t). Thus the Hedge algorithm from the previous post can be applied directly. This is made precise below.

(GP-MW) \begin{aligned} & [\hat \ell_t]_a = \min\set{1, \operatorname{UCB}_t(a, a_t^{-i})} && \text{(Compute optimistic estimate)} \\ & [w_{t+1}]_a = [w_{t}]_a \exp{(- \eta [\hat \ell_t]_a)} / Z. && \text{(Update mixed strategy)} \end{aligned}

To understand exactly why this works, let us dig into the analysis.

5 Regret Analysis

The main ingredient will be characterizing the function by the posterior GP.

(Confidence lemma) Let

  • unknown f have bounded RKHS norm ||f||_k \leq B.
  • the noise \epsilon_t be \sigma-sub-Gaussian in our observation y_t=f_t(x_t) + \epsilon_t.
  • \beta_{t}=B+\sqrt{2\left(\gamma_{t-1}+\log (1 / \delta)\right)}.

Then with at least 1-\delta the following holds

 \left|\mu_{t-1}(\mathbf{a})-f(\mathbf{a})\right| \leq \beta_{t} \sigma_{t-1}(\mathbf{a}), \quad \forall \mathbf{a} \in \mathcal{K}, \quad \forall t \geq 1.

where \mu_t and \sigma_t is the posterior mean and variance of our GP model conditioned on all observations up till time t.

In other words we can say that the true function lies within some confidence bound computed by our Gaussian Process model with high probability.

5.1 Split into two terms

Our main problem in the analysis is treating the limited information setting. Using the Confidence Lemma we will see how we can instead treat a full information setting.

Using The Confidence Lemma

With this we can upper and lower bound the loss with high probabilityThis section is almost entirely lifted from the original paper.

U C B_{t}(\mathbf{a})-2 \beta_{t} \sigma_{t-1}(\mathbf{a}) \leq \ell^{i}(\mathbf{a}) \leq \min \left\{1, U C B_{t}(\mathbf{a})\right\}.

This particular way of writing the bounds allows us to split the regret into two parts,

\begin{aligned} \mathcal R^{i}_T &=\sum_{t=1}^{T} \ell^{i}\left(\bar{a}, a_{t}^{-i}\right)-\sum_{t=1}^{T} \ell^{i}\left(a_{t}^{i}, a_{t}^{-i}\right)\\ &\leq \sum_{t=1}^{T} \min \left\{1, U C B_{t}\left(\bar{a}, a_{t}^{-i}\right)\right\}-\sum_{t=1}^{T}\left[U C B_{t}\left(a_{t}^{i}, a_{t}^{-i}\right)-2 \beta_{t} \sigma_{t-1}\left(a_{t}^{i}, a_{t}^{-i}\right)\right]\\ &\leq    \underbrace{\sum_{t=1}^{T} \min \left\{1, U C B_{t}\left(\bar{a}, a_{t}^{-i}\right)\right\}-\sum_{t=1}^{T} \min \left\{1, U C B_{t}\left(a_{t}^{i}, a_{t}^{-i}\right)\right\}}_{{\text{($*$)}}}   + \underbrace{2 \beta_{T} \sum_{t=1}^{T} \sigma_{t-1}\left(a_{t}^{i}, a_{t}^{-i}\right)}_{{\text{($**$)}}}. \end{aligned}

The first term (*) can now be treated as a full information setting where the function is \hat \ell_t(a) = \min \set{1, \operatorname{UCB}_t(a, a^{-i}_t)} while the second part (**) can be analysed independent of the algorithmThis trick is in fact more general and has also been applied to Reinforcement Learning [2].


5.1.2 Bounding with Hedge

We cannot apply our tools from OCO directly to \hat \ell_t(a) = \min \set{1, \operatorname{UCB}_t(a, a^{-i}_t)} since it is non-convex in general. However, we can linearize the problem by optimizing a probability distribution instead.

Given an Online Optimization problem with decisions a_t \in \mathcal K and \ell_t(\cdot) loss we can lift it into a linear problem in p_t \in \delta(\mathcal K),

\begin{aligned} \mathbb E[\mathcal R_T]  & = \sum_{t=1}^T \mathbb E[\ell_t(a_t)] - \min_a \sum_{t=1}^T \ell_t(a) \\ & = \sum_{t=1}^T \braket{\ell_t, p_t} - \min_p \sum_{t=1}^T \braket{\ell_t, p}. \end{aligned}

where expectation is taken w.r.t. the randomness of the algorithm.

Applying this to \hat \ell_t(a) allows us to bound (*) in the regret using Hedge from the previous post even though \ell_t(\cdot) is non-convexNotice that even though we recast it in terms of a mixed strategy p_t the loss we incur is still w.r.t. to a single arm selection a_t. This can seem slightly confusing. However, this slight modification of Hedge has no effect on the regret, since p_t is still constructed based on full information. Thus, the expectation of the loss over the randomness of the algorithm is equivalent to the linear loss on the mixed strategy, E_{a_t \sim p_t}[\ell(a_t)]=\braket{\ell_t, p_t}.


Using the high probability bound relying on Azume-Hoeffding inequility from the previous post we get with probability 1-\delta that

(*) = \mathcal{O}(\sqrt{T \log K_{i}}+\sqrt{T \log (2 / \delta)}).

5.1.3 Bounding the Variance Term

What remains is to bound (**). This part is what requires us to have known adversarial actions. Otherwise we can intuitively not ensure that the variance will decrease.

5.1.4 Final Regret Bound

(GP-MW finite arms regret bounds) Let

  • \beta_{t}=B+\sqrt{2\left(\gamma_{t-1}+\log (2 / \delta)\right)}
  • \eta=\sqrt{\left(8 \log K_{i}\right) / T}


 R^{i}(T)=\mathcal{O}(\sqrt{T \log K_{i}}+\sqrt{T \log (2 / \delta)}+B \sqrt{T \gamma_{T}}+\sqrt{T \gamma_{T}\left(\gamma_{T}+\log (2 / \delta)\right)})

5.2 Discretization

So far the algorithm has been defined in terms of a finite number of arms (even if the \operatorname{UCB}_t(\cdot) on which the mixed strategy is computed is in fact continuous). This does not help solve the non-convex infinite arm setting which we set out for originally. One trivial solution to this taken in [1] is to discretize the space.

For this we just have to be careful to cover our space densely enough so that the regret will not suffer. The Lipschitz assumption on \ell_t(\cdot) will ensure this, e.i. if points are close to the discretization their loss will also be close.

To make this precise:

(GP-MW infinite arms regret bounds) Discretize \mathcal K_i \subset \mathbb R^d_i with [\mathcal K_i]_T where \left|\left[\mathcal{A}^{i}\right]_{T}\right|=(L b \sqrt{d_{i} T})^{d} so that \left\|a-[a]_{T}\right\|_{1} \leq \sqrt{d_{i} / T} / L.

5.3 Final Remarks

In the OCO framing we have thus treated non-convex function in the bandit setting when the decision set is a simplex (finite or infinite).

For more general non-convex problems for online optimization we face slightly different problems. First, how do we even define regret? After all we can do arbitrarily bad since in the non-convex setting we are only promised to converge to a local minimum. Two recent papers deals with these questions .

I have intentionally left out some details of the canonical BO/GP part to keep distractions to a minimum — but with the obvious danger of being imprecise. For precisely specified assumptions of each lemma and statement please see [1].

[1] P. G. Sessa, I. Bogunovic, M. Kamgarpour, and A. Krause, “No-regret learning in unknown games with correlated payoffs,” in Advances in neural information processing systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 13602–13611 [Online]. Available:

[2] C.-A. Cheng, R. T. des Combes, B. Boots, and G. Gordon, “A reduction from reinforcement learning to no-regret online learning.” 2019.