Bayesian Logistic Regression

I recently had to run an experiment with Bayesian logistic regression but couldn’t find a single place that contained both a satisfying explanation of the motivation and a derivation of the gradient.

We assume that the class label y_i \in \{0,1\} can be predicted from the input x_i and some shared latent variable \beta through the logistic function \sigma(\cdot),

 \begin{split} P(y=1|x,\beta) &= \sigma(\beta^\top x) \quad \text{with} \quad \sigma(a)=\frac{\exp (a)}{1+\exp (a)}. \end{split}

This allows us to predict the label of a new point x_* by marginalizing over the inferred posterior distribution of \beta,

 p(y_* | x_*, Y,X) = \int p(y_* |x_*,\beta)p(\beta|X,Y)\ \mathrm{d}\beta.
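In practice this integral has no closed form; given samples \beta_s from (an approximation to) the posterior, a simple Monte Carlo estimate of the predictive distribution is,

 p(y_*=1 | x_*, Y,X) \approx \frac{1}{S}\sum_{s=1}^{S} \sigma(\beta_s^\top x_*), \qquad \beta_s \sim p(\beta|X,Y).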

We don’t know the posterior p(\beta|X,Y) though. How can we learn this distribution? By Bayes’ rule we obtain,

 p(\beta|Y,X) \propto p(Y|X,\beta)p(\beta).

MLE

A common way to estimate p(\beta|Y,X) is with a point estimate through maximum likelihood estimation (MLE). This simply uses the mode of the likelihood (effectively assuming a uniform prior on \beta),

 \hat{\beta} = \arg\max_\beta\, p(Y|X,\beta).

For logistic regression there is no closed-form solution for the above, but the negative log-likelihood is convex (its Hessian is positive semi-definite, and positive definite when X has full column rank), so standard optimization techniques apply to find \hat{\beta}. But what if we wanted the full distribution over \beta?
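For concreteness, here is a minimal sketch of the MLE in Python (numpy/scipy) on synthetic data; \ln(1+\exp(a)) is computed stably with logaddexp, and the gradient passed to the optimizer is the score derived in section 2 below:

    import numpy as np
    from scipy.optimize import minimize

    # Synthetic data for illustration (N=100 points, d=3 features).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = (rng.random(100) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

    def neg_log_lik(beta):
        a = X @ beta
        # -l(beta) = sum_i [ ln(1 + exp(a_i)) - y_i * a_i ], computed stably.
        return np.sum(np.logaddexp(0.0, a) - y * a)

    def neg_score(beta):
        # Negative of the score: -sum_i (y_i - sigma(a_i)) x_i.
        p = 1 / (1 + np.exp(-(X @ beta)))
        return X.T @ (p - y)

    beta_hat = minimize(neg_log_lik, np.zeros(3), jac=neg_score, method="BFGS").x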

Approximation

p(\beta|Y,X) has no nice parametric form that we know of. Usually two approaches are considered:

  • Decide on a deterministic method to obtain an approximation. If the approximation lives within some chosen parametric family, this is called variational inference. The Laplace approximation, which constructs a Gaussian around the mode of p(\beta|Y,X), can also be considered such a method (a sketch follows after this list).
  • Use MCMC to draw samples from p(\beta|Y,X) and use the resulting empirical distribution (a weighted combination of point masses) as the approximation.

For either approach, the gradient of the log-likelihood is usually required.
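To make the first approach concrete, here is a minimal sketch of the Laplace approximation under a Gaussian prior \mathcal N(0, s^2 I) (the function name and prior scale are illustrative choices, not from any library): find the MAP mode, then place a Gaussian there whose covariance is the inverse Hessian of the negative log-posterior.

    import numpy as np
    from scipy.optimize import minimize

    def laplace_approx(X, y, prior_var=10.0):
        """Gaussian approximation N(mode, H^{-1}) to p(beta | X, y)."""
        d = X.shape[1]

        def neg_log_post(beta):
            a = X @ beta
            return np.sum(np.logaddexp(0.0, a) - y * a) + beta @ beta / (2 * prior_var)

        def grad(beta):
            p = 1 / (1 + np.exp(-(X @ beta)))
            return X.T @ (p - y) + beta / prior_var

        mode = minimize(neg_log_post, np.zeros(d), jac=grad, method="BFGS").x
        p = 1 / (1 + np.exp(-(X @ mode)))
        # Hessian of the negative log-posterior at the mode:
        # X^T diag(p * (1 - p)) X + I / prior_var.
        H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(d) / prior_var
        return mode, np.linalg.inv(H)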

2 Derivatives of the log-likelihood

In all of these approaches we need to work with the log-likelihood. Specifically, we require its gradient. Assuming i.i.d. samples we can write the likelihood as,

 L(\beta ; Y, X)=\prod_{i=1}^{N}\left[\sigma\left(x_{i}^{\top} \beta\right)\right]^{y_{i}}\left[1-\sigma\left(x_{i}^{\top} \beta\right)\right]^{1-y_{i}}.

This correctly “activates” either P(Y_i=1|X_i,\beta) or P(Y_i=0|X_i,\beta)=1-P(Y_i=1|X_i,\beta), depending on the value of y_i.

Taking the logarithm and using \ln \sigma(a) = a - \ln(1+\exp(a)) and \ln(1-\sigma(a)) = -\ln(1+\exp(a)), the log-likelihood becomes,

 l(\beta ; Y, X)=\sum_{i=1}^{N}\left[-\ln \left(1+\exp \left(x_{i}^{\top} \beta\right)\right)+y_{i} x_{i}^{\top} \beta\right].

The derivative of the log-likelihood (sometimes called the score) can then be shown to be,

 \nabla_{\beta} l(\beta ; Y, X)=\sum_{i=1}^{N}\left[y_{i}-\sigma\left(x_{i}^{\top} \beta\right)\right] x_{i}. \qquad(1)
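A quick sanity check of eq. (1) is to compare it against finite differences; scipy.optimize.check_grad does exactly this (a minimal sketch on synthetic data):

    import numpy as np
    from scipy.optimize import check_grad

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4))
    y = (rng.random(50) < 0.5).astype(float)

    def log_lik(beta):
        a = X @ beta
        return np.sum(y * a - np.logaddexp(0.0, a))

    def score(beta):
        # eq. (1): sum_i (y_i - sigma(x_i^T beta)) x_i.
        p = 1 / (1 + np.exp(-(X @ beta)))
        return X.T @ (y - p)

    # Norm of the difference between eq. (1) and a finite-difference
    # gradient at a random point; should be tiny (around 1e-6).
    print(check_grad(log_lik, score, rng.normal(size=4)))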

3 In practice

With the log-likelihood defined this way, we need y_i \in \{0,1\} for things to work out nicely.
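If the labels instead come as y_i \in \{-1,+1\}, the identity 1-\sigma(a)=\sigma(-a) gives the equivalent compact form,

 l(\beta ; Y, X) = -\sum_{i=1}^{N} \ln\left(1+\exp\left(-y_{i} x_{i}^{\top} \beta\right)\right),

so either convention works as long as the labels and the formulas agree.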

4 Connection with optimization

There is a nice connection with regularization if, instead of MLE, we do maximum a posteriori (MAP) estimation, which simply maximizes,

 p(\beta|X,Y) \propto p(Y|X,\beta)\,p(\beta).

Working with the potentials instead (i.e. the log probabilities),

 \log p(\beta|X,Y) = l(\beta ; Y, X) + \log p(\beta) + \text{const}.

If we choose a Gaussian prior p(\beta) = \mathcal N(0, \lambda^{-1} I) this becomes,

 \log p(\beta|X,Y) = l(\beta ; Y, X) - \frac{\lambda}{2} \|\beta\|_2^2 + \text{const}, \qquad(2)

so maximizing the posterior is simply minimizing the negative log-likelihood with L_2-regularization!
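To check this equivalence numerically, we can compare a direct MAP fit against an off-the-shelf L_2-regularized solver. A minimal sketch on synthetic data (scikit-learn's LogisticRegression minimizes C \cdot \text{NLL} + \frac{1}{2}\|\beta\|_2^2, so C = 1/\lambda, and fit_intercept=False keeps the two objectives identical):

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    y = (rng.random(200) < 1 / (1 + np.exp(-X @ np.array([1.0, -1.0, 2.0])))).astype(float)
    lam = 0.5

    def neg_log_post(beta):
        a = X @ beta
        # Negative log-likelihood plus the prior term (lambda / 2) * ||beta||^2.
        return np.sum(np.logaddexp(0.0, a) - y * a) + 0.5 * lam * (beta @ beta)

    beta_map = minimize(neg_log_post, np.zeros(3), method="BFGS").x

    clf = LogisticRegression(C=1 / lam, fit_intercept=False).fit(X, y)
    print(beta_map, clf.coef_.ravel())  # agree up to solver tolerance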

5 Alternatives

The two assumptions we made in the beginning can be altered:

  • The logistic function can be exchanged for another sigmoid function, e.g. the cumulative distribution function of the standard normal (which gives probit regression).
  • We could choose a different prior than the Gaussian. However, the Gaussian is particularly nice computationally since it makes the optimization objective strongly convex. This is desirable even if we sample from the posterior p(\beta|X,Y) instead of optimizing with MAP, because gradient-based sampling techniques such as the Unadjusted Langevin Algorithm (ULA) similarly benefit from strong convexity (see e.g. [1]); a minimal ULA sketch follows below.
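Here is a minimal ULA sketch for this posterior (step size, prior variance, and chain length are illustrative choices, not tuned recommendations): each iteration takes a gradient step on the log-posterior plus injected Gaussian noise.

    import numpy as np

    def ula_samples(X, y, n_steps=5000, step=1e-3, prior_var=10.0, seed=0):
        """Unadjusted Langevin Algorithm targeting p(beta | X, y)."""
        rng = np.random.default_rng(seed)
        beta = np.zeros(X.shape[1])
        samples = []
        for _ in range(n_steps):
            p = 1 / (1 + np.exp(-(X @ beta)))
            # Gradient of the log-posterior: the score, eq. (1), plus the
            # gradient of the Gaussian log-prior, -beta / prior_var.
            grad = X.T @ (y - p) - beta / prior_var
            # Langevin update: gradient step plus sqrt(2 * step) Gaussian noise.
            beta = beta + step * grad + np.sqrt(2 * step) * rng.normal(size=beta.shape)
            samples.append(beta.copy())
        return np.array(samples)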

6 Resources

A resource I found useful was Roman Garnett’s lecture notes.

7 Exercises

  1. Derive the gradient of the log-likelihood in eq. 1.
  2. Verify that a Gaussian prior is equivalent to L_2-regularization in eq. 2.

[1] A. Durmus, S. Majewski, and B. Miasojedow, “Analysis of Langevin Monte Carlo via convex optimization,” Journal of Machine Learning Research, vol. 20, no. 73, pp. 1–46, 2019.