Bayesian logistic regression#
Posted on 2020-05-15
I recently had to run an experiment with Bayesian logistic regression but didn't find a single resource that contained both a satisfying explanation of the motivation and a derivation of the gradient.
We assume that the class label $y \in \{-1, +1\}$ of an input $x \in \mathbb{R}^d$ follows

$$p(y \mid x, w) = \sigma(y\, w^\top x), \qquad \sigma(a) = \frac{1}{1 + e^{-a}},$$

where $\sigma$ is the logistic function, and we place a Gaussian prior $p(w) = \mathcal{N}(w;\, 0,\, \lambda^{-1} I)$ on the weights.
This allows us to predict new points by marginalizing out the inferred distribution over $w$ given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$,

$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w.$$

We don't know the posterior $p(w \mid \mathcal{D})$ in closed form, so it has to be approximated.
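For instance, given samples from the posterior (obtained with one of the approximation schemes discussed below), the predictive integral reduces to a simple Monte Carlo average. A minimal sketch, where `w_samples` is a hypothetical array of posterior draws:

```python
import numpy as np
from scipy.special import expit  # the logistic function sigma

def predict_proba(x_new, w_samples):
    """Monte Carlo estimate of p(y = +1 | x_new, D):
    (1 / S) * sum_s sigma(w_s^T x_new), where w_samples has shape (S, d)."""
    return expit(w_samples @ x_new).mean()
```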
MLE
A common way to estimate $w$ is maximum likelihood estimation, i.e. maximizing the likelihood $p(\mathcal{D} \mid w)$, or equivalently minimizing the negative log-likelihood.
For logistic regression there is no closed-form solution for the above, but the negative log-likelihood is convex (its Hessian is positive semi-definite), so standard optimization techniques apply to find the maximum likelihood estimate $\hat{w}_{\mathrm{MLE}}$ (a concrete sketch is given further below).
Approximation
There are two common ways to approximate the posterior:

- Decide on a deterministic method to obtain an approximation. If the approximation lives within some chosen parametric family this is called variational inference; the Laplace approximation, which constructs a Gaussian around the mode of $p(w \mid \mathcal{D})$ (written out below), can also be considered such a method.
- Use MCMC to draw samples from $p(w \mid \mathcal{D})$ and construct a linear combination of point estimates as the approximation.
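For concreteness, the Laplace approximation replaces the posterior with a Gaussian centered at its mode $\hat{w}$ (the MAP estimate discussed later), with covariance given by the inverse Hessian of the negative log posterior at that mode:

$$
p(w \mid \mathcal{D}) \approx \mathcal{N}\!\bigl(w;\ \hat{w},\ H^{-1}\bigr),
\qquad
H = -\nabla_w^2 \log p(w \mid \mathcal{D})\big|_{w = \hat{w}}.
$$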
For either approach, the gradient of the log-likelihood is usually required.
Derivatives of log-likelihood#
In all of the approaches we need to work with the log-likelihood; specifically, we require its gradient. Assuming i.i.d. samples we can write the likelihood as

$$p(\mathcal{D} \mid w) = \prod_{i=1}^{n} p(y_i \mid x_i, w) = \prod_{i=1}^{n} \sigma(y_i\, w^\top x_i).$$

This correctly "activates" either $\sigma(w^\top x_i)$ (when $y_i = +1$) or $1 - \sigma(w^\top x_i) = \sigma(-w^\top x_i)$ (when $y_i = -1$), using the symmetry of the logistic function.
The log-likelihood is trivially

$$\log p(\mathcal{D} \mid w) = \sum_{i=1}^{n} \log \sigma(y_i\, w^\top x_i).$$

The derivative of the log-likelihood (sometimes called the score) can then be shown to be

$$\nabla_w \log p(\mathcal{D} \mid w) = \sum_{i=1}^{n} \bigl(1 - \sigma(y_i\, w^\top x_i)\bigr)\, y_i\, x_i.$$
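To make the formulas concrete, here is a minimal NumPy sketch of the negative log-likelihood and its gradient (the helper names and toy data are just illustrative), together with a finite-difference check and an MLE fit using an off-the-shelf optimizer:

```python
import numpy as np
from scipy.optimize import approx_fprime, minimize
from scipy.special import expit  # numerically stable logistic function

def neg_log_lik(w, X, y):
    """Negative log-likelihood -sum_i log sigma(y_i w^T x_i), with y_i in {-1, +1}.
    Uses log sigma(a) = -log(1 + exp(-a)) = -logaddexp(0, -a) for stability."""
    return np.logaddexp(0.0, -y * (X @ w)).sum()

def neg_log_lik_grad(w, X, y):
    """Gradient of the negative log-likelihood: -sum_i (1 - sigma(y_i w^T x_i)) y_i x_i."""
    a = y * (X @ w)
    return -((1.0 - expit(a)) * y) @ X

# Toy data, just to check the derivation numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.choice([-1.0, 1.0], size=50)
w0 = rng.normal(size=3)

num_grad = approx_fprime(w0, neg_log_lik, 1e-6, X, y)
assert np.allclose(num_grad, neg_log_lik_grad(w0, X, y), atol=1e-4)

# The convex negative log-likelihood can be minimized with standard tools.
w_mle = minimize(neg_log_lik, np.zeros(3), args=(X, y), jac=neg_log_lik_grad).x
```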
In practice#
The way we have defined the log-likelihood we need the labels to be $y_i \in \{-1, +1\}$, so a dataset with $\{0, 1\}$ labels has to be remapped first (e.g. $y_i \leftarrow 2 y_i - 1$).
Connection with optimization#
There is a nice connection with regularization if, instead of MLE, we do maximum a posteriori (MAP) estimation, which simply maximizes the (unnormalized) posterior,

$$p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w).$$

Working with the potential functions instead (i.e. the log probability),

$$\log p(w \mid \mathcal{D}) = \log p(\mathcal{D} \mid w) + \log p(w) + \text{const}.$$

If we choose a Gaussian prior $p(w) = \mathcal{N}(w;\, 0,\, \lambda^{-1} I)$, the (negated) objective to minimize becomes

$$-\log p(\mathcal{D} \mid w) + \frac{\lambda}{2} \lVert w \rVert_2^2 + \text{const},$$

which is simply the negative log-likelihood with $\ell_2$-regularization.
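Reusing the helpers from the sketch above (and a hypothetical `lam` for the prior precision $\lambda$), the MAP objective is just the negative log-likelihood plus an $\ell_2$ penalty:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(w, X, y, lam):
    """Negative log posterior up to a constant: NLL + (lam / 2) * ||w||^2."""
    return neg_log_lik(w, X, y) + 0.5 * lam * (w @ w)

def neg_log_posterior_grad(w, X, y, lam):
    """Gradient of the negative log posterior: NLL gradient + lam * w."""
    return neg_log_lik_grad(w, X, y) + lam * w

# MAP estimate; equivalent to L2-regularized logistic regression.
# lam = 1.0 is an arbitrary choice.
w_map = minimize(neg_log_posterior, np.zeros(X.shape[1]), args=(X, y, 1.0),
                 jac=neg_log_posterior_grad).x
```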
Alternatives#
The two assumptions we made in the beginning can be altered:
- The logistic function can be exchanged for another sigmoid function, e.g. the standard normal cumulative distribution function (which gives probit regression).
- We could choose a prior other than a Gaussian. However, the Gaussian is particularly nice computationally since it makes the optimization objective strongly convex. Even if we sample from the posterior $p(w \mid \mathcal{D})$ instead of optimizing with MAP this is desirable, because gradient-based sampling techniques such as the Unadjusted Langevin Algorithm (ULA) similarly benefit from strong convexity (see e.g. Durmus and Majewski [2019], and the sketch below).
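To illustrate why the gradient also matters for sampling, here is a minimal sketch of ULA targeting the posterior, reusing `neg_log_posterior_grad` and the toy data from the sketches above; the step size and number of iterations are arbitrary choices:

```python
import numpy as np

def ula(grad_neg_log_post, w0, step, n_samples, rng):
    """Unadjusted Langevin Algorithm:
    w_{k+1} = w_k - step * grad U(w_k) + sqrt(2 * step) * xi_k,  xi_k ~ N(0, I),
    where U is the negative log posterior (so -grad U is the gradient of log p(w | D))."""
    w = np.array(w0, dtype=float)
    samples = np.empty((n_samples, w.size))
    for k in range(n_samples):
        w = w - step * grad_neg_log_post(w) + np.sqrt(2.0 * step) * rng.normal(size=w.size)
        samples[k] = w
    return samples

rng = np.random.default_rng(1)
w_samples = ula(lambda w: neg_log_posterior_grad(w, X, y, 1.0),
                w0=np.zeros(X.shape[1]), step=1e-2, n_samples=2000, rng=rng)
```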
Resources#
A resource I found useful was Roman Garnett's lecture notes.
Exercises#
- Derive the gradient of the log-likelihood (the score) stated above.
- Verify that a Gaussian prior is equivalent to $\ell_2$-regularization in the MAP objective above.
- DM19
Alain Durmus and Szymon Majewski. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.