Regression is arguably one of the most fundamental tasks in machine learning (and statistics).
Given some datapoints and corresponding target values, we want to find a function that “predicts” target values for new input datapoints. For example, our data might contain features about houses (floor area, location, number of floors, whether there’s a view, etc.) and the target might be the price of the house. A regression model could be fit on such data, and produce a predicted price given the features of a new house.
I’ve encountered this topic many times in different contexts, with different terminology, mathematical tools and assumptions used in order to formalize it. Usually terms like “likelihood” and “posterior” are used. There are also some “well known” facts (to those who know them well) which are often mentioned in passing as trivial or easy to show.
This post is my attempt to explain this topic to myself, and to highlight the connections between various ways in which regression problems are often formulated. Specifically, it will cover:
- The basics of supervised learning
- A probabilistic view of regression
- MLE vs MAP
- Bayesian regression, and how it compares to MLE and MAP
- How the squared error loss is a consequence of MLE
- Why L2 regularization is equivalent to a Gaussian prior
Supervised learning
Let’s start by setting the scene. In the context of supervised learning, we’re trying to estimate an unknown function from labelled data, possibly noisy and incomplete. There are three main ingredients: the data, the model and the objective.
Data
We are given a dataset of $N$ labeled samples, $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is a sample, or feature vector, and $y_i$ is the label. Here we’ll focus on simple regression tasks, where we’ll usually have $y_i \in \mathbb{R}$ for simple univariate regression.
All we’re assuming here is that each labeled sample was independently drawn from some unknown joint distribution $p(\mathbf{x}, y)$.
Note also that from this perspective, our given dataset $\mathcal{D}$ is itself a random variable: just one realization of $N$ i.i.d. draws from this distribution.
Model
Our model is a parametrized function $f_{\boldsymbol{\theta}}(\mathbf{x})$, where $\boldsymbol{\theta}$ denotes the model parameters. The set of functions we can obtain by varying $\boldsymbol{\theta}$ is known as the hypothesis class.
For example, here are a few very common hypothesis classes.
- Linear: a simple affine transformation of the features, $f_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$. Note that the model parameters are $\boldsymbol{\theta} = (\mathbf{w}, b)$.
- Linear with fixed basis functions: as above, but where an arbitrary (but fixed) transformation $\boldsymbol{\phi}$ is applied to the features, $f_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b$. This transformation can be thought of as a feature extraction or pre-processing step. Note that this is still a linear model, since it’s linear in the model parameters (weights), but it’s far more expressive since the relationship between the input features and the output can be arbitrarily complex.
- Perceptron: a non-linear function applied to the output of an affine transformation, $f_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$. Here $\sigma$ is some nonlinear function, for example a sigmoid. This hypothesis class is used e.g. in logistic regression.
- Neural network: a composition of $L$ layers, each with multiple perceptron models, $f_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma_L(\mathbf{W}_L \cdots\, \sigma_1(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) \cdots + \mathbf{b}_L)$, where the $\mathbf{W}_l, \mathbf{b}_l$ are weight matrices and bias vectors, respectively. In this context, the humble perceptron model is called a neuron instead (due to the biologically inspired origin of this model). Note that each row of each weight matrix corresponds to a separate neuron.
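To make these hypothesis classes concrete, here is a minimal NumPy sketch of each one. The function names, shapes, and the choice of tanh/sigmoid nonlinearities are my own illustrative choices, not prescribed by anything above.

```python
import numpy as np

def linear(x, w, b):
    """Affine transformation of the features: w^T x + b."""
    return w @ x + b

def linear_basis(x, w, b, phi=np.tanh):
    """Still linear in the weights, but applied to transformed features phi(x)."""
    return w @ phi(x) + b

def perceptron(x, w, b, sigma=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """A nonlinearity (here a sigmoid) applied to an affine transformation."""
    return sigma(w @ x + b)

def two_layer_network(x, W1, b1, w2, b2):
    """A composition of layers; each row of W1 is a separate neuron."""
    h = np.tanh(W1 @ x + b1)   # hidden layer of "neurons"
    return w2 @ h + b2         # linear output layer for regression

# Example usage on a 3-dimensional feature vector.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
print(linear(x, w=rng.normal(size=3), b=0.1))
```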
In any case, our model outputs a prediction, which we’ll denote as $\hat{y}_i = f_{\boldsymbol{\theta}}(\mathbf{x}_i)$.
The number of parameters in the hypothesis class is sometimes referred to as the model’s “capacity”. Intuitively, the greater the capacity, the better it can fit the given data.
Objective
We’d like our model’s outputs for the given input features to be “close” in some way to given target values. The objective function is what measures this “closeness”, so we want our model’s predictions to maximize it with respect to the given data. In machine learning, the objective is usually defined as a loss, i.e. a function we want to minimize.
A pointwise loss is some function $\ell(\hat{y}, y)$ which measures the discrepancy between a single prediction and its corresponding target. A very common choice for regression is the squared error, $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
Ideally, we would like to find a predictor function (hypothesis) $f_{\boldsymbol{\theta}}$ which minimizes the expected pointwise loss over the true joint distribution,
$$ L(\boldsymbol{\theta}) = \mathbb{E}_{(\mathbf{x}, y) \sim p(\mathbf{x}, y)}\left[\ell\left(f_{\boldsymbol{\theta}}(\mathbf{x}), y\right)\right]. $$
This is known as the population loss or out-of-sample loss. Notice that it’s not computed on the dataset, but over the true distribution. This is actually a deterministic quantity: it’s the expectation of a random variable.
But can we actually solve this problem, i.e. find the minimizer of $L(\boldsymbol{\theta})$?
No, because the joint distribution $p(\mathbf{x}, y)$ is unknown to us; all we have is a finite sample drawn from it.
Instead, given our training set $\mathcal{D}$, we minimize the empirical loss,
$$ L_{\mathcal{D}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ell\left(f_{\boldsymbol{\theta}}(\mathbf{x}_i), y_i\right). $$
Is the empirical loss also a deterministic quantity?
No, because it depends on the randomness of the dataset sampling. In other words, it’s computed only on a single realization of the dataset $\mathcal{D}$; a different realization would generally give a different value.
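To make this tangible, here is a small NumPy sketch (with made-up data and a linear model, both assumptions of mine for illustration) showing that the empirical loss changes from one dataset realization to the next, even with the parameters held fixed.

```python
import numpy as np

def empirical_loss(theta, X, y):
    """Mean squared error of a linear model f_theta(x) = theta @ x on dataset (X, y)."""
    predictions = X @ theta
    return np.mean((predictions - y) ** 2)

rng = np.random.default_rng(42)
true_theta = np.array([2.0, -1.0])

def sample_dataset(n):
    """One realization of N i.i.d. samples from an assumed linear-plus-noise distribution."""
    X = rng.normal(size=(n, 2))
    y = X @ true_theta + rng.normal(scale=0.5, size=n)
    return X, y

# Two realizations of the dataset give two different empirical losses
# for the same parameters -- the empirical loss is itself a random quantity.
for _ in range(2):
    X, y = sample_dataset(50)
    print(empirical_loss(true_theta, X, y))
```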
Regularization
The objective may be augmented with an additional term, $R(\boldsymbol{\theta})$, known as a regularization term.
Regularization is added in order to “encourage” the model to have some additional properties. A common reason to add it is to prevent our model from being too dependent on the specific data in $\mathcal{D}$, i.e. to reduce overfitting.
Two very common regularization terms are:
- L2, i.e. $R(\boldsymbol{\theta}) = \|\boldsymbol{\theta}\|_2^2$. Adding this results in what’s known as Ridge regression. This penalizes large magnitudes of the parameter values. We’ll see why it makes sense later on.
- L1, i.e. $R(\boldsymbol{\theta}) = \|\boldsymbol{\theta}\|_1$, which produces Lasso regression. This also penalizes large magnitudes, but in addition promotes parameter sparsity (but that’s for another post).
Choosing the squared error pointwise loss and adding L2 regularization would give us the following objective to minimize,
$$ L_{\mathcal{D}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\left(f_{\boldsymbol{\theta}}(\mathbf{x}_i) - y_i\right)^2 + \lambda \|\boldsymbol{\theta}\|_2^2, $$
where $\lambda > 0$ is a hyperparameter controlling the regularization strength.
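As a concrete sketch, assuming for simplicity a model that is linear in its parameters (so the minimizer has a closed form), the objective and its minimizer might look like this in NumPy; the synthetic data and the value of `lam` are placeholders.

```python
import numpy as np

def ridge_objective(theta, X, y, lam):
    """Sum of squared errors plus an L2 penalty on the parameters
    (using a sum rather than a mean only rescales lam)."""
    residuals = X @ theta - y
    return np.sum(residuals ** 2) + lam * np.sum(theta ** 2)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of the objective above for a linear model:
    theta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.5, size=100)
theta = ridge_fit(X, y, lam=1.0)
print(theta, ridge_objective(theta, X, y, lam=1.0))
```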
The probabilistic view
Let’s go a bit deeper and consider the actual assumptions being made to formulate the supervised learning problem.
Suppose we further assume that given an input feature vector $\mathbf{x}$, its target value was generated as
$$ y = f(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), $$
where $f$ is some unknown deterministic function and $\epsilon$ is additive Gaussian noise with a fixed variance $\sigma^2$.
Now, instead of just predicting one target value $\hat{y}$ for a given input, we can aim to estimate the full conditional distribution $p(y|\mathbf{x})$ of possible target values.
To estimate the unknown conditional distribution, we can parametrize it as
$$ p(y|\mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}\!\left(y \,;\, f_{\boldsymbol{\theta}}(\mathbf{x}),\, \sigma^2\right), $$
which means that the conditional probability distribution is Gaussian, centered around our model’s prediction, and with a fixed variance $\sigma^2$.
Given these assumptions, how can we estimate the distribution of interest, $p(y|\mathbf{x})$?
We’ll look at three different options: maximum likelihood estimation (MLE), maximum a posteriori (MAP), and Bayesian inference.
Maximum likelihood estimation
The parametrized distribution $p(y|\mathbf{x}; \boldsymbol{\theta})$, viewed as a function of the parameters $\boldsymbol{\theta}$, is known as the likelihood. Maximum likelihood estimation seeks the parameters under which the observed dataset is most probable,
$$ \boldsymbol{\theta}_{\mathrm{MLE}} = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} p(y_i|\mathbf{x}_i; \boldsymbol{\theta}). $$
We’re using the fact that the dataset was sampled i.i.d. to write its total probability as a simple product.
In practice, working with products is inconvenient. The common trick for dealing with such cases is to maximize the logarithm of the total likelihood instead: because log is a monotonic function, we’ll obtain the same maximizer,
$$ \boldsymbol{\theta}_{\mathrm{MLE}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p(y_i|\mathbf{x}_i; \boldsymbol{\theta}). $$
To proceed, we’ll plug in our assumption about the specific form of the likelihood: it’s a Gaussian distribution, i.e.
$$ p(y_i|\mathbf{x}_i; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i)\right)^2}{2\sigma^2}\right). $$
Plugging in the above and taking the log, we’re left with
$$ \boldsymbol{\theta}_{\mathrm{MLE}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \left[ -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{\left(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i)\right)^2}{2\sigma^2} \right] = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} \left(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i)\right)^2. $$
To get the last equation, we dropped the first term and the $\frac{1}{2\sigma^2}$ factor, since neither depends on $\boldsymbol{\theta}$, and flipped the sign to turn the maximization into a minimization.
A sharp-eyed reader will immediately notice that we’re left with a simple minimization of the squared error pointwise loss over our data.
This shows why the squared error pointwise loss is such a natural choice: it arises directly from likelihood maximization, if we’re willing to make the fairly reasonable assumption that the data was generated by a deterministic function of the inputs, corrupted by additive Gaussian noise.
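A quick numerical sanity check of this equivalence: for any fixed noise variance, ranking candidate parameters by Gaussian log-likelihood gives exactly the reverse of ranking them by squared error. The linear model and synthetic data below are my own illustrative choices.

```python
import numpy as np

def gaussian_log_likelihood(theta, X, y, sigma2):
    """Sum of log N(y_i; theta @ x_i, sigma2) over the dataset."""
    residuals = y - X @ theta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(residuals ** 2) / (2 * sigma2)

def sum_squared_error(theta, X, y):
    return np.sum((y - X @ theta) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.3, size=100)

# Ordering candidates by decreasing log-likelihood matches ordering them
# by increasing squared error, for any fixed sigma2.
candidates = [rng.normal(size=2) for _ in range(5)]
by_ll = sorted(candidates, key=lambda t: gaussian_log_likelihood(t, X, y, sigma2=0.09), reverse=True)
by_sse = sorted(candidates, key=lambda t: sum_squared_error(t, X, y))
print(all(np.allclose(a, b) for a, b in zip(by_ll, by_sse)))  # True
```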
We can now also empirically estimate the variance $\sigma^2$ of the noise,
$$ \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f_{\boldsymbol{\theta}_{\mathrm{MLE}}}(\mathbf{x}_i)\right)^2, $$
which is the standard variance estimator for our data, computed around the mean given by our maximum-likelihood model.
We thereby achieved our goal: now, given a new input $\mathbf{x}$, we can predict not just a single value but an entire conditional distribution, $p(y|\mathbf{x}; \boldsymbol{\theta}_{\mathrm{MLE}}) = \mathcal{N}\!\left(y \,;\, f_{\boldsymbol{\theta}_{\mathrm{MLE}}}(\mathbf{x}),\, \hat{\sigma}^2\right)$.
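Putting the MLE pieces together, here is a sketch for the special case of a linear model (an assumption made here just so the fit has a closed form via least squares); the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.4, size=200)

# MLE for a linear model with Gaussian noise: ordinary least squares.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# Noise variance estimated around the maximum-likelihood predictions.
sigma2_hat = np.mean((y - X @ theta_mle) ** 2)

# For a new input, the predicted conditional distribution is
# N(y; theta_mle @ x_new, sigma2_hat).
x_new = np.array([0.5, -1.0])
print("predictive mean:", theta_mle @ x_new)
print("predictive std :", np.sqrt(sigma2_hat))
```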
Maximum a posteriori estimation (MAP)
One thing that might seem unintuitive about the MLE approach is that we’re maximizing the probability of observing the data we have, given some model parameters. But in practice, what we face with supervised learning is the other way around: we are given the dataset $\mathcal{D}$, and want to infer the most plausible model parameters from it.
In other words, it seems to make more sense to maximize the probability of the parameters given the data, i.e. $p(\boldsymbol{\theta}|\mathcal{D})$.
In the Bayesian terminology, $p(\boldsymbol{\theta}|\mathcal{D})$ is called the posterior, $p(\mathcal{D}|\boldsymbol{\theta})$ is the likelihood, and $p(\boldsymbol{\theta})$ is the prior, which encodes our beliefs about the parameters before observing any data.
The relation between these probability distributions is given by Bayes’ rule,
$$ p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{D})}, $$
and a MAP estimate seeks the model parameters which maximize this posterior. Because the evidence $p(\mathcal{D})$ in the denominator doesn’t depend on $\boldsymbol{\theta}$, it can be dropped from the maximization,
$$ \boldsymbol{\theta}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}), $$
which is very similar to the MLE problem above: it’s just multiplied by the prior.
At this point we need to introduce another assumption so that we can proceed with using the prior. We’ll assume that the model parameters come from a zero-mean multivariate Gaussian distribution:
$$ p(\boldsymbol{\theta}) = \mathcal{N}\!\left(\boldsymbol{\theta} \,;\, \mathbf{0},\, \sigma_{\boldsymbol{\theta}}^2 \mathbf{I}\right). $$
This assumption is not very restrictive, yet leads to an interesting result which we’ll see shortly. We can think of the prior variance $\sigma_{\boldsymbol{\theta}}^2$ as expressing how strongly we believe, before seeing any data, that the parameters should be small: the smaller it is, the more concentrated around zero we expect them to be.
Another advantage of this assumption, from an analysis perspective, is that the resulting maximization objective stays analytically convenient, since both the likelihood and the prior are Gaussians.
We can now plug this prior into the MAP estimate together with our likelihood from the previous section. We’ll also condition on the variance terms to treat them as constants. This gives us,
$$ \boldsymbol{\theta}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \left[\prod_{i=1}^{N} \mathcal{N}\!\left(y_i \,;\, f_{\boldsymbol{\theta}}(\mathbf{x}_i),\, \sigma^2\right)\right] \mathcal{N}\!\left(\boldsymbol{\theta} \,;\, \mathbf{0},\, \sigma_{\boldsymbol{\theta}}^2 \mathbf{I}\right). $$
This expression may seem a bit unwieldy, until we remember that both these distributions are just Gaussians, and that we have a neat trick which breaks these products into sums while simultaneously dropping exponents. By taking the log and removing constant terms from the maximization, we’re left with,
$$ \boldsymbol{\theta}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i)\right)^2 - \frac{1}{2\sigma_{\boldsymbol{\theta}}^2} \|\boldsymbol{\theta}\|_2^2 \right], $$
which can be re-written equivalently as the minimization,
$$ \boldsymbol{\theta}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \left[ \sum_{i=1}^{N} \left(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i)\right)^2 + \frac{\sigma^2}{\sigma_{\boldsymbol{\theta}}^2} \|\boldsymbol{\theta}\|_2^2 \right]. $$
If this expression seems familiar, that’s because this is nothing but the MLE objective with an L2 regularization term.
The key takeaway here is that adding L2 regularization is exactly equivalent to a Gaussian prior on the model parameters. It turns the MLE into a MAP.
We can also see that the regularization strength is $\lambda = \sigma^2 / \sigma_{\boldsymbol{\theta}}^2$, which gives it a clear interpretation:
- The regularization strength is inversely proportional to the variance of the prior. The more we’re confident that our prior is informative (low $\sigma_{\boldsymbol{\theta}}^2$), the more regularization we should add (high $\lambda$), and vice versa.
- For a fixed prior strength (fixed $\sigma_{\boldsymbol{\theta}}^2$), we should use stronger regularization if the dataset noise is strong (high $\sigma^2$). This corresponds to the notion that regularization prevents overfitting to the noise in the data.
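As a sanity check of this equivalence, the sketch below (assuming a linear model and arbitrary placeholder values for the two variances) numerically maximizes the log-posterior and compares the result to the closed-form ridge solution with $\lambda = \sigma^2 / \sigma_{\boldsymbol{\theta}}^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

sigma2 = 0.25                # assumed noise variance
sigma2_theta = 1.0           # assumed prior variance on the parameters
lam = sigma2 / sigma2_theta  # the induced regularization strength

def negative_log_posterior(theta):
    """Up to additive constants: SSE/(2*sigma2) + ||theta||^2/(2*sigma2_theta)."""
    sse = np.sum((y - X @ theta) ** 2)
    return sse / (2 * sigma2) + np.sum(theta ** 2) / (2 * sigma2_theta)

def grad(theta):
    """Exact gradient of the negative log-posterior above."""
    return -(X.T @ (y - X @ theta)) / sigma2 + theta / sigma2_theta

theta_map = minimize(negative_log_posterior, x0=np.zeros(3), jac=grad).x

# Ridge solution with lambda = sigma2 / sigma2_theta matches the MAP estimate.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.allclose(theta_map, theta_ridge, atol=1e-4))  # True (up to optimizer tolerance)
```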
Bayesian Regression
In both the MLE and MAP approaches, we ended up with a point estimate for the model parameters, i.e. a single value ($\boldsymbol{\theta}_{\mathrm{MLE}}$ or $\boldsymbol{\theta}_{\mathrm{MAP}}$) deemed the “best” choice.
In these cases, it was “best” in the sense that it corresponds to the location of the maximal value (mode) of either the likelihood $p(\mathcal{D}|\boldsymbol{\theta})$ or the posterior $p(\boldsymbol{\theta}|\mathcal{D})$.
Consider the posterior $p(\boldsymbol{\theta}|\mathcal{D})$: it is an entire distribution over possible parameter values, and its mode is just one summary of it. Many other parameter values may also explain the data reasonably well, and a point estimate simply discards that information.
A fully Bayesian approach would take the full posterior distribution into account, not just its mode. It does this by marginalizing over the model parameters $\boldsymbol{\theta}$.
In the Bayesian case, we estimate the posterior-predictive distribution, which in our notation is $p(y|\mathbf{x}, \mathcal{D})$: the distribution of the target for a new input, conditioned on the entire training dataset rather than on a single parameter estimate.
Estimating the posterior-predictive by marginalizing over $\boldsymbol{\theta}$ means computing
$$ p(y|\mathbf{x}, \mathcal{D}) = \int p(y|\mathbf{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{D})\, \mathrm{d}\boldsymbol{\theta}. $$
The first term in this integral is the likelihood for a single new datapoint $(\mathbf{x}, y)$, and the second is the posterior over the parameters given the training data. Intuitively, we’re averaging the predictions of every possible model, weighted by how probable each model is in light of the data. For a model that is linear in its parameters, with the Gaussian prior and likelihood assumed above, this integral can be computed analytically.
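For that linear-Gaussian special case, here is a small NumPy sketch of the closed-form posterior and posterior-predictive, following the standard Bayesian linear regression formulas (see e.g. Bishop, 2006); the data and variance values are placeholders of mine.

```python
import numpy as np

def bayesian_linear_regression(X, y, sigma2, sigma2_theta):
    """Closed-form posterior over the weights of a linear model,
    assuming a N(0, sigma2_theta * I) prior and Gaussian noise with variance sigma2."""
    d = X.shape[1]
    S_inv = np.eye(d) / sigma2_theta + X.T @ X / sigma2
    S = np.linalg.inv(S_inv)      # posterior covariance
    m = S @ X.T @ y / sigma2      # posterior mean
    return m, S

def posterior_predictive(x_new, m, S, sigma2):
    """Posterior-predictive N(y; mean, var) for a new input, obtained by
    marginalizing the linear-Gaussian model over the posterior on theta."""
    mean = m @ x_new
    var = sigma2 + x_new @ S @ x_new  # noise variance + parameter uncertainty
    return mean, var

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=30)

m, S = bayesian_linear_regression(X, y, sigma2=0.09, sigma2_theta=1.0)
print(posterior_predictive(np.array([0.2, -0.4]), m, S, sigma2=0.09))
```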
In the more general case where we don’t assume a Gaussian prior and likelihood, there might not be any way to calculate this integral analytically. Instead, numerical methods must be used to approximate it, making it more computationally intensive to produce each prediction. This might be one of the reasons that the fully-Bayesian approach is perhaps less often used in practice.
Conclusion
The classic machine learning task of regression, often viewed as merely fitting a function and obtaining point predictions, can also be viewed as fitting a probability distribution from which predictions can be sampled.
When taking this view, we need to contend with (at least) three common approaches for fitting the parameters of this distribution: Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP), and Bayesian regression. In this post, we saw how these techniques are all interrelated within the same probabilistic framework.
MLE focuses on finding parameters that maximize the likelihood of the observed data, while MAP extends MLE to incorporate prior beliefs about the model parameters. We saw how under some reasonable assumptions, MAP naturally leads to L2 regularization (known as weight decay in other contexts). By incorporating a prior, MAP mitigates the tendency of MLE to overfit. Bayesian regression takes this a step further by accounting for the full posterior distribution of the parameters, whereas MLE and MAP only consider a single point on this distribution.
References
The following sources helped me understand these topics.
- Pattern Recognition and Machine Learning (Bishop, 2006)
- Machine learning: A Probabilistic Perspective (Murphy, 2012)