Regression with Gaussian Likelihood
In a future post, I would like to implement a (deep) quantile-regression model, since I have never tried that before. To warm up, let us start with a basic Gaussian regression.
The plot above represents a simple 1D dataset. To build a predictive model, we could simply try to fit a regression of the type

$$y = f(x) + \varepsilon$$

for some unknown function $f$. To implement this, one could parametrize the function as $f_\theta$ for some parameter $\theta$ (e.g. a neural net) and minimize the standard MSE loss,

$$\mathcal{L}_{\text{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - f_\theta(x_i) \big)^2.$$
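As a minimal sketch of this baseline (assuming PyTorch; the toy dataset below is made up for illustration and simply stands in for the 1D data plotted above):

```python
import torch
import torch.nn as nn

# Hypothetical toy data standing in for the plotted 1D dataset:
# x and y are (n, 1) tensors.
x = torch.linspace(-3.0, 3.0, 200).unsqueeze(1)
y = torch.sin(2.0 * x) + 0.3 * torch.randn_like(x)

# Parametrize f_theta as a small neural net.
f = nn.Sequential(nn.Linear(1, 50), nn.Tanh(), nn.Linear(50, 1))
opt = torch.optim.SGD(f.parameters(), lr=1e-2)

for step in range(5_000):
    opt.zero_grad()
    loss = ((y - f(x)) ** 2).mean()  # standard MSE loss
    loss.backward()
    opt.step()
```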
This simple regression setting does not capture the fact that the uncertainty is much higher in some places than in others. A better option is to use a model of the type

$$y \mid x \sim \mathcal{N}\big( \mu(x), \sigma^2(x) \big),$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Here $\mu(\cdot)$ and $\sigma(\cdot)$ are two unknown functions. In other words, one can make the variance term depend on the covariate $x$, which is the standard approach in this type of situation. Naturally, any other distribution (e.g. a Student's $t$-distribution) could be used instead. To implement this idea, one can use a neural net with parameter $\theta$ that takes $x$ as input and spits out the pair $(\mu_\theta(x), \sigma_\theta(x))$. Maximum Likelihood Estimation boils down to minimizing the negative log-likelihood

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log \sigma_\theta(x_i) + \frac{\big( y_i - \mu_\theta(x_i) \big)^2}{2\, \sigma_\theta^2(x_i)} \right],$$

up to an additive constant.
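Here is one possible sketch of this (again in PyTorch, reusing the hypothetical toy `x, y` from the snippet above; the names `GaussianNet` and `gaussian_nll` are mine): a network with a shared hidden layer and two output heads for $\mu_\theta(x)$ and $\sigma_\theta(x)$, where a softplus keeps the standard deviation positive, trained by minimizing the loss above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianNet(nn.Module):
    """Maps x to the pair (mu(x), sigma(x)); a softplus keeps sigma positive."""

    def __init__(self, hidden=50):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, 1)
        self.sigma_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        # Small epsilon keeps sigma strictly positive for numerical stability.
        return self.mu_head(h), F.softplus(self.sigma_head(h)) + 1e-6

def gaussian_nll(mu, sigma, y):
    # Negative log-likelihood of N(mu, sigma^2), dropping the constant term.
    return (torch.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2).mean()

# Training with plain SGD, reusing the toy (x, y) from the previous snippet.
net = GaussianNet(hidden=50)  # single hidden layer
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
for step in range(20_000):
    opt.zero_grad()
    mu, sigma = net(x)
    loss = gaussian_nll(mu, sigma, y)
    loss.backward()
    opt.step()
```

Note that PyTorch also ships `torch.nn.GaussianNLLLoss`, which implements essentially the same objective (parametrized by the variance rather than the standard deviation).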
Using a neural net with only one hidden layer and a basic SGD optimizer gives the following learning trajectory:
Not bad, although it is known that this simple approach can lead to optimization issues in slightly more complex situations [1, 2]. For example, it can be hard to escape the local minimum shown below.
In another post, I'll try to implement a deep quantile-regression model…
References
1. Stirn, A., & Knowles, D. A. (2020). Variational variance: Simple, reliable, calibrated heteroscedastic noise variance parameterization. arXiv preprint arXiv:2006.04910.
2. Skafte, N., Jørgensen, M., & Hauberg, S. (2019). Reliable training and estimation of variance networks. Advances in Neural Information Processing Systems, 32.