Regression analysis is, without a doubt, one of the most widely used machine-learning models for prediction and forecasting. There are many reasons for the popularity of regression analysis. The most significant of these are the following:

  1. It is simple, and who doesn’t love simplicity?
  2. It is usually very accurate, provided you have the right features and a large amount of data.
  3. It is easy to interpret. For some applications, such as medicine, interpretability is more important than accuracy!

This model is widely used, but is it used correctly? When someone uses a linear model, I usually ask three questions: How did you decide that regression is the right approach, and are you sure you picked the right function? How did you test the quality of the model? Finally, how confident are you in the predictions from your linear model?

In this post, we address these questions with a three-step process. We use simple linear regression as a demonstration, but the techniques apply readily to any regression (and, to some extent, to any machine-learning algorithm). These steps are necessary, though not sufficient, when working with a regression model, yet they are often forgotten.

For this post, we use a modified version of the “cats” dataset in R, which is available here. The dataset contains “the heart (Hwt) and body (Bwt) weights of samples of male and female cats”. The cats were all adult, over 2 kg body weight. For this experiment we ignore the sex of the cats. The R code is also available here; you can download it and experiment with it.

Before Training: Visualization

Visualization is the initial step in any data science project. You need to plot your data in different dimensions to get a sense of their behavior and structure. At Base, developers mainly use ggplot (in both Python and R) to programmatically plot data before beginning an analysis. Why do we need visualization rather than simple summary statistics of the dataset? Anscombe’s quartet is a famous example that shows the necessity of visualization. It contains four datasets whose summary statistics (mean, variance, and correlation) are nearly identical, yet they look quite different when plotted.

So, let’s see if the cats dataset is a good candidate for linear modeling.

During Training: Goodness of Fit (GoF)

Aside from cross-validation and the widely used RMSE metric, GoF is another measure that you need to consider when you are applying a machine-learning model, especially a regression model. It is used to determine how well the model fits the training data. The two GoF metrics most often used for regressions are R² and the p-value. R² is the fraction of the variance in the output that the model explains from the input variable. If R² is zero, the model explains none of that variance; this means there is no linear relationship, not necessarily that input and output are independent. A high value of R² reveals the strength of the predictive model. The p-value tests the null hypothesis that the slope is zero, so a small p-value indicates that the linear relationship is unlikely to be due to chance.

To demonstrate, consider the following dataset. There is clearly no linear relation between x and y. What happens if we fit y=ax+b?

R-squared: 0.0002432437
p-value: 0.9143928

Clearly, the linear model y = ax + b is not a good fit for these data. Instead, you could try fitting y = ax² + bx + c.

Now, let’s look at the GoF for our modified cats dataset:

R-squared: 0.5445796
p-value: 5.032721e-26

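An R² of about 0.54 together with a very small p-value indicates a significant linear relationship. This is sufficient, so let’s build the model and move on to the prediction.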

After Training: Confidence Interval

Now that the regression model is correctly trained, it’s time to use the model for prediction. This might seem simple at first, but there is some complexity to it. For each given value of x, the prediction is still somewhat uncertain. In our example, if we would like to predict the heart weight of a cat, given a body weight of 3.0 kg, the resulting values range from 11.6 to 12.7 g. This is where we need to look at the confidence interval for the prediction.

The confidence interval is defined as a band around the predicted value that shows the amount of uncertainty in a prediction. The narrower the band is, the more certain you are about the prediction’s results. The confidence level is chosen by the user, and it is application specific. Most people work with a confidence level between 80% and 95%. The width of the band depends on three factors:

  1. The variance of the output variable (y) in the training set: with low variance of y, the prediction is less uncertain.
  2. The number of data points: with more data points, the prediction is more certain.
  3. The value of x for prediction: the band is narrowest near the mean of the training x values and widens as x moves away from it.

Let’s look at a 95% confidence interval for our prediction of Bwt=3.1 and Bwt=2.2, in the cats dataset.

Here, the band for Bwt=3.1 is narrower.

Bwt: 3.1
   Expected Hwt: 12.23026
   Low Bound: 11.86861
   Upper Bound: 12.59191

Bwt: 2.2
   Expected Hwt: 8.729203
   Low Bound: 8.308565
   Upper Bound: 9.149842

Summary

As shown above, these three steps are necessary when working with regression models. Together, they show how well the regression model fits the data and how much confidence you can place in its predictions.

Cover photo by Michael Coghlan licensed with Attribution-NonCommercial 2.0 Generic License

  • Justin Veenstra

    To be absolutely precise, confidence intervals in frequentist statistics are strictly for estimated parameters. What you are calling confidence interval for prediction is usually just called a prediction interval. Slightly pedantic comment, perhaps, but there is enough confusion about statistics and like matters that using the correct terminology is important.

    • When it comes to the prediction of an ML model, there is a difference between a confidence interval and a prediction interval. In purely statistical terms, the confidence interval is the amount of uncertainty in E[Y|X=x] (uncertainty in the mean response at x), and the prediction interval is the amount of uncertainty in Y given X = x (uncertainty in a new observation at x).

      Please refer to this link for more detail:
      http://statweb.stanford.edu/~susan/courses/s141/horegconf.pdf

      • Justin Veenstra

        That is exactly my point. Confidence intervals are for parameters. Yes, there is a mean level for each x value, the value E(Y|X = x). This is the regression line, as you know. Yes, there is, for any a in (0, 1), a 100(1 – a)% confidence interval for this mean for any x. However, and this is important, this is not the confidence we have for a prediction. As is stated in your link, the standard error of this mean value has to be added to the random error associated with the new point.
        Because there is often confusion between the confidence interval associated with the mean and the prediction interval associated with a new point, the word confidence is usually left out of prediction interval.
        And to be perfectly clear, the word prediction is not associated with the confidence interval around the mean.
        That you were mixing the two terms is exactly what I was commenting on.

        • timcdlucas

          To illustrate this, only 2 out of 12 points lie within the Bwt = 3.1 confidence interval. If this were a prediction interval, we would expect 95% of data points to lie within the interval (11 out of 12 or so).

        • I believe we are on the same page. It is just that when these terms were adopted by ML, their meaning shifted slightly. When we talk about a confidence interval, we usually mean a prediction interval.

  • Brandon Rohrer

    Great post! It looks like you may have switched your with- and without-confidence interval plots. Thanks for the advocacy for visualization as a first step.

    • Although everybody talks about the importance of visualization, few actually use it as a first step.
      The first plot is confidence at training. The second is the prediction confidence at certain points.