Regression analysis is, without a doubt, one of the most widely used machine-learning models for prediction and forecasting. There are many reasons for the popularity of regression analysis. The most significant of these are the following:
- It is simple, and who doesn’t love simplicity?
- It is usually very accurate, provided you have the right features and a large amount of data.
- It is easy to interpret. For some applications, such as medical ones, interpretability is more important than accuracy!
This model is widely used, but is it used correctly? When someone uses a linear model, I usually ask the following three questions: How did you decide that regression is the right approach to pursue, and are you sure you picked the right function? How did you test the quality of the model? Finally, how confident are you in the predictions from your linear model?
In this post, we address these questions using the following three-step process. We use simple linear regression as a demonstration, but the techniques can readily be applied to any regression (and, to some extent, any machine-learning algorithm). These steps are necessary, though not sufficient, when working with a regression model, yet we often forget about them.
For this post, we use a modified version of the “cats” dataset in R, which is available here. The dataset contains “the heart (Hwt) and body (Bwt) weights of samples of male and female cats”. The cats were all adults, over 2 kg in body weight. For this experiment, we ignore the sex variable. The R code is also available here; you can download it and play around with it.
Before Training: Visualization
Visualization is the initial step in any data science project. You need to plot your data in different dimensions to get a sense of its behavior and structure. At Base, developers mainly use ggplot (in both Python and R) to programmatically plot data before beginning an analysis. Why do we need visualization rather than simple statistical tests on the dataset? Anscombe’s quartet is a famous example that shows the necessity of visualization: it contains four datasets that are statistically very similar (in mean, variance, and correlation) but look quite different when plotted.
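To see why the summary statistics alone are misleading, here is a quick sketch (in Python rather than the post's R) comparing two of Anscombe's four datasets; the numbers below are the published quartet values:

```python
import numpy as np

# Anscombe's quartet, datasets I and II (Anscombe, 1973)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    corr = np.corrcoef(x, y)[0, 1]
    # Mean, sample variance, and correlation come out nearly identical,
    # even though dataset I is roughly linear and dataset II is a parabola.
    print(f"Dataset {name}: mean={y.mean():.2f}  var={y.var(ddof=1):.2f}  corr={corr:.3f}")
```

A scatter plot of each dataset immediately reveals the difference that these statistics hide.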
So, let’s see if the cats’ dataset is a good dataset for linear modeling.
During Training: Goodness of Fit (GoF)
Aside from cross-validation and the widely used RMSE metric, GoF is another measure that you need to consider when applying a machine-learning model, especially a regression model. It is used to determine how well the model fits the training data. The two GoF metrics most often used for regression are R² and the p-value. R² measures how much of the variation in the output is explained by the input variable and the model. If R² is zero, the model explains none of the variance in the output, so it tells you nothing about the output; a high value of R² reveals the strength of the predictive model.
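The post computes these metrics in R; a minimal Python sketch of the same idea, using hypothetical body/heart weight pairs standing in for the cats data, could use `scipy.stats.linregress`:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical stand-in data (the post's actual dataset is an R CSV)
bwt = np.array([2.0, 2.2, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 3.9])
hwt = np.array([7.0, 8.1, 9.0, 9.6, 10.9, 11.2, 12.4, 13.1, 14.6])

fit = linregress(bwt, hwt)
# rvalue is the Pearson correlation; square it for R-squared.
# pvalue tests the null hypothesis that the slope is zero.
print(f"R-squared: {fit.rvalue**2:.4f}  p-value: {fit.pvalue:.3g}")
```

In R, the same numbers come out of `summary(lm(Hwt ~ Bwt))`.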
To demonstrate, consider the following dataset. There is clearly no linear relation between x and y. What happens if we fit y = ax + b?
R-squared: 0.0002432437 p-value: 0.9143928
Clearly, the linear model y = ax + b is not a good fit for these data. Instead, you could try fitting y = ax² + bx + c.
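A quick way to see this effect is to generate synthetic parabolic data (an illustration, not the post's dataset) and compare the two fits; the quadratic can be fit with `numpy.polyfit`:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x**2 + rng.normal(0, 0.3, x.size)  # parabola plus noise: no linear trend

# Linear fit: R-squared should be close to zero on symmetric parabolic data
lin = linregress(x, y)
print(f"linear R-squared: {lin.rvalue**2:.4f}")

# Quadratic fit y = a*x^2 + b*x + c via least squares
a, b, c = np.polyfit(x, y, 2)
resid = y - (a * x**2 + b * x + c)
r2 = 1 - resid.var() / y.var()  # R-squared = 1 - residual variance / total variance
print(f"quadratic R-squared: {r2:.4f}")
```

The low linear R² does not mean x and y are unrelated; it only means the relation is not linear, which is exactly why the choice of function matters.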
Now, let’s look at the GoF for our modified cats’ dataset:
R-squared: 0.5445796 p-value: 5.032721e-26
An R² of about 0.54 with a very small p-value is sufficient here, so let’s build the model and move on to the prediction.
After Training: Confidence Interval
Now that the regression model is correctly trained, it’s time to use the model for prediction. This might seem simple at first, but there is some complexity to it. Clearly, for each given value of x, the prediction is still somewhat uncertain. In our example, if we would like to predict the heart weight of a cat with a body weight of 3.0 kg, the resulting values range from 11.6 g to 12.7 g. This is where we need to look at the confidence interval for the prediction.
The confidence interval is defined as a band around the predicted value that shows the amount of uncertainty in a prediction. The narrower the band is, the more certain you are about the prediction’s results. The confidence level is chosen by the user and is application specific; most people work with an 80%–95% confidence interval. The width of the band depends on three parameters:
- The variance of the output variable (y) in the training set. With low variance of y, the prediction is less uncertain.
- Number of data points: with more data points, the prediction is more certain.
- The value of x for prediction: predictions near the mean of x are more certain than predictions at the extremes.
Let’s look at a 95% confidence interval for our prediction of Bwt=3.1 and Bwt=2.2, in the cats dataset.
Here, the band for Bwt=3.1 is narrower.
Bwt: 3.1  Expected Hwt: 12.23026  Lower Bound: 11.86861  Upper Bound: 12.59191
Bwt: 2.2  Expected Hwt: 8.729203  Lower Bound: 8.308565  Upper Bound: 9.149842
As shown above, these three steps are necessary when working with regression models. The metrics show how well the regression model fits the data and how much confidence we can place in its predictions.