In Machine Learning, our main goal is to minimize the error, which is defined by the loss function. Every type of algorithm has a different way of measuring that error. In this article I’ll go through some basic loss functions used in regression algorithms and explain why they are the way they are. Let’s begin.
Suppose we have two loss functions. They will generally have different minima, so if you optimize the wrong loss function you arrive at the wrong solution, that is, the wrong optimal values for the weights. In other words, you are solving the wrong optimization problem. So we first need to find the appropriate loss function to minimize.
The mean squared error (MSE) is a common way to measure the prediction accuracy of a model. It is calculated as:
MSE = (1/n) * Σ(actual – prediction)²
The lower the value for MSE, the better a model is able to predict values accurately.
- How to Calculate MSE in Python
- Introduction
- How to Interpret RMSE for a Regression Model
- Comparing RMSE Values from Different Models
- Which Metric Should You Use?
- How to Use RMSE in Practice
- Huber Loss
- Relative Standard Deviation (RSD) / Coefficient of Variation (CV)
- R Square/Adjusted R Square
- Metrics
- MAE
- MAPE
- SMAPE
- R2(R-squared)
- R-squared as Pearson’s Correlation Coefficient
- RMSLE
- Other metrics
- R Square, Adjusted R Square, MSE, RMSE, MAE
- There are 3 main metrics for model evaluation in regression
- Coefficient of Determination (R2)
- Range of prediction
- Mean/Median of prediction
- Standard Deviation of prediction
- A naive way of evaluating a model is to look only at the R-squared value. If I get an R-squared of 95%, is that good enough? In this blog, let us try to understand the ways to evaluate your regression model.
- Overall Recommendation/Conclusion
How to Calculate MSE in Python
We can create a simple function to calculate MSE in Python:
import numpy as np

def mse(actual, pred):
    # convert inputs to arrays so the arithmetic is element-wise
    actual, pred = np.array(actual), np.array(pred)
    return np.square(np.subtract(actual, pred)).mean()
We can then use this function to calculate the MSE for two arrays: one that contains the actual data values and one that contains the predicted data values.
The mean squared error (MSE) for this model turns out to be 17.0.
In practice, the root mean squared error (RMSE) is more commonly used to assess model accuracy. As the name implies, it’s simply the square root of the mean squared error.
We can define a similar function to calculate RMSE:
import numpy as np

def rmse(actual, pred):
    # convert inputs to arrays so the arithmetic is element-wise
    actual, pred = np.array(actual), np.array(pred)
    return np.sqrt(np.square(np.subtract(actual, pred)).mean())
We can then use this function to calculate the RMSE for two arrays: one that contains the actual data values and one that contains the predicted data values.
The root mean squared error (RMSE) for this model turns out to be 4.1231.
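The arrays behind these two figures are not shown here, so the sketch below uses illustrative values (an assumption, chosen so that the results match the reported MSE of 17.0 and RMSE of 4.1231). It is written in plain Python so it runs without NumPy; the logic is the same as in the two functions above.

```python
import math

def mse(actual, pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Square root of the mean squared error."""
    return math.sqrt(mse(actual, pred))

# Illustrative values (assumed, not taken from the article)
actual = [12, 13, 14, 15, 15, 22, 27]
pred = [11, 13, 14, 14, 15, 16, 18]

mse_value = mse(actual, pred)
rmse_value = rmse(actual, pred)

print(mse_value)             # 17.0
print(round(rmse_value, 4))  # 4.1231
```

The squared errors here are 1, 0, 0, 1, 0, 36 and 81, so two poorly predicted points account for almost all of the MSE.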
Introduction
As part of my role within the automated machine learning space with H2O.ai and Driverless AI, I have often seen people struggle to find the right optimization metric for their data science problems. This is even more challenging in regression problems, where the errors are often not bounded the way they are in probabilistic modeling. One would expect a “good” model to achieve superior results on all available metrics, but this is a misconception. At the beginning of the optimization process most metrics do tend to improve together; after a while, however, they reach a point where improvement in one metric may come at the cost of another. I have encountered this multiple times when observing Mean Absolute Error (MAE) and Mean Squared Error (MSE). When I select an MAE optimizer for my model, both MSE and MAE become smaller (better) in the first iterations of the algorithm, but after a while only MAE improves while MSE becomes worse. In other words, you get the most gain by optimizing for the metric you are most interested in; otherwise, you might be getting suboptimal results.
In this article, I will go through different common regression metrics, discuss some pros and cons of each, and give my personal recommendation for when to prefer one metric over another. For demonstration purposes, I will be using a subset of time series data from this Kaggle competition regarding sales forecasting. I will be predicting weekly sales in different stores and departments for a retailer. The data spans more than 140 weeks, and I will be using the last 26 weeks for testing. I will be using H2O.ai’s Driverless AI to run my time-series experiments. This is a snapshot of the data:

The target’s distribution is right skewed with some fairly high values compared to the mean:

Regression analysis is a technique we can use to understand the relationship between one or more predictor variables and a response variable.
One way to assess how well a regression model fits a dataset is to calculate the root mean square error, which is a metric that tells us the average distance between the predicted values from the model and the actual values in the dataset.
The lower the RMSE, the better a given model is able to “fit” a dataset.
RMSE = √( Σ(Pi – Oi)² / n )
How to Interpret RMSE for a Regression Model
Suppose we would like to build a regression model that uses “hours studied” to predict “exam score” of students on a particular college entrance exam.

Exam Score = 75.95 + 3.08*(Hours Studied)
We can then use this equation to predict the exam score of each student, based on how many hours they studied:

We can then calculate the squared difference between each predicted exam score and the actual exam score. Then we can take the square root of the mean of these differences:

The RMSE for this regression model turns out to be 5.681.
Recall that the residuals of a regression model are the differences between the observed data values and the predicted values from the model.
Residual = (Pi – Oi)
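And recall that the RMSE of a regression model is calculated as:
RMSE = √( Σ(Pi – Oi)² / n )
This means that the RMSE is the square root of the mean of the squared residuals, which equals the square root of the variance of the residuals when the residuals average to zero.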
This is a useful value to know because it gives us an idea of the average distance between the observed data values and the predicted data values.
This is in contrast to the R-squared of the model, which tells us the proportion of the variance in the response variable that can be explained by the predictor variable(s) in the model.
Comparing RMSE Values from Different Models
The RMSE is particularly useful for comparing the fit of different regression models.
For example, suppose we want to build a regression model to predict the exam score of students and we want to find the best possible model among several potential models.
Suppose we fit three different regression models and find their corresponding RMSE values:
Model 3 has the lowest RMSE, which tells us that it’s able to fit the dataset the best out of the three potential models.
Regression models are used to quantify the relationship between one or more predictor variables and a response variable.
Whenever we fit a regression model, we want to understand how well the model is able to use the values of the predictor variables to predict the value of the response variable.
MSE: A metric that tells us the average squared difference between the predicted values and the actual values in a dataset. The lower the MSE, the better a model fits a dataset.
MSE = Σ(ŷi – yi)² / n
RMSE: A metric that tells us the square root of the average squared difference between the predicted values and the actual values in a dataset. The lower the RMSE, the better a model fits a dataset.
RMSE = √( Σ(ŷi – yi)² / n )
Notice that the formulas are nearly identical. In fact, the root mean squared error is just the square root of the mean squared error.
Which Metric Should You Use?
When assessing how well a model fits a dataset, we use the RMSE more often because it is measured in the same units as the response variable.
Conversely, the MSE is measured in squared units of the response variable.
To illustrate this, suppose we use a regression model to predict the number of points that 10 players will score in a basketball game.

We would calculate the mean squared error (MSE) as:
The mean squared error is 16. This tells us that the average squared difference between the predicted values made by the model and the actual values is 16.
The root mean squared error (RMSE) would simply be the square root of the MSE:
The root mean squared error is 4. This tells us that the average deviation between the predicted points scored and the actual points scored is 4.
Notice that the interpretation of the root mean squared error is much more straightforward than the mean squared error because we’re talking about ‘points scored’ as opposed to ‘squared points scored.’
How to Use RMSE in Practice
In practice, we typically fit several regression models to a dataset and calculate the root mean squared error (RMSE) of each model.
We then select the model with the lowest RMSE value as the “best” model because it is the one that makes predictions that are closest to the actual values from the dataset.
Note that we can also compare the MSE values of each model, but RMSE is more straightforward to interpret so it’s used more often.
Huber Loss
The Huber loss combines the best properties of MSE and MAE (Mean Absolute Error). It is quadratic for smaller errors and is linear otherwise (and similarly for its gradient). It is identified by its delta parameter:
L_δ(e) = ½·e² if |e| ≤ δ, and δ·(|e| – ½·δ) otherwise, where e = actual – prediction
Huber loss is less sensitive or more robust to outliers in data than the MSE. It’s also differentiable at 0. It’s basically an absolute error, which becomes quadratic when the error is small. How small that error has to be to make it quadratic depends on a hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers.)
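To make the role of delta concrete, here is a minimal sketch of the Huber loss in plain Python, using the standard piecewise definition (half the squared error when the absolute error is at most delta, and delta times the absolute error minus half of delta otherwise). The function name, the default delta and the sample values are assumptions made for illustration only.

```python
def huber_loss(actual, pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for a, p in zip(actual, pred):
        err = abs(a - p)
        if err <= delta:
            total += 0.5 * err ** 2              # MSE-like region
        else:
            total += delta * (err - 0.5 * delta)  # MAE-like region
    return total / len(actual)

# Hypothetical data: the last point is an outlier
actual = [3.0, 5.0, 2.5, 100.0]
pred = [2.5, 5.0, 3.0, 10.0]

huber = huber_loss(actual, pred, delta=1.0)
mse = sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

print(huber)  # 22.4375
print(mse)    # 2025.125
```

The single error of 90 contributes 8,100 to the squared-error sum but only 89.5 to the Huber sum, which is why the Huber loss is described as more robust to outliers.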
While R Square is a relative measure of how well the model fits dependent variables, Mean Square Error is an absolute measure of the goodness for the fit.
Mean Square Error formula
MSE is calculated as the sum of the squared prediction errors (actual output minus predicted output) divided by the number of data points. It gives you an absolute number for how much your predicted results deviate from the actual values. You cannot draw many insights from a single result, but it gives you a real number to compare against other models’ results and helps you select the best regression model.
Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly than MSE because, firstly, the MSE value can sometimes be too big to compare easily. Secondly, MSE is calculated from the squared error, so taking the square root brings it back to the same scale as the prediction error and makes it easier to interpret.
from sklearn.metrics import mean_squared_error
import math
print(mean_squared_error(Y_test, Y_predicted))
print(math.sqrt(mean_squared_error(Y_test, Y_predicted)))
# MSE: 2017904593.23
# RMSE: 44921.092965684235
MSE can be calculated in Python using Sklearn Package
Relative Standard Deviation (RSD) / Coefficient of Variation (CV)
There is a saying that apples shouldn’t be compared with oranges; in other words, don’t compare two items or groups of items that are practically incomparable. But the lack of comparability can be overcome if the two items or groups are somehow standardized or brought onto the same scale. For instance, when comparing the variability of two groups that are overall very different, such as the sizes of bluefin tuna and blue whales, the coefficient of variation (CV) is the method of choice: the CV is simply the standard deviation of each group standardized by its group mean.
The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of the dispersion of a probability distribution or frequency distribution. It helps us understand how spread out the data is in two different tests.
Standard deviation is the most common measure of variability for a single data set. But why do we need yet another measure, such as the coefficient of variation? Well, comparing the standard deviations of two data sets with very different means or units is not meaningful, but comparing their coefficients of variation is.
Coefficient of Variation (CV) Formula
CV = standard deviation / mean
from scipy.stats import variation
variation(data)
For example, consider two different datasets:
Data 1: Mean1 = 120000 : SD1 = 2000
Data 2: Mean2 = 900000 : SD2 = 10000
Let us calculate CV for both datasets
CV1 = SD1/Mean1 = 1.7%
CV2 = SD2/Mean2 = 1.1%
We can conclude that Data 1 is more spread out relative to its mean than Data 2, even though its standard deviation is smaller.
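The same comparison can be sketched in a few lines of Python. The first part recomputes the two ratios from the summary statistics above (2000/120000 is about 1.67% and 10000/900000 is about 1.11%); the second part computes the CV from raw values with the standard library. The helper name and the sample list are assumptions for illustration, and the helper uses the population standard deviation, which is also what scipy.stats.variation does by default (ddof=0).

```python
import statistics

def coefficient_of_variation(data):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(data) / statistics.mean(data)

# CV from the summary statistics given in the text
cv1 = 2000 / 120000
cv2 = 10000 / 900000
print(f"{cv1:.1%}, {cv2:.1%}")  # 1.7%, 1.1%

# CV from raw values (hypothetical sample: mean 5, standard deviation 2)
sample = [2, 4, 4, 4, 5, 5, 7, 9]
cv_sample = coefficient_of_variation(sample)
print(cv_sample)  # 0.4
```

Note that the CV only makes sense for data measured on a ratio scale with a mean well away from zero, since the mean is in the denominator.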
R Square/Adjusted R Square
R Square measures how much of the variability in the dependent variable can be explained by the model. In simple linear regression it is the square of the Correlation Coefficient (R), and that is why it is called R Square.
R square formula
R² = 1 – Σ(yi – ŷi)² / Σ(yi – ȳ)²
R Square is calculated as one minus the sum of squared prediction errors divided by the total sum of squares, where the total sum of squares replaces the model’s prediction with the mean of the actual values. For a least-squares fit with an intercept, R Square on the training data lies between 0 and 1, and a bigger value indicates a better fit between prediction and actual value.
R Square is a good measure of how well the model fits the dependent variable. However, it does not take overfitting into consideration. If your regression model has many independent variables, the model may be too complicated: it may fit the training data very well but perform badly on testing data. That is why Adjusted R Square was introduced: it penalizes additional independent variables added to the model and adjusts the metric to guard against overfitting.
# Example on R_Square and Adjusted R Square
import statsmodels.api as sm
X_addC = sm.add_constant(X)
result = sm.OLS(Y, X_addC).fit()
print(result.rsquared, result.rsquared_adj)
# 0.79180307318 0.790545085707
In Python, you can calculate R Square using the statsmodels or Sklearn package
From the sample model, we can interpret that around 79% of the variability in the dependent variable can be explained by the model, and adjusted R Square is roughly the same as R Square, meaning the fit is not being inflated by superfluous independent variables.
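For readers who want to see what sits behind those two numbers, here is a small hand-rolled sketch of R Square and Adjusted R Square. It uses the usual definitions (one minus residual sum of squares over total sum of squares, and the standard adjustment for the number of predictors); the function names and the data are hypothetical and are not the model from the example above.

```python
def r_squared(actual, pred):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of actual."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R Square for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical actual values and predictions from a one-predictor model
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
pred = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(actual, pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=1)

print(round(r2, 4))      # 0.9885
print(round(adj_r2, 4))  # 0.9847
```

With only five samples the adjustment is visible even for a single predictor; on a large dataset with few predictors the two values are nearly identical, as in the statsmodels output above.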
So let’s take the squares instead of the absolutes. The loss function will now become:
SSE = Σ(yi – ŷi)²
which is differentiable at all points and gives non-negative errors. But you could ask why we cannot go for higher orders, like the 4th order. A 4th-order loss function would look like:
L = Σ(yi – ŷi)⁴
Its gradient is a cubic in the weights, so setting it to zero no longer gives a simple linear system to solve, and raising errors to the fourth power weights large errors even more heavily than squaring does. (For a linear model this loss is still convex, so it does not introduce extra local minima.) So let’s stick with the squares themselves.
4. Mean Squared Errors (MSE):
Now consider using SSE as our loss function. If we have a dataset of, say, 100 points, our SSE might be 200. If we increase the data to 500 points, our SSE will increase, because the squared errors now add up over 500 data points; let’s say it becomes 800. If we increase the number of data points again, our SSE will increase further. Fair enough? Absolutely not!
The error should not grow just because we have more data. If anything, it should decrease as we increase our sample, since our estimates become more precise (think of the sampling distribution getting narrower and narrower). With SSE the complete opposite happens. Here, finally, comes in our warrior: Mean Squared Error. Its expression is:
MSE = (1/n) Σ(yi – ŷi)²
We take the average, or mean, of the SSE, so the error no longer grows with the number of data points. In the example above, MSE goes from 200/100 = 2.0 to 800/500 = 1.6.
Here, as you can see, the error decreases as our algorithm gains more and more experience. The Mean Squared Error is the default metric for evaluating the performance of most regression algorithms, be it in R, Python or even MATLAB.
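A quick sketch of the point about dataset size: if the typical per-point error stays the same, SSE grows in proportion to the number of points while MSE does not move. The residual pattern below is a made-up assumption used only to keep the per-point error constant across the two dataset sizes.

```python
def sse(residuals):
    """Sum of squared errors."""
    return sum(r ** 2 for r in residuals)

def mse(residuals):
    """Mean of squared errors: SSE divided by the number of points."""
    return sse(residuals) / len(residuals)

pattern = [1.0, -1.0, 2.0, -2.0]  # hypothetical residuals
small = pattern * 25              # 100 data points
large = pattern * 125             # 500 data points

print(sse(small), sse(large))  # 250.0 1250.0
print(mse(small), mse(large))  # 2.5 2.5
```

So a lower MSE on a larger dataset reflects a genuinely better fit per point, not just a change in how many errors were added up.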
5. Root Mean Squared Error (RMSE):
The only issue with MSE is that the order of the loss is higher than that of the data: my data is of order 1, while the loss function, MSE, has an order of 2 (it is in squared units). So we cannot directly relate the data to the error. Hence, we take the root of the MSE, which is the Root Mean Squared Error:
RMSE = √MSE = √( (1/n) Σ(yi – ŷi)² )
Here, we are not changing the solution: the weights that minimize MSE also minimize RMSE. All we have done is reduce the order of the loss function by taking the root.
SE was certainly not the loss function we’d want to use. So let’s change it a bit to overcome its shortcoming. Let’s just take the absolute values of the errors for all iterations. This should solve the problem... right? Or no? This is what the loss function would look like:
L = Σ|yi – ŷi|
So now the error terms won’t cancel each other out and will actually add up. Is there any potential problem with this function? Well, yes. This loss function is not differentiable at 0. The graph of the loss function will be:
(Figure: graph of the absolute-error loss; the Y axis is the loss function.)
The derivative does not exist at 0. To find the optimum point we would normally differentiate the function and set the derivative to 0, and that does not work cleanly here, so we cannot solve for the solution in closed form.
Normalizing the RMSE facilitates the comparison between datasets or models with different scales. You will find, however, various different methods of RMSE normalizations in the literature:
You can normalize by
If the response variables have few extreme values, choosing the interquartile range is a good option as it is less sensitive to outliers.
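The list of normalization options is not reproduced above, so the sketch below assumes four commonly used choices: the mean, the range (max minus min), the standard deviation and the interquartile range of the observed values. The function names and the data are hypothetical, and different sources define NRMSE differently, so treat this as one possible convention rather than the definitive one.

```python
import math
import statistics

def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def normalized_rmse(actual, pred, method="mean"):
    """RMSE divided by a scale taken from the observed values."""
    if method == "mean":
        scale = statistics.mean(actual)
    elif method == "range":
        scale = max(actual) - min(actual)
    elif method == "sd":
        scale = statistics.stdev(actual)
    elif method == "iqr":
        q1, _, q3 = statistics.quantiles(actual, n=4)
        scale = q3 - q1
    else:
        raise ValueError(f"unknown method: {method}")
    return rmse(actual, pred) / scale

# Hypothetical observations and predictions
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
pred = [12.0, 18.0, 33.0, 37.0, 50.0]

nrmse_mean = normalized_rmse(actual, pred, "mean")
nrmse_range = normalized_rmse(actual, pred, "range")
nrmse_sd = normalized_rmse(actual, pred, "sd")
nrmse_iqr = normalized_rmse(actual, pred, "iqr")
print(round(nrmse_mean, 4), round(nrmse_range, 4))  # 0.076 0.057
```

Because each choice of scale gives a different number, NRMSE values are only comparable when the same normalization is used for every model or dataset.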
RMSEP/standard deviation is called Relative Root Mean Squared Error (RRMSEP)
1/RRMSEP is also a metric. A value greater than 2 is considered good.
There are also terms like, Standard Error of Prediction(SEP) and Ratio of the Standard Error of Prediction to Standard Deviation (RPD) which are mainly used in chemometrics.
I hope this blog helped you to understand different metrics to evaluate your regression model. I have used multiple sources to understand and write this article. Thank you for your time.
Mean Absolute Error (MAE) is similar to Mean Square Error (MSE). However, instead of summing the squares of the errors as in MSE, MAE sums the absolute values of the errors.
Mean Absolute Error formula
MAE = (1/n) Σ|yi – ŷi|
Compared to MSE or RMSE, MAE is a more direct representation of the sum of the error terms. MSE penalizes big prediction errors more heavily by squaring them, while MAE weights every error in proportion to its size.
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(Y_test, Y_predicted))
# MAE: 26745.1109986
MAE can be calculated in Python using Sklearn Package
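To see the difference in penalization, the sketch below compares two hypothetical sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single point. The values are made up for illustration.

```python
import math

def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

actual = [100, 100, 100, 100]
even_errors = [110, 90, 110, 90]      # four errors of 10
one_big_error = [100, 100, 100, 140]  # a single error of 40

print(mae(actual, even_errors), mae(actual, one_big_error))    # 10.0 10.0
print(rmse(actual, even_errors), rmse(actual, one_big_error))  # 10.0 20.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the same total error is concentrated in one prediction.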
Relative Absolute Error (RAE) is a way to measure the performance of a predictive model. RAE is not to be confused with relative error, which is a general measure of precision or accuracy for instruments like clocks, rulers, or scales. It is expressed as a ratio, comparing a mean error (residual) to errors produced by a trivial or naive model. A good forecasting model will produce a ratio close to zero; A poor model (one that’s worse than the naive model) will produce a ratio greater than one.
It is very similar to the relative squared error in the sense that it is also relative to a simple predictor, which is just the average of the actual values. In this case, though, the error is just the total absolute error instead of the total squared error. Thus, the relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor.
Mathematically, the relative absolute error, Ei, of an individual model i is evaluated by the equation:
Relative Absolute Error (RAE) Formula
Ei = Σj |P(ij) – Tj| / Σj |Tj – Tbar|
where P(ij) is the value predicted by the individual model i for record j (out of n records); Tj is the target value for record j, and Tbar is given by the formula:
Tbar = (1/n) Σj Tj
For a perfect fit, the numerator is equal to 0 and Ei = 0. So, the Ei index ranges from 0 to infinity, with 0 corresponding to the ideal.
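The description above translates directly into a few lines of Python: total absolute error of the model divided by the total absolute error of the simple predictor that always outputs the mean of the targets. The function name and the sample values are assumptions for illustration.

```python
def relative_absolute_error(targets, preds):
    """Total absolute error relative to that of a mean-only predictor."""
    mean_target = sum(targets) / len(targets)
    numerator = sum(abs(p - t) for p, t in zip(preds, targets))
    denominator = sum(abs(t - mean_target) for t in targets)
    return numerator / denominator

# Hypothetical targets and predictions
targets = [2.0, 4.0, 6.0, 8.0]
preds = [2.5, 3.5, 6.5, 7.5]
naive = [5.0, 5.0, 5.0, 5.0]  # always predict the mean of the targets

rae_model = relative_absolute_error(targets, preds)
rae_naive = relative_absolute_error(targets, naive)
print(rae_model)  # 0.25
print(rae_naive)  # 1.0
```

The mean-only predictor scores exactly 1 by construction, which is why a ratio above 1 signals a model that is worse than the naive baseline. The ratio is undefined when all targets are identical, because the denominator is then zero.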
In statistical modeling and particularly regression analyses, a common way of measuring the quality of the fit of the model is the RMSE (also called Root Mean Square Deviation), given by
RMSE = √( (1/n) Σ(yi – ŷi)² )
from math import sqrt
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(actual, predicted)
rmse = sqrt(mse)
where yi is the ith observation of y and ŷi the predicted y value given the model. If the predicted responses are very close to the true responses the RMSE will be small. If the predicted and true responses differ substantially, at least for some observations, the RMSE will be large. A value of zero would indicate a perfect fit to the data. Since the RMSE is measured on the same scale, with the same units as y, one can expect roughly 68% of the y values to be within 1 RMSE of their predictions, provided the errors are normally distributed.
NOTE: RMSE is concerned with deviations from the true value whereas the standard deviation S is concerned with deviations from the mean.
So calculating the MSE helps to compare different models that are based on the same y observations. But what if
The first two points are typical issues when comparing ecological indicator performances and the latter, so-called validation set approach, is pretty common in statistical and machine learning. One solution to overcome these barriers is to calculate the Normalized RMSE.
Metrics
The Root Mean Squared Error (RMSE) or Mean Squared Error (MSE, which is basically the same as RMSE without the square root) is the most popular regression metric. If there was a king/queen of regression metrics, this would have been it! This is how it is computed:
RMSE = √( (1/n) Σ(ŷi – yi)² )
Where ŷi is the prediction and yi the actual target value. In other words, you square all the errors (or residuals as they call them) per sample/row, then sum them, divide by the total number of observations and take the square root to bring the metric back to the original space (or you don’t in MSE).
A few attributes about this metric:
1) It is very popular– it is the metric that essentially standard linear regression optimizes/minimizes. It is also one of the oldest regression metrics.
2) It puts a heavier weight on the bigger errors. Smaller errors (that are for example less than 1.) will have an even lower contribution to the overall error after being squared, whereas bigger errors will have much more weight after being squared.
3) It is vulnerable to outliers. A large error in a given sample can have huge impact on the overall results and make an optimizer focus on reducing the error for that single sample, making the prediction for every other sample worse.
4) It is easily optimizable. This is because of the “squared” attribute, it makes it easily differentiable, something that gradient-based algorithms (like Stochastic Gradient Descent) can leverage.
5) Many well-known algorithms (like Lightgbm, Xgboost, Keras, etc), have an optimizer for it.
When to use it:
This metric is ideal when you cannot afford to have a big error. In other words, you may be comfortable having a slightly higher error on many samples as long as you never get an error that is too big. For example, when missing a prediction by (+-)200 is more than twice as bad as missing it by 100, then RMSE (or MSE) is the metric to go for.
For reference, this is what the set-up of the experiments looks like:

These are the results we get in the test data for different metrics:
A model optimized for RMSE can get an error of 3,658. Considering that the mean of the target in the training data was at the level of 20,000, this seems like a decent error. We can look at individual time series (for specific combinations of stores and departments) and see what the predictions look like.
For department 3 and store 39, we can see the actual (yellow) versus predicted (white) for the 26 weeks in the test data.

Note that there is a peak in August that also appears to be very seasonal/periodic as it has happened in every other year as well. The RMSE optimizer tried to close the gap for that prediction.
Moving on to the next metric.
MAE
The Mean Absolute Error (MAE) is also a popular regression metric. It is described as:

For each row, you subtract the prediction from the actual value and then take the absolute of that difference ensuring it is always a positive value. Then you just take the average of all these absolute differences.
1) MAE is also popular and as a bit of trivia, there is a never-ending discussion for which metric is better, MAE or RMSE. Clearly, it depends on the use-case.
3) All errors are analogously weighted in this metric. An error of 2 is twice as bad as an error of 1.
4) It is vulnerable to outliers(but less than RMSE).
5) It is not as easily optimizable. MAE is not differentiable at zero (when predictions are equal to the actuals) and depending on the distribution of the target, this may make different approximations for MAE better than others.
This metric is ideal when all errors matter in proportion to their size. This is quite often the case in finance, where a loss/error of 200$ is twice as bad as a loss of 100$. Logically this is most often the case; however, human beings can be inelastic (or elastic) in certain areas of the error, hence metrics like RMSE are also very popular.
I ran an experiment with the same default parameters, selecting MAE as the scorer. These are the results:
As can be seen, the MAE is now lower/better than in the previous experiment which optimized for RMSE (1,883 vs 2,076) and RMSE is higher/worse (3,721 vs 3,658.8). This should reinforce the statement made at the beginning of this article that a better model in one metric, does not guarantee better performance in all other metrics – which is why it is very critical to understand all available metrics and choose the right one for your business case.
This is what the series for department 3 and store 39 looks like:

Although the predictions near the peak (which are highlighted with red) are “more off” than the ones from the equivalent RMSE experiment, the errors at the edges are smaller. The MAE optimizer “sacrifices” that peak to get the other (lower in volume) samples “more correct” (in absolute terms). Obviously, it does this for many stores and departments, but even from this graph, one can understand where each metric gives a higher intensity.
MAPE
Mean Absolute Percentage Error (MAPE): MAPE measures the size of the error in percentage terms (compared with the actual values). It is essentially MAE, but as a percentage, because each absolute error is divided by the (absolute) actuals.

This is probably the trickiest regression metric I have encountered. It gives me trouble most of the time I need to work with it. I think I might get a bit emotional describing this – I hope you don’t mind that!
1) It tends to be popular among business stakeholders. That is because it is easily comprehensible and/or consumable since it is represented as a percentage. E.g. “on average we get an error of x% from our model across all channels”.
2) The smaller it is the better. It should be noted that it can take values higher than 100%.
3) All % errors are analogously weighted. An error of 20% is twice as bad as an error of 10%.
4) This error does not consider the volume/magnitude of the normal error. An error of 1,000$ where the actuals were 10,000 (e.g. 1,000$/10,000$ =10%) has the same contribution as an error of 1,000,000$ where the actuals were 10,000,000$ (e.g. 1,000,000$ /10,000,000$ =10%). On a different example, when the absolute error is 0.2$ and actual is 0.1$, the MAPE is 200%, 20 times higher than the above examples and will have 20 times more weight in the metric’s minimization. This also means that you could be reporting a smaller error, because you get all these small-volume cases close percentage-wise, while you are missing some with very high actual values by millions/billions. Every error becomes relative to the actuals.
5) The metric is not defined when the actuals are zero. There are different ways to handle this. For example, the zero actuals could be removed from the calculation or a constant could be added. Any treatment applied when the actuals are zero has its own shortcomings, hence this metric is not recommended in problems with many zeros. The Symmetric Mean Absolute Percentage Error (SMAPE) that we will examine later might be more suitable when there are many zeros. Another alternative would be to use the Weighted Absolute Percent Error (WAPE) formula. This is basically the same as MAPE, with the difference that all the errors and all the actuals are summed first, and then you calculate the ratio of the sum of absolute errors to the sum of all the actuals. That way, it is quite unlikely that the denominator will be close to zero (depending on the problem of course); however, your model may lose some of its capacity to capture the target’s variability, as it no longer focuses on individual errors and can easily ignore predictions that are “very off” since they no longer have a huge impact on the overall metric. In other words, it can be a bit too insensitive to the target’s fluctuations.
6) It is not vulnerable to outliers in the same sense that RMSE and MAE are. That is, one or two high errors may not be enough to cause this metric to “go berserk”(!), especially if the actuals are very high too. However, problems can arise from the overall distribution of the target (see next point).
7) A big range and standard deviation, with many zeros (or low values in general) in the target variable and cases that are not easily predicted, can cause this metric to explode from small percentages to huge numbers. I have often (sadly) seen MAPEs of 1000000000% (no kidding!). A difficult use case would be estimating daily stock market profit/losses for a portfolio (assuming a high budget). One day you may be winning 3 cents (0.03$), the next day 200,000$ and the day after that you lose 10,000$ (which won’t happen easily if you use our tools, because our algorithms will anticipate it 😊😊). Now imagine predicting 1,000,000$ for the next day (which is a perfectly plausible number based on historical values) and you end up making 1 cent. The MAPE for that case will be 999,999.99$/0.01$ = 9,999,999,900%! In these situations, a high range of possible values and unpredictable spikes in your target can cause MAPE to go completely “off”. Also, this will most likely halt MAPE’s optimization and force it to take some constant value. The best remedy I have seen in these situations is adding a constant value to your target in all samples, which needs to be sufficiently large to account for the possible range of values you could get. I treat this constant value as a hyperparameter for a given experiment. Setting it too high will damage your model’s ability to capture much of the variation within your target. Setting it too low will still give you abysmally high MAPEs. You need to run different experiments to find the value that works best, and remember to subtract that constant again after making predictions.
8) Not easily optimizable either.
9) Some packages have an optimizer for it. For example Tensorflow/Keras, lightgbm do. Bear in mind, there is no guarantee that these will always work – MAPE can be hard to optimize and may need a lot of tuning of the other hyper parameters of these models as well to make it work well.
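As a rough illustration of points 4 and 5, here is a minimal sketch of MAPE and WAPE applied to the three cases mentioned in point 4: an error of 1,000$ on 10,000$, an error of 1,000,000$ on 10,000,000$, and an error of 0.2$ on 0.1$. The function names are my own, and the sketch assumes no actual value is exactly zero.

```python
def mape(actual, pred):
    """Mean of per-row absolute percentage errors, in percent."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, pred)) / len(actual)

def wape(actual, pred):
    """Sum of absolute errors over sum of absolute actuals, in percent."""
    total_error = sum(abs(a - p) for a, p in zip(actual, pred))
    total_actual = sum(abs(a) for a in actual)
    return 100 * total_error / total_actual

# The three cases from point 4: per-row errors of 10%, 10% and 200%
actual = [10_000.0, 10_000_000.0, 0.1]
pred = [11_000.0, 11_000_000.0, 0.3]

mape_value = mape(actual, pred)
wape_value = wape(actual, pred)
print(round(mape_value, 2))  # 73.33
print(round(wape_value, 2))  # 10.0
```

The 0.2$ miss on the tiny actual drags MAPE up to about 73%, while WAPE stays at roughly 10% because that row barely registers in the totals. That is the trade-off described above: WAPE is stable near zero but much less sensitive to individual rows.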
I ran an experiment with the same default parameters, selecting MAPE. These are the results:
Note that the MAPE of 16.77% is the lowest encountered so far. The MAE of 1,998.4 is worse than that of the MAE experiment (1,883) and the RMSE of 3,812.6 is worse than that of the RMSE experiment (3,658.8). Once again, optimizing for MAPE makes MAPE better, but the rest of the metrics become worse compared to the experiments that optimized directly for them. Looking at the remaining metrics, as well as from experience, the MAPE optimizer’s results should be closer to those of MAE.
This is not very clear from the graph for department 3 and store 39:

What stands out about the graph is that there is almost never a zero error. The prediction line almost never touches the actual, albeit comes close to it. The previous two optimizers (RMSE and MAE) had cases where the error was zero (or very close to it). It does as well as RMSE on that peak though.
One last example before I move on to the next section. The test dataset I am scoring has 16,280 rows (i.e. roughly 16.3K different combinations of stores and departments). We saw that the MAPE was at 16.77%. The actual value for Store 10 and department 2 on 03/08/2012 was 113,930.5 and Driverless AI predicted 112,740.76. The MAPE for that row is 1.04%. If we assume that on that day there were many returns (which constitute a negative target) and/or the department was closed and the actual target was 0.01, then the MAPE for that row becomes 1,127,407,600%. The overall MAPE for all the rows now becomes 69,767.00%! That single bad prediction against the low actual target carries a huge weight in MAPE’s calculation and will make you believe that the overall model is very (VERY) bad.
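To make the effect concrete, here is a minimal sketch in plain Python. The `mape` helper and the three small rows are my own illustrative assumptions (this is not the Driverless AI implementation or the dataset above); only the Store 10 actual and prediction are taken from the text.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical toy data: three made-up rows plus the Store 10 row from the text.
actual = [100.0, 200.0, 150.0, 113930.5]
predicted = [98.0, 205.0, 147.0, 112740.76]
baseline = mape(actual, predicted)

# Same predictions, but the last actual collapses to 0.01.
actual_spike = actual[:-1] + [0.01]
exploded = mape(actual_spike, predicted)

print(f"MAPE before: {baseline:.2f}%")
print(f"MAPE after:  {exploded:.2f}%")
```

One row is enough to move the average by many orders of magnitude, which is why a single near-zero actual can dominate the metric for the whole test set.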
This metric is ideal when your target variable does not include a very big range of values and the standard deviation remains small. Ideally, the target would take positive values that would be far away from zeros with no unpredictable spikes or sudden ups or downs in its distribution. It is also useful when you want to easily explain the error in percentage terms and business stakeholders tend to like it.
SMAPE
The Symmetric Mean Absolute Percentage Error (SMAPE) can be a good alternative to MAPE. It is defined by:
SMAPE = (100%/n) * Σ |prediction – actual| / ((|actual| + |prediction|) / 2)
Unlike the MAPE, which divides the absolute errors by the absolute actual values, the SMAPE divides by the mean of the absolute actual AND the absolute predicted values. This counters MAPE’s deficiency for when the actual values can be 0 or near 0. I will not be spending too much time on this metric as it is rarely selected.
1) Not as popular as MAPE. People would still prefer MAPE even though it has its shortcomings and struggles to make it work instead of switching to SMAPE. To be fair, SMAPE is not without its shortcomings either!
2) The smaller it is the better. Note that because SMAPE includes both the actual and the predicted values, the SMAPE value can never be greater than 200%.
3) It is NOT vulnerable to outliers. At worst, a high actual compared to the predictions or a high prediction compared to the actual will be capped at 200%.
4) Might become too insensitive to the targets’ fluctuations. It is like a special MAPE case where a constant is added as explained in point 7 of MAPE’s attributes. Via always adding the prediction to the denominator, it can make the optimizer become too “relaxed” and not put much intensity to capture much of the variation within your target.
5) Not easily optimizable. For example, it is not differentiable when prediction and actuals are zero.
6) There are not many direct optimizers for this metric in well-known packages. I don’t know any, to be honest. What has been somewhat efficient was to apply a natural logarithm (plus 1) transformation to the target and optimize MAE or MAPE, which has a similar effect of reducing the impact of very high actuals. You may find this discussion on a Kaggle competition somewhat interesting on the topic. So, in practice, you use these (or other) target transformations as hyper parameters to tune against this metric.
This metric is useful when you cannot make MAPE work properly and give you sensible values, but you still want to showcase a metric that can be interpreted as a %, which makes it more consumable and simpler to understand.
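As a rough illustration of the 200% cap, here is a small sketch of SMAPE in plain Python. The `smape` function name, the toy values, and the choice to count a row where both actual and prediction are zero as zero error are my assumptions; packages may handle that edge case differently.

```python
def smape(actual, predicted):
    """Symmetric MAPE in percent; rows where both values are 0 count as 0 error."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0:  # assumption: skip the undefined 0/0 case
            total += abs(p - a) / denom
    return 100.0 * total / len(actual)

# Hypothetical values: the second row has a near-zero actual.
actual = [100.0, 0.01, 50.0]
predicted = [110.0, 1000.0, 45.0]
print(f"SMAPE: {smape(actual, predicted):.2f}%")

# A single row can never contribute more than 200%.
print(f"Worst case for one row: {smape([0.0], [1000.0]):.0f}%")
```

The near-zero actual that would send MAPE into the millions contributes just under 200% here, so the average stays bounded.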
R2(R-squared)
R squared is quite likely the first metric you come across when you start learning about linear regression and evaluation/assessment metrics for it.
Calculating the R2 value for a linear model is mathematically equivalent to:
R2 = 1 – SSE / SST
Breaking down the elements of the formula:
SSE = Σ(actual – prediction)²
In other words, SSE (also called the residual sum of squares) is the squared error (without the mean and the square root) from the RMSE formula.
SST (or the total sum of squares) can be defined as:
SST = Σ(actual – ȳ)²
Where ȳ is the mean of the target.
Going back to the R-squared formula, we essentially compare/divide the error of our model with the error produced by a very basic model that just uses the mean of the target as its only prediction. Hence this metric shows you how much better the model is than a naïve or very simple prediction. In some cases, this formula can produce a negative value (if the model is worse than just using the mean of the target).
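The comparison against the mean-only baseline can be sketched in a few lines of plain Python. The `r_squared` helper and the three sets of predictions are hypothetical, chosen only to show a good model, the baseline itself, and a model that is worse than the baseline.

```python
def r_squared(actual, predicted):
    """R-squared computed as 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst

actual = [3.0, 5.0, 7.0, 9.0]       # hypothetical target values (mean is 6)
good = [2.5, 5.5, 6.5, 9.5]         # close to the actuals
mean_only = [6.0, 6.0, 6.0, 6.0]    # always predicts the target mean
bad = [9.0, 7.0, 5.0, 3.0]          # worse than predicting the mean

for name, preds in [("good", good), ("mean_only", mean_only), ("bad", bad)]:
    print(name, round(r_squared(actual, preds), 3))
```

The mean-only model scores exactly zero, and anything that does worse than it goes negative, with no lower bound.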
1) It is very popular, it could challenge MSE in fame and is very closely related to it.
2) The higher it is, the better the model. It takes values from minus infinity to +1.
Bear in mind that even a weak model can be useful. This is where it is fitting to say that “all models are wrong but some are useful”. Sometimes a prediction that is slightly better than the average is still good enough to be useful. For example, when trying to estimate the wind speed for the next 3 hours based on recent weather attributes, even a slightly better prediction than the average wind experienced in the previous x hours can be life-saving when deciding whether airplanes should take off. In practice we do get significantly better predictions than the average in predicting weather conditions, so don’t get too worried about it!
Conversely, a very high R-squared might not be good enough to be useful. For example, a marketing company has a deal that allows it to pay a fixed amount of 1,000$ to a mailing company and send 100,000 mails every day to different people advertising its products. Out of these 100,000 mails, the company generates 10,000$ income from people that buy the advertised products. Let’s assume that most of the income comes from a small proportion of the people contacted via mail. The company could save some money by opting for a different mailing package that only sends 10,000 mails for a fixed amount of 500$ (which is smaller than the current amount of 1,000$ it pays for the 100,000 mails). The company decides to build a model that predicts the expected income generated by each of the 100,000 people, with the aim of keeping the 10,000 with the highest predictions, which would allow it to opt for the cheaper mailing package and save some money. It builds a model with a very high R-squared (i.e. 0.9) in predicting expected revenue per person. Within the 10,000 cases with the highest expected/predicted income, it can accumulate 90% (or 9,000$ out of 10,000$) of the total income that it would have received if it had contacted all the 100,000 people. This sounds like a very strong prediction (albeit not perfect). However, the 10% of the income that is now missing (which is 1,000$) and resides within the 90,000 people that won’t get contacted with the new package is higher than the cost it saves from switching packages (which is 1,000$-500$=500$). In this case, this strong model is not good enough to give the company profit and therefore is not useful.
4) It does not really tell you much about what the average error is. As stated previously, it is a measure that tells you how much better the model is than a very basic model. Hence it is advisable to track this metric along with RMSE, MAE or another measure that can also give you an estimate of the error.
5) For a metric valued for its scaling ability, it is counter-intuitive that it can take infinitely negative values.
6) It is optimizable using MSE or RMSE solvers.
7) Most tools/packages have an optimizer for it.
This metric is useful when you want to get an idea about how good your model is against a baseline that uses only the mean of the target as a prediction. Ideally, it should be accompanied by another metric that measures the error in some form. It helps to compare/rank the performances of models that could be predicting very different things.
R-squared as Pearson’s Correlation Coefficient
In order to avoid the infinitely negative values R-squared could take, which may defeat the purpose of using R-squared (and its ability to scale and compare models), within Driverless AI R-squared is computed by squaring the Pearson Correlation Coefficient. In the case of an MSE linear regression optimizer, the results should be the same as with the formula from the previous section. With other types of models R-squared could differ from this formula.
In this form, the R-squared value represents the degree that the predicted value and the actual value move in unison. The R-squared in this state varies between 0 and 1 where 0 represents no (linear) correlation between the predicted and actual value and 1 represents complete correlation. However, a disadvantage of this method (which is generally greatly minimized via using internally MSE optimizers) is that this implementation ignores the error completely in its calculation. It primarily “cares” for making predictions as analogous to the actuals as possible, ignoring the volumes. For example, these 2 models have the same R-squared.

The blue line has a much smaller error, however both models have a similar ability to anticipate changes in how the target moves. As stated before, this drawback is alleviated when R-squared is optimized using MSE-based solvers.
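The point about ignoring volumes can be reproduced with a short sketch. Everything here is hypothetical: the `pearson_r` and `rmse` helpers are written from the textbook definitions, and the two sets of predictions are invented so that one tracks the target closely and the other has the same shape at a very different level.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [10.0, 12.0, 9.0, 15.0, 11.0]       # hypothetical target
close = [a + 0.5 for a in actual]            # small constant offset
shifted = [2.0 * a + 20.0 for a in actual]   # same shape, very different level

for name, preds in [("close", close), ("shifted", shifted)]:
    r2 = pearson_r(actual, preds) ** 2
    print(f"{name}: squared Pearson = {r2:.3f}, RMSE = {rmse(actual, preds):.2f}")
```

Both sets of predictions move in perfect unison with the target, so the squared correlation cannot tell them apart, while the RMSE differs by a wide margin.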
The R2 of 0.96271 is the highest reported among all the experiments, and the actual versus predicted plot looks a lot like the one from the MAPE experiment.

RMSLE
This metric was always “nan” in all previous experiments, and for a good reason: the dataset contains negative values in the target variable. The Root Mean Squared Logarithmic Error (RMSLE) takes the log (plus 1) of the predictions and the actual values and, in effect, measures the ratio between them. The formula is defined by:
RMSLE = sqrt((1/n) * Σ(log(prediction + 1) – log(actual + 1))²)
It can also be written as:
RMSLE = sqrt((1/n) * Σ(log((prediction + 1) / (actual + 1)))²)
This is essentially the RMSE formula with the difference that the actual and the predicted values are transformed using the natural logarithm. The “plus one” element helps to include cases where the target is zero. The natural logarithm of zero cannot be defined, hence we add one. Why would applying the natural logarithm be useful?


The natural logarithm helps to bring the target values somewhat in the same (or a closer) level. In other words, this transformation penalizes harder the very big values and alleviates RMSE’s impact on these outliers (that are likely to impose/cause higher errors).
1) It is quite popular, especially in pricing and in cases where the target is positive.
2) The smaller it is, the better.
3) It puts heavier weight on the bigger errors after applying the logarithmic transformation; however, this transformation is already alleviating the potential of likely higher errors, hence the overall weighting is more balanced.
4) It is not vulnerable to outliers. Large errors are not as likely to occur because of the logarithm transformation – it almost puts a cap on high values.
5) It does not work with negative values in the target variable. The remedy is to add a constant. A quick fix is to add the absolute value of the smallest (most negative) value encountered in the target plus one. However, there might be better constants. Finding the right constant is a hyper parameter. See also attribute (7) for MAPE, as similar logic applies for RMSLE’s best constant.
6) Like SMAPE, it might become too insensitive on the targets’ fluctuations because of the heavy penalization of higher values.
7) It is easily optimizable. In most cases you apply the natural logarithm (plus one) on the target first and then use an RMSE optimizer.
8) Some algorithms have an optimiser for it, but it is not necessary, because as explained in (7) you could manually apply the logarithm transformation on the target and solve using RMSE.
This metric is ideal when you have mostly positive values with a few outliers (or high values) that you are not so interested in predicting well and MAE (as well as RMSE) seem to be critically affected by them.
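To show how RMSLE relates to RMSE, here is a minimal sketch using only the standard library. The helper names and the toy numbers are assumptions for illustration; `math.log1p(x)` computes log(1 + x).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rmsle(actual, predicted):
    """RMSE computed on log(1 + value); every input must be greater than -1."""
    log_actual = [math.log1p(a) for a in actual]
    log_predicted = [math.log1p(p) for p in predicted]
    return rmse(log_actual, log_predicted)

# Hypothetical non-negative target with one very large value.
actual = [0.0, 10.0, 100.0, 100000.0]
predicted = [1.0, 12.0, 90.0, 60000.0]

print(f"RMSE:  {rmse(actual, predicted):.2f}")
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

The RMSE is driven almost entirely by the largest row, whereas RMSLE weighs each row by its relative miss. This is also why, as point (7) says, you can train with a plain RMSE optimizer on the log-transformed target and invert the predictions afterwards (for example with `math.expm1`). Passing a value of -1 or lower raises a `ValueError`, which mirrors the negative-target limitation in point (5).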
Other metrics
There were a few metrics that were not covered, but you can find the info in the links above. Gini is better analysed in the context of probabilistic modelling (and classification problems) in another article.
RMSPE (or Root Mean Square Percentage Error) is a hybrid between RMSE and MAPE.
MER (or Median Error Rate) is the same as MAE with the difference that instead of the average we take the median.
R Square, Adjusted R Square, MSE, RMSE, MAE
Model evaluation is very important in data science. It helps you to understand the performance of your model and makes it easy to present your model to other people. There are many different evaluation metrics out there but only some of them are suitable to be used for regression. This article will cover the different metrics for the regression model and the difference between them. Hopefully, after you read this post, you are clear on which metrics to apply to your future regression model.
Every time I tell my friends: “Hey, I have built a machine learning model to predict XXX.” Their first reaction would be: “Cool, so what is the accuracy of your model prediction?” Well, unlike classification, accuracy in a regression model is slightly harder to illustrate. It is practically impossible to predict the exact value, so instead you look at how close your prediction is to the real value.
There are 3 main metrics for model evaluation in regression
1. R Square/Adjusted R Square
2. Mean Square Error(MSE)/Root Mean Square Error(RMSE)
3. Mean Absolute Error(MAE)
Coefficient of Determination (R2)
R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.
R Squared formula
R (Correlation) (source: http://www.mathsisfun.com/data/correlation.html)
from sklearn.metrics import r2_score
r2_score(Actual, Predicted)
Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors — that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.
The MSE assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). The definition of an MSE differs according to whether one is describing a predictor or an estimator.
The MSE is a measure of the quality of an estimator — it is always non-negative, and values closer to zero are better.
Mean Squared Error (MSE) Formula
from sklearn.metrics import mean_squared_error
mean_squared_error(actual, predicted)
Let’s analyze what this equation actually means.
Range of prediction
The range of the prediction is the span between the minimum and maximum of the predicted values. Even the range alone helps us compare the dispersion of predictions between models.
Mean/Median of prediction
We can understand the bias in prediction between two models using the arithmetic mean of the predicted values.
For example, the mean of predicted values of 0.5 API is calculated by taking the sum of the predicted values for 0.5 API divided by the total number of samples having 0.5 API.
In Fig.1, we can see how PLS and SVR have performed with respect to the mean. SVR predicted 0.0 API much better than PLS, whereas PLS predicted 3.0 API better than SVR. We can choose the model based on the API level of interest.
Disadvantage: The mean is affected by outliers. Use the median when you have outliers in your predicted values.
Fig.1. Comparing the mean of predicted values between the two models
In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. It has the same unit as the original data, and it can only be compared between models whose errors are measured in the same units. It is usually similar in magnitude to RMSE, but slightly smaller. MAE is calculated as:
Mean Absolute Error (MAE) Formula
from sklearn.metrics import mean_absolute_error
mean_absolute_error(actual, predicted)
It is thus an arithmetic average of the absolute errors, where yi is the prediction and xi the actual value. Note that alternative formulations may include relative frequencies as weight factors. The mean absolute error uses the same scale as the data being measured. This is known as a scale-dependent accuracy measure and, therefore cannot be used to make comparisons between series using different scales.
Note: As you see, all the statistics compare true values to their estimates, but do it in a slightly different way. They all tell you “how far away” your estimated values are from the true value. Sometimes squared differences are used and occasionally absolute values; this is because when squaring, the extreme values have more influence on the result (see Why to square the difference instead of taking the absolute value in standard deviation? or on Mathoverflow).
In MAE and RMSE, you simply look at the “average difference” between those two values. So you interpret them relative to the scale of your variable (i.e., an MAE of 1 point is an average difference of 1 point between predicted and actual).
In RAE and RSE, you divide those differences by the variation of the actual values, so a model that beats the mean baseline scores between 0 and 1, and if you multiply this value by 100, you get a 0–100 scale (i.e. a percentage).
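A small sketch can show why squaring gives extreme values more influence. The `mae` and `rmse` helpers and the numbers below are hypothetical; both sets of predictions have the same total absolute error, spread evenly in one case and concentrated in a single row in the other.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [10.0, 20.0, 30.0, 40.0]         # hypothetical values
even_errors = [12.0, 18.0, 32.0, 38.0]    # every prediction is off by 2
one_big_miss = [10.0, 20.0, 30.0, 48.0]   # same total error, all in one row

for name, preds in [("even_errors", even_errors), ("one_big_miss", one_big_miss)]:
    print(f"{name}: MAE = {mae(actual, preds):.2f}, RMSE = {rmse(actual, preds):.2f}")
```

MAE is identical for both, while RMSE doubles when the error is concentrated in one row, which is the behaviour to keep in mind when choosing between the two.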
Let’s start by considering the most basic loss function, which is nothing but the sum of the errors in each iteration. The error is the difference between the predicted value and the actual value. So the loss function will be given as:
L = Σ(Ŷ – Y)
Ŷ is the predicted value; Y is the actual value
Certainly not the best fit line you might say! But as per this loss function, this line is a best fitting line as the error is almost 0. For point 3 the error is negative as the predicted value is lower. Whereas for point 1, the error is positive and of almost the same magnitude. For point 2 it is 0. Adding all of these up would lead to a total error of 0! But the error is certainly much more than that. If the error is 0 then the algorithm will assume that it has converged when it actually hasn’t, and will exit prematurely. It would show a very small error value when in reality the value is much larger. So how can you claim that this is the wrong line? You actually cannot. You just chose the wrong loss function.
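The cancellation is easy to reproduce. The three points below are made up to match the description (point 1 predicted too high, point 2 exact, point 3 predicted too low); they are not the values from the original figure.

```python
actual = [2.0, 4.0, 6.0]      # hypothetical points 1, 2, 3
predicted = [5.0, 4.0, 3.0]   # too high, exact, too low

# Signed errors cancel; absolute and squared errors do not.
signed_sum = sum(p - a for a, p in zip(actual, predicted))
absolute_sum = sum(abs(p - a) for a, p in zip(actual, predicted))
squared_sum = sum((p - a) ** 2 for a, p in zip(actual, predicted))

print("sum of signed errors:  ", signed_sum)
print("sum of absolute errors:", absolute_sum)
print("sum of squared errors: ", squared_sum)
```

Taking the absolute value or the square of each error before summing removes the sign, which is the motivation for losses such as MAE and MSE.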
Standard Deviation of prediction
The standard deviation (SD) is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set. In contrast, a high standard deviation indicates that the values are spread out over a broader range. The SD of predicted values helps in understanding the dispersion of values in different models.
Standard Deviation Formula
In Fig.2, the dispersion of predicted values is less in SVR compared to PLS. So, SVR performs better when we consider the SD metric.
Fig.2. Comparing the standard deviation of predicted values between the two models
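The range, mean, median and standard deviation of the predictions can all be computed with Python's `statistics` module. The numbers below are invented for illustration (two unnamed models predicting tablets whose true API level is 0.5), and I am assuming the sample standard deviation (`statistics.stdev`), since the formula used for the figures is not shown.

```python
import statistics

# Hypothetical predictions from two models for tablets whose true API level is 0.5.
predictions = {
    "model_a": [0.48, 0.52, 0.50, 0.47, 0.55],
    "model_b": [0.40, 0.62, 0.51, 0.35, 0.90],
}

def summarize(values):
    """Range, mean, median and sample standard deviation of a list of predictions."""
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

for name, values in predictions.items():
    print(name, summarize(values))
```

In this made-up example the single high prediction of 0.90 pulls the mean of `model_b` above its median, which is the outlier effect mentioned earlier, and its wider range and larger SD show the greater dispersion.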
The relative squared error (RSE) is relative to what it would have been if a simple predictor had been used. More specifically, this simple predictor is just the average of the actual values. Thus, the relative squared error takes the total squared error and normalizes it by dividing by the total squared error of the simple predictor. It can be compared between models whose errors are measured in different units.
Mathematically, the relative squared error, Ei of an individual model i is evaluated by the equation:
Relative Squared Error (RSE) Formula
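A minimal sketch of the two relative metrics follows, written directly from the description above (total error of the model divided by the total error of the mean predictor). The function names and the toy values are my own assumptions.

```python
def relative_squared_error(actual, predicted):
    """Total squared error divided by that of always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    model = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline = sum((a - mean_actual) ** 2 for a in actual)
    return model / baseline

def relative_absolute_error(actual, predicted):
    """Total absolute error divided by that of always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    model = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline = sum(abs(a - mean_actual) for a in actual)
    return model / baseline

actual = [3.0, 5.0, 7.0, 9.0]      # hypothetical target values
predicted = [2.5, 5.5, 6.5, 9.5]   # hypothetical predictions

rse = relative_squared_error(actual, predicted)
rae = relative_absolute_error(actual, predicted)
print(f"RSE = {rse:.3f}, RAE = {rae:.3f}")
```

Because both numerator and denominator are in the same units, the units cancel, which is what makes these values comparable across targets measured on different scales. When R-squared is computed as one minus the ratio of the model's squared error to the baseline's, RSE is simply 1 minus R-squared.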
The very naive way of evaluating a model is by considering the R-Squared value. Suppose if I get an R-Squared of 95%, is that good enough? Through this blog, Let us try and understand the ways to evaluate your regression model.
Image source: Shravankumar Hiregoudar
Let us consider an example of predicting Active Pharmaceutical Ingredients (API) concentration in a tablet. Using absorbance units from NIR spectroscopy we predict the API level in the tablet. The API concentration in a tablet can be 0.0, 0.1, 0.3, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0. We apply PLS (Partial Least Square) and SVR (Support Vector Regressor) for the prediction of API level.
NOTE: The metrics can be used to compare multiple models or one model with different models
Overall Recommendation/Conclusion
R Square/Adjusted R Square is better used to explain the model to other people because you can explain the number as a percentage of the output variability. MSE, RMSE, or MAE are better used to compare performance between different regression models. Personally, I would prefer using RMSE and I think Kaggle also uses it to assess submissions. However, it makes total sense to use MSE if the value is not too big and MAE if you do not want to penalize large prediction errors.
Adjusted R square is the only metric here that considers the overfitting problem. R Square has a direct library in Python to calculate but I did not find a direct library to calculate Adjusted R square except using the statsmodels results. If you really want to calculate Adjusted R Square, you can use statsmodels or use its mathematical formula directly.
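If you go the formula route, a sketch could look like the following. It uses the standard textbook adjustment, 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors; the function name and the example numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R-squared from plain R-squared, sample count and predictor count."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / dof

# Hypothetical case: the same R-squared of 0.95 reached with 3 or with 30 predictors.
few = adjusted_r_squared(0.95, n_samples=50, n_features=3)
many = adjusted_r_squared(0.95, n_samples=50, n_features=30)
print(round(few, 4), round(many, 4))
```

The model that needed 30 predictors to reach the same R-squared is penalized more heavily, which is how the adjustment accounts for overfitting. With statsmodels, a fitted OLS results object exposes the same quantity as `rsquared_adj`.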
There is no perfect metric. Every metric has pros and cons. A model that gives better results in one metric is not guaranteed to give you better results in every other metric. Knowing the strengths and weaknesses of each metric can help the decision-maker find the one which is most suitable for his/her use case and optimize for that. This knowledge can also help counter drawbacks that may arise from using a specific metric and can facilitate better model-making.






