In baseball statistics, earned run average (ERA) is the mean number of earned runs a pitcher gives up per nine innings pitched. I decided to take a deeper look at what goes into a pitcher's ERA. In this study, I divided all the qualified pitchers from the last five years into two groups, those who finished in the top 10 in ERA and those who did not, as a means to determine what makes a top 10 ERA pitcher.
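For reference, the definition above works out to nine times earned runs divided by innings pitched. Here is a minimal sketch in Python, not the code used for this study; the function name and the sample line (60 earned runs over 200 innings) are made up for illustration.

```python
def earned_run_average(earned_runs: int, outs_recorded: int) -> float:
    """ERA = 9 * earned runs / innings pitched, where innings = outs / 3."""
    if outs_recorded <= 0:
        raise ValueError("ERA is undefined until at least one out is recorded")
    innings_pitched = outs_recorded / 3
    return 9 * earned_runs / innings_pitched


# Hypothetical season: 60 earned runs over 200 innings (600 outs) is a 2.70 ERA.
print(f"{earned_run_average(60, 600):.2f}")
```

Counting outs instead of innings avoids the box-score shorthand where 200.1 means 200 and one third innings.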
Using four indicators (strikeout percentage, walk percentage, left on base percentage, and BABIP, or batting average on balls in play), we can estimate the probability that a pitcher will finish in the top 10 in ERA. I ranked all the pitchers within each of the last five seasons in these categories and put the rankings into one large matrix, with one row per pitcher season. To indicate the outcome, I put a 1 if the pitcher finished in the top 10 in ERA and a 0 if he finished outside the top 10. I used each pitcher’s yearly rank instead of his actual numbers because each year’s top 10 is different, so it is important to compare pitchers on a year-to-year basis.
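The ranking and labeling step can be sketched as follows. This is an illustration in Python under my own assumptions, not the code behind the study: the field names (`season`, `era`, `k_pct`), the helper names, and the sample rows are all hypothetical, and tied pitchers are assumed to share the same rank.

```python
from collections import defaultdict


def add_season_rank(rows, stat, higher_is_better, rank_key):
    """Add a within-season rank for `stat` to each row (1 = best, ties share a rank)."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row)
    for season_rows in by_season.values():
        ordered = sorted((r[stat] for r in season_rows), reverse=higher_is_better)
        for r in season_rows:
            # Position of the first matching value gives ranks like 1, 2, 2, 4.
            r[rank_key] = ordered.index(r[stat]) + 1
    return rows


def add_top_era_label(rows, n=10):
    """Label each row 1 if its within-season ERA rank is n or better, else 0."""
    add_season_rank(rows, "era", higher_is_better=False, rank_key="era_rank")
    for r in rows:
        r["top_era"] = 1 if r["era_rank"] <= n else 0
    return rows


# Hypothetical rows; a real table would hold every qualified pitcher season.
sample = [
    {"season": 2017, "pitcher": "A", "era": 2.50, "k_pct": 30.0},
    {"season": 2017, "pitcher": "B", "era": 3.10, "k_pct": 25.0},
    {"season": 2018, "pitcher": "A", "era": 3.40, "k_pct": 27.0},
]
add_season_rank(sample, "k_pct", higher_is_better=True, rank_key="k_rank")
add_top_era_label(sample, n=10)
```

Strikeout and left on base percentage would be ranked with `higher_is_better=True`, while walk percentage and BABIP would use `False`, since lower is better for the pitcher.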
Some of the chart looks like this:
To make predictions, I used XGBoost, a gradient-boosted decision tree library, through its R package. XGBoost learns from the previous data and tests whether there is a pattern between where a pitcher ranked in each stat and whether he finished in the top 10 in ERA that season. After running the model with different parameters, we can determine two things: which of the four stats is most indicative of a top 10 ERA finish, and which pitchers were outliers, meaning the model predicted their outcome incorrectly.
First, let’s look at which stat is the most predictive of ERA rank. Surprisingly, LOB rank has the most impact on a pitcher’s ERA rank. Note that the values aren’t percentages; rather, they show the relative importance of each stat in predicting ERA.
This chart shows that where a pitcher finishes in LOB percentage is the best predictor. Interestingly enough, the pitcher who had the highest LOB percentage (he left the highest percentage of runners on base) in each of the last five years finished in the top 10 in ERA. Also, of the 27 pitchers who finished in the top five in LOB percentage (there was one three-way tie), 20 finished in the top 10 in ERA. The chart also shows that LOB rank and K rank are much more important than BB rank or BABIP rank.
Next, let’s look at the predictive aspect of the model. I ran the model using a number of different combinations of training and test data, and then had it predict the outcome for the pitchers. The model predicted around 85 percent of the pitchers correctly. Now, let’s look at a few pitchers the model predicted incorrectly and why it missed on them.
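To show how a figure like this can be read, here is a rough Python sketch of the scoring step. It is not the evaluation code from this study, and the function names and sample labels are invented. It separates the two kinds of misses discussed below (a top 10 pitcher the model left out, and an outside pitcher the model put in) and compares accuracy with always predicting 0, which is a useful yardstick because only 10 pitchers per season carry a 1.

```python
def accuracy(actual, predicted):
    """Share of pitcher seasons where the predicted label matches the actual one."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Count hits and misses separately for top 10 (1) and other (0) seasons."""
    counts = {"true_top": 0, "missed_top": 0, "false_top": 0, "true_other": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["true_top"] += 1
        elif a == 1:
            counts["missed_top"] += 1   # finished top 10, model said no
        elif p == 1:
            counts["false_top"] += 1    # model said top 10, finished outside
        else:
            counts["true_other"] += 1
    return counts


def always_zero_baseline(actual):
    """Accuracy of predicting 0 (outside the top 10) for every pitcher."""
    return accuracy(actual, [0] * len(actual))


# Invented labels for eight pitcher seasons, only to exercise the functions.
actual = [1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 0, 0, 0]
print(accuracy(actual, predicted), always_zero_baseline(actual))
print(confusion_counts(actual, predicted))
```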
Garrett Richards finished the 2014 season with a 2.61 ERA, which placed him 10th in MLB. However, based on his ranks in the four stats, the model predicted that Richards would finish outside the top 10. One explanation for why Richards finished with a good ERA is his home run rate. He had a 0.27 HR/9 rate in 2014, which was the lowest of any qualified pitcher in the last five years. So, while he allowed a lot of baserunners, few of them scored because he kept the ball in the yard. Richards has been injured the last few years, but his success has been almost completely tied to his home run rate.
Marcus Stroman had a 3.09 ERA in 2017, which placed him 9th. What Stroman lacked in strikeouts, he made up for with his ground ball to fly ball ratio and his ground ball percentage. This allowed Stroman to get easy outs without needing to strike everyone out. Since he got so many ground balls, most of the hits he gave up were singles, which limited the number of earned runs. He also induced the most double plays in 2017, which helped him get out of innings without allowing any earned runs.
One problem with this model is that it treats everyone outside the top 10 as equals. In 2015, Max Scherzer had a 2.79 ERA, which was the 11th best in MLB. Even though he finished with a great ERA, he missed the top 10 because of the number of home runs he allowed. He gave up 31 HR, which was the most in the NL. Even though he finished in the top 10 in all four of these stats, his home runs kept him out of the top 10 in ERA.
One of the more interesting results was that the model projected Mike Fiers in the top 10 in 2018 even though he had a 3.56 ERA, which ranked 24th. The reason his LOB rank was so good while he still consistently gave up runs is that he allowed the second-highest HR/9 of anyone in MLB. While the rest of his numbers look good, home runs prevented Fiers, like Scherzer, from having an elite ERA.