Since the popularization of the box score in the 19th century, baseball fans have had a continually growing love affair with statistics. Those of us who grew up in the new millennium have had the opportunity to watch the statistical revolution drive the sport to even greater heights, popularized largely through Brad Pitt’s stellar performance in the 2011 blockbuster hit “Moneyball” as Oakland A’s general manager Billy Beane, who reinvented his small-market team by diving deeper into the statistics of the game than ever before. As we look into the present state of our favourite team, the Toronto Blue Jays, we hope to use a form of this popular sabermetric analysis to find a silver lining amid both the team’s poor performance over the past seasons and the delay of the MLB 2020 season due to the recent COVID-19 pandemic. Currently, the Blue Jays’ roster and farm system are stacked with rookies who show great potential – at least that’s what we, as fans, choose to believe. To confirm our hopes for the team’s future, we have chosen to determine which MLB rookie stat serves as the greatest indicator of career success in the league. We will use Python and its BeautifulSoup, numpy, and pandas libraries to create models based on Machine Learning methods which will reveal important correlations, validate our results, and ultimately, be capable of predicting the future of the Toronto Blue Jays.
Looking for Data
We began our search for data seeking to answer the question “What factors in a rookie season lead to a successful career in the major leagues?” We decided to find the statistics of a position player five years into their MLB career and attempt to find a relationship with their rookie season statistics. As data collection began, we soon realized that a significant proportion of players who played full rookie seasons in the MLB did not make it to the five-year marker we had set. We decided to incorporate this information into our research by posing two separate questions: “What factors in a rookie season can determine if a player will remain in the MLB?” and “What factors in a rookie season lead to success in the MLB?” With these goals in mind, we collected data for both scenarios, first finding rookie data from 2000-2014. In order to determine if a player would remain in the MLB for five years, we needed to create a program to gather player career lengths from FanGraphs and store the results. Finally, we found the statistics of each rookie’s fifth year in the MLB, ranging from 2005-2019.
To answer the first research question, our first task in Python was to take each rookie from 2000-2014 and map their career to success (1) or failure (0), where success simply implies their presence in the Major League after five seasons. We achieved this through a simple Python script, using BeautifulSoup (bs4) as a web scraping tool. Iterating through each rookie player from 2000-2014, the program opened their FanGraphs page and scraped the HTML. It then checked the “Service Time” field, a label which indicates how long a player had played in the Major League up to the end of the 2019 season. If the player’s service time was ≥ 5.0, the program added a 1 to a new column in their statistical report called “five_years”; otherwise, they received a 0 in that column.
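The labelling step can be sketched roughly as follows. This is a minimal illustration rather than the script from the repository: the HTML snippet is a hypothetical stand-in (the real FanGraphs markup is not shown here), and the helper names `service_time_from_html` and `five_years_flag` are our own inventions for this example.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a scraped player page.
# The real page layout is an assumption and will differ.
SAMPLE_PAGE = """
<div class="player-info">
  <div><span class="label">Service Time</span> <span class="value">6.027</span></div>
</div>
"""


def service_time_from_html(html):
    """Return the number following the 'Service Time' label, or None if absent."""
    # Flatten the page to text so the lookup does not depend on exact tag nesting.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    match = re.search(r"Service Time\D*?(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def five_years_flag(service_time):
    """Map a service time to 1 (reached five years) or 0 (did not, or unknown)."""
    return 1 if service_time is not None and service_time >= 5.0 else 0


flag = five_years_flag(service_time_from_html(SAMPLE_PAGE))
print(flag)
```

In a full run, the same two helpers would be applied to each rookie’s downloaded page and the resulting flag written to the “five_years” column.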
The coding for our second research question began by fine-tuning the data that had already been collected. Since some players didn’t reach their fifth year, we wrote a simple program to eliminate any players for which we did not have sufficient data. This was done by comparing player_id values between the spreadsheet of rookie stats and the spreadsheet of stats from five years later. After this was completed, each player line in the spreadsheet had accurate stats from their rookie season and fifth year. Unfortunately, we soon realized that statistics from a rookie season alone were simply not enough to construct an accurate model to predict fifth year production. This early result told us that a player’s offensive production cannot be measured accurately by their rookie season alone: in their first season after the minors, players spend time adjusting to the new, heightened level of play. Thus, to improve the accuracy of the model, minor league statistics were added to the training and testing data. Once we appended these to our master spreadsheet, coding began for the final model.
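One way to express the player_id comparison is an inner join in pandas, which keeps only the players present in both spreadsheets. The sketch below uses tiny made-up tables; the column names other than `player_id` and all of the values are illustrative assumptions, not data from the project.

```python
import pandas as pd

# Toy stand-ins for the three spreadsheets (values are made up for illustration).
rookie = pd.DataFrame(
    {"player_id": [101, 102, 103], "rookie_AVG": [0.270, 0.241, 0.305]}
)
fifth_year = pd.DataFrame({"player_id": [101, 103], "fifth_WAR": [2.4, 5.1]})
minors = pd.DataFrame(
    {"player_id": [101, 102, 103], "minors_wRC_plus": [118, 96, 140]}
)

# Inner join: players without fifth-year stats (here, 102) are dropped.
master = rookie.merge(fifth_year, on="player_id", how="inner")

# Append minor league stats to the surviving rows.
master = master.merge(minors, on="player_id", how="left")

print(master)
```

An inner join does the filtering and the column alignment in one step, so there is no need to loop over IDs by hand.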
For the model which analyzed the statistics that play a role in determining whether a player makes a five-year career, we began by reading the csv file with the additional “five_years” column into the program as a numpy array. A correlation matrix was computed and the relationship between every stat and “five_years” was noted. Right away, we noticed that none of the variables showed a high correlation, a quick reminder that the future in sports will always be challenging to predict. Subsequently, any feature that showed a correlation greater than 0.03 with “five_years” was added to a feature list to be used for the model, while “five_years” remained the target variable. We used the train_test_split function to do an 80/20 division for training with four models: GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier, and DecisionTreeClassifier. We then coded an ensemble learning method by passing all the models to a VotingClassifier before testing. After fitting the model and evaluating it on the testing data, a classification report was generated to show the accuracy of the model, and the importance of each feature was printed. Due to a low initial testing accuracy, we checked the importance of each feature in the model and removed the lowest few from the feature list, including HBP (hit by pitch), SH (sacrifice hits), CS (caught stealing), and IBB (intentional walks). Each player’s team was also mapped to a value based on their division to see if this additional information could improve the model’s prediction accuracy. Once we were satisfied with our model, feature importance could finally be determined.
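The pipeline described above can be sketched with scikit-learn as follows. This is an illustrative outline on synthetic data, not the project’s code: the generated stats, the use of the absolute value of the correlation, hard voting, and the fixed random seeds are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic rookie stats (made up) with a target loosely tied to AVG and SO.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame(
    {
        "SO": rng.integers(20, 180, n),
        "AVG": rng.normal(0.255, 0.03, n),
        "HBP": rng.integers(0, 15, n),
    }
)
signal = (df["AVG"] - 0.255) / 0.03 - (df["SO"] - 100) / 50 + rng.normal(0, 1, n)
df["five_years"] = (signal > 0).astype(int)

# Keep features whose correlation with the target exceeds 0.03.
# Assumption: the threshold is applied to the absolute value.
corr = df.corr()["five_years"].drop("five_years")
features = corr[corr.abs() > 0.03].index.tolist()

# 80/20 split between training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["five_years"], test_size=0.2, random_state=0
)

# Ensemble of the four classifiers; hard (majority) voting is an assumption.
voter = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
print(classification_report(y_test, voter.predict(X_test)))

# VotingClassifier has no feature_importances_ of its own, so this sketch
# reads them from one fitted member (the random forest).
rf_importances = dict(
    zip(features, voter.named_estimators_["rf"].feature_importances_)
)
print(rf_importances)
```

Reading importances from a single member is only one option; averaging them across the tree-based members would be another reasonable choice.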
To build our model to predict a rookie’s success five years into their MLB career, we began with our csv file containing each player’s minor league, rookie season and fifth year statistics. First, the spreadsheet was read into our Python program and the target variable was set as their fifth season WAR (Wins Above Replacement). We decided upon WAR as our target variable since it is the sole statistic that encompasses all aspects of a player’s value, both defensive and offensive. In our first test model, we included every available variable from both their rookie season and minor league career in order to determine which ones were the most significant indicators and therefore should be kept. The train_test_split function was then used to divide the data into 80% for training and 20% for testing. Finally, we used the Random Forest Regressor to generate our model and determine feature importances.
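A minimal sketch of this regression step is shown below, again on synthetic data. The column names, the generated values and the random seeds are assumptions for illustration only; the point is the shape of the workflow (split, fit, score, rank features).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic features (made up); the target depends mostly on wRC+, then Def.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame(
    {
        "minors_wRC_plus": rng.normal(100, 20, n),
        "rookie_Def": rng.normal(0, 5, n),
        "rookie_HBP": rng.integers(0, 15, n).astype(float),
    }
)
y = 0.05 * (X["minors_wRC_plus"] - 100) + 0.1 * X["rookie_Def"] + rng.normal(0, 0.5, n)

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)

# For a regressor, .score() returns R^2 rather than a classification accuracy.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Rank the features by their importance in the fitted forest.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(train_r2, test_r2)
print(importances)
```

A gap between the training and testing scores, as in this toy run, is typical of random forests grown to full depth and is worth keeping in mind when reading the importances.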
The complete code can be found at https://github.com/catherinewu12/MLB-RookieFuturePredictor
Results and Discussion
The first model had a high training accuracy score of ~0.98 and a testing score which topped out at ~0.65 after attempting all the changes referred to in the Code section. From the feature importance, we found that throughout every version of the model, SO (strikeouts) remained the consistent top variable of importance. Figure 1 shows the importance of each variable to our model as a percentage. The specific values can be seen in Table 1.
These results show that SO and AVG are the most important factors when considering a rookie’s potential to still be playing in the MLB five years down the road. Looking at previous research, Chandler et al. showed that strikeouts per at-bat in rookie ball and AVG in Triple-A ball are highly important stats for predicting early draft picks’ futures in the MLB, which is reflected in the results of our model as well. It is pretty clear for most of us that AVG would be an important indicator of a player’s talent and potential in their rookie year, as it displays the player’s success at the plate. Conversely, strikeouts may seem like a much more trivial statistic; however, we need to remember that a strikeout stat relates to both a player’s ability to make contact against high-level pitchers and their discipline at the plate, skills that a successful rookie needs to possess.
Our second model’s goal was to determine the statistics that demonstrated the highest correlation to WAR five years into a player’s career. The final model had a training score of approximately 0.90 and a testing score of approximately 0.30. Considering the difficulty of predicting a player’s WAR and the countless variables that could affect this number, we were satisfied. The training score of 0.90 also confirmed that we could be quite confident in the accuracy of our feature importances. Overall, the statistic that consistently took the top spot in our model was minor league wRC+ (Weighted Runs Created Plus). As the MLB describes, wRC+ “combines a player’s ability to get on base with his ability to hit for extra bases. Then it divides those two by the player’s total opportunities and adjusts that number to account for important external factors – like ballpark or era”. While minor league wRC+ topped the feature importances by a significant margin, the top five statistics found were Weighted Runs Created Plus (minors), Defensive Runs Above Average (rookie), Wins Above Replacement (rookie), Batting Average on Balls in Play (minors) and Weighted Stolen Bases (minors). This can be seen in the figure below.
These results make a great deal of sense when put into context. The top two statistics are essentially comprehensive offensive and defensive measures of a player. wRC+ computes a player’s overall offensive capabilities, as it combines hits, walks and total bases. Def determines the number of runs saved by a player defensively and adjusts it based on their fielding position. Since defensive performance is not much affected by the jump from the minor leagues to the major leagues, it is understandable that this variable comes from rookie season stats. Since these two made up a significant proportion of total importance, we also plotted each player’s rookie season Def against their minor league wRC+, coloured by their fifth year WAR. From the figure below, we can see that those with higher later WAR values (red) generally had a high value of one of wRC+ and Def, or had a high combination of the two.
Another interesting point to explore from these results is the importance of wSB, an estimate of the number of runs a player contributes to his team by stealing bases. This was particularly intriguing, as today we live in an era of long ball hitters. We don’t expect our heavy hitters to contribute by stealing a base here or there, or perhaps going first to third, but it clearly makes an important contribution to their team’s wins. This demonstrates the importance of manufactured runs, so maybe someday soon we’ll see a resurgence of small ball in the MLB. Looking back on previous research in this domain, we see that in the study “KATOH: Forecasting Major League Hitting with Minor League Stats”, author Chris Mitchell produces similar results, displaying the importance of minor league BABIP, BB% and SB (having not included wRC+ in his study).
Finally, we have reached the most exciting section: making predictions for some of the 2019 season’s rookies. As Jays fans, we have included some of our highest hopes for the future: Vladimir Guerrero Jr. and Bo Bichette, as well as some other players to ensure our results aren’t skewed (Let’s just say not all Blue Jays rookies are going to be stars). We’ve also included some of 2019’s most impressive rookies to see what the future may hold for the league’s most promising talent.
Fortunately for Jays fans, our model predicts that Bo Bichette is going to be one of the best future players in the league, along with stars like Victor Robles and Juan Soto. When baseball season comes back to us (hopefully soon!), we will have to keep our eye on these promising players!
From this project, we were able to use the power of data and Machine Learning to determine and confirm the importance of many rookie stats in the MLB. However, despite our fascinating results and hopeful predictions, it is important to remember that no matter how much manipulation and AI we put the numbers through, the game of baseball could never be fully explained simply by our code. The nuances of balancing defence and offence, combined with unforeseeable injuries and other completely unpredictable behind-the-scenes situations (see COVID-19 and trash can banging), all contribute to each player’s potential future in the MLB.
For our specific models, a deeper analysis of defensive abilities and fielding position should be undertaken in the future. Furthermore, a study on pitchers could also be explored; maybe someone could produce a timeline for Aaron Sanchez’s next blister?
Ultimately, we, as programmers, hope to continue gathering data and engineering features as we augment the predictive capabilities of our models, while we, as fans, will take these predictions and excitedly look towards the next five years of Blue Jays baseball.
Chandler G, Stevens G. An Exploratory Study of Minor League Baseball Statistics. J Quant Anal Sports. 2012;8. doi:10.1515/1559-0410.1445.
What is a Runs Created (RC)?: Glossary. Major League Baseball. Available from: http://m.mlb.com/glossary/advanced-stats/runs-created
Mitchell C. KATOH: Forecasting Major League Hitting with Minor League Stats. The Hardball Times. 2014.
Weinberg N. Def. FanGraphs Sabermetrics Library. 2014. Available from: https://library.fangraphs.com/defense/def/