Predicting anything in sports is nearly impossible: factors such as injuries and human error prevent us from building a consistently accurate model of what is and isn’t going to happen. Still, year after year there are endless new programs and breakthroughs that, usually for a limited time, manage to predict a winner or star player with some degree of accuracy. To give you a little background on me: on top of being a sports enthusiast, I, like many other sports fans around the world, participate in many fantasy sports leagues. In a broad sense, this means I build my own “team” of players from various teams in one league and go up against the teams of my friends and family, and the team with the most points at the end wins. Coming up in April (or at least they were supposed to be; they are now postponed because of COVID-19) are the NHL playoffs, a favourite fantasy sport among fans. A very big strategy in these fantasy leagues is to pick players from the NHL team you think is going to win the Stanley Cup. So I tasked myself with creating an NHL predictor program in Python, with the help of NumPy and pandas, to predict which NHL team is going to win the Stanley Cup, give myself that edge and crown myself victorious over my peers.
Looking For Data
This is a quick little section giving you some background on how I started this project and the couple of other ideas I tried first. If you’re not interested (which wouldn’t surprise me; who cares about what I tried before that didn’t work), then move on to the next section, titled Starting Out. My first plan was to see where championship teams spend their money. I had my programs done, my information gathered, and graphs and tables in the works to find the trend in where the majority of teams in the last 25 years have spent their salary cap. Unfortunately, the day after I wrote the code for the graphs, I ran into a bit of an issue. The website I was getting my information from, spotrac.com, changed its privacy settings, and I now needed to create a premium account to regain access to the data I so very much sought. So I looked for other websites with the same data but came up empty-handed. Every website refused access to player-by-player salary cap information for years before 2017. The one website that allowed it, well, let’s just say it wasn’t even close to accurate. So, after much thought, I ended up changing my project into something a little different, one that still required a list of past champions from the respective sports.
Starting Out
Like any project, the first thing I had to figure out was where and how to start. There is a countless amount of data on every sport and team all over the internet, but I soon realized I didn’t even know what I was looking for in the first place. Thus began the birth of my first web scraping program, titled “get_champ_teams”, and its successor, “get_graphs”. The first program’s duties were pretty simple: it gave me a list of every Stanley Cup champion of the past x years (when you run the program you choose how many years of champions you want; we’ll see why this is important later). For simplicity’s sake, let’s say I ran the program to get the last 20 years of champions. So right now, I have a list that looks like so: [STL, 2019, WSH, 2018 …] all the way back 20 years. This list would then get fed into my “get_graphs” function, which would use each team and year pair in the list to build a URL and scrape the championship team’s stats for that year, along with the league averages from that year. I took this data from the website Hockey-Reference, where my scraper would build the URL for each page using this pattern:
www.hockey-reference.com/teams/ + TEAM + / + YEAR +.html
where TEAM is the abbreviation for each team and YEAR is the year that team won the Stanley Cup. Hockey-Reference also had a page that made it easy for me to scrape the past x Stanley Cup champions.
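To make the URL step concrete, here is a minimal sketch of how the flat champions list could be turned into page URLs. The function names `pair_champions` and `build_team_url` are made up for this illustration (they are not the names used in my programs), and the `https://` prefix is an assumption added on top of the pattern above. The sketch only builds the strings; it does not fetch anything.

```python
from typing import List, Sequence, Tuple, Union

# Host and path follow the pattern above; the https:// scheme is an assumption.
BASE_URL = "https://www.hockey-reference.com/teams/"


def pair_champions(flat: Sequence[Union[str, int]]) -> List[Tuple[str, int]]:
    """Turn ['STL', 2019, 'WSH', 2018] into [('STL', 2019), ('WSH', 2018)]."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating team, year entries")
    # Even positions hold team abbreviations, odd positions hold years.
    return [(str(team), int(year)) for team, year in zip(flat[0::2], flat[1::2])]


def build_team_url(team: str, year: int) -> str:
    """Fill TEAM and YEAR into the URL pattern."""
    return f"{BASE_URL}{team}/{year}.html"


champions = pair_champions(["STL", 2019, "WSH", 2018])
urls = [build_team_url(team, year) for team, year in champions]
print(urls[0])  # https://www.hockey-reference.com/teams/STL/2019.html
```

Pairing the list up first keeps the team and its year together, so nothing downstream has to remember which position in the list is which.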
Now in the program, you have a choice of which stats you would like to keep and which you wouldn’t. Currently, my program gets the following stats of the past championship teams: GF, GA, PP%, PK%, SV%, SO, SV% 5-on-5, CF%. If you don’t know a couple of these abbreviations, don’t worry: I’ll explain why I originally used each one and whether I kept it in the final algorithm. Let’s start with powerplay % (PP%) and penalty kill % (PK%). This was the first thing that surprised me: just 50% of championship teams in the past 20 years had a PP% better than the league average. So naturally, PP% was left out of the final predictor. PK%, on the other hand, had 85% of championship teams above the league average. Along with PK%, 90% of championship teams had their save % (SV%), their save % at 5-on-5 (SV% 5-on-5) and their goals against (GA) better than the league average. (The earliest known entry for SV% at 5-on-5 is from the year 2008.) These were all left in the final predictor. Now CF% is where things get interesting. What is CF%? In short, CF% is CF/(CF+CA), where CF is Corsi For (total shots + blocks + misses for your team) and CA is Corsi Against (total shots + blocks + misses against your team). A longer answer can be read here. So as a basic definition, if a team’s CF% was above 50%, they were controlling the puck more often than their opponents throughout the season. CF% has been tracked in the NHL since 2007, and since then, roughly 83% of championship teams have had a CF% of over 50%. With these stats, our goal is to combine them all into one algorithm that will ~hopefully~ be able to predict the Stanley Cup champion.
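As a quick illustration of the two calculations in this section, here is a small sketch of the CF% formula and of the "what share of champions beat the league average" check. The function names are hypothetical, and the numbers in the example are placeholders made up for the demo, not real NHL stats.

```python
from typing import Iterable, Tuple


def corsi_for_pct(corsi_for: int, corsi_against: int) -> float:
    """CF% = CF / (CF + CA), expressed as a percentage."""
    total = corsi_for + corsi_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * corsi_for / total


def share_better_than_league(
    pairs: Iterable[Tuple[float, float]], higher_is_better: bool = True
) -> float:
    """Fraction of (champion value, league average) pairs where the champion was better."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one season")
    better = sum(
        1
        for team, league in pairs
        if (team > league if higher_is_better else team < league)
    )
    return better / len(pairs)


# Placeholder numbers for illustration only, not real NHL stats.
print(corsi_for_pct(2200, 1800))  # 55.0

toy_pk = [(82.0, 80.0), (79.0, 80.5), (85.0, 81.0), (80.0, 79.5)]
print(share_better_than_league(toy_pk))  # 0.75

# For goals against, lower is better, so the comparison is flipped.
toy_ga = [(200, 220), (230, 225)]
print(share_better_than_league(toy_ga, higher_is_better=False))  # 0.5
```

The `higher_is_better` flag matters for a stat like GA, where "better than the league average" means a smaller number.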
Alright, so now we have our 7 stats that we’re going to use to try and predict the Stanley Cup champion. What did I do? Well, I created a new rating called the JRB rating (name still in development). This rating factors in each stat differently depending on how prominent it was in championship teams and whether the stat made it into something I denoted “elite territory” in my program. The first problem I ran into was that, for some reason, the Montreal Canadiens were always a top 5 team in the league when sorted by JRB. The reason is that not only did Montreal have one of the best CF% values in the league, but their goalie stats were also slightly above average. This brought me to my next decision: giving teams more JRB points depending on how much better they are than the league average. After all, why should Carolina, with a CF% of 54.1% (very high), get the same number of JRB points as a team such as Calgary, with a CF% of 50.1% (just barely controlling the puck more than their opponents)? Another notable adjustment came from noticing that, while teams were being rewarded for being above the average or having elite stats, teams like San Jose still had a good JRB score even though their goal differential was a brutal -45. After all these adjustments and changes, the final product was an algorithm that takes each of the 7 stats and adjusts the team’s JRB rating according to how much better or worse than the league average that stat is. A stat that is much higher than the average gets more points than a stat that is barely above the average, and a stat that is far below the average loses more points than a stat that is just barely below it. From this algorithm we can learn the winner of the 2020 Stanley Cup playoffs will be…
Sadly (I’m a Leafs fan), the predictor has chosen the Boston Bruins as this year’s Stanley Cup champion, and if you look at their stats, you’ll see why. Boston right now (March 21st, 2020) has a goal differential of +60, a top 3 PK%, the best SV% and the best SV% at 5-on-5. They’re dominant in every stat that seems to matter when winning the Stanley Cup. There simply isn’t another way to put it: Boston has the league’s best chance of winning the Stanley Cup right now.
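For anyone curious what a rating like this could look like in code, here is a rough sketch of the scoring idea: points scale with how far a stat is from the league average, stats below average cost points, and a flat bonus kicks in for “elite territory”. To be clear, this is not the actual JRB code. The `StatRule` class, the weights, the elite margins and the bonus values are all hypothetical, and the league averages and the "weak" and "barely" team lines are placeholders chosen only to show the behaviour.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StatRule:
    weight: float            # hypothetical: points per 1% relative edge over the league
    higher_is_better: bool   # False for stats such as GA
    elite_margin: float      # hypothetical: relative edge that counts as "elite territory"
    elite_bonus: float       # hypothetical: flat bonus once the elite margin is reached


def stat_points(team_value: float, league_avg: float, rule: StatRule) -> float:
    """Points for one stat, scaled by the relative distance from the league average."""
    if league_avg == 0:
        raise ValueError("league average must be non-zero")
    edge = (team_value - league_avg) / league_avg
    if not rule.higher_is_better:
        edge = -edge  # a lower value is the good direction for this stat
    points = rule.weight * edge * 100.0  # negative when the team is worse than average
    if edge >= rule.elite_margin:
        points += rule.elite_bonus
    return points


def rating(
    team: Dict[str, float], league: Dict[str, float], rules: Dict[str, StatRule]
) -> float:
    """Sum the per-stat points over every stat that has a rule."""
    return sum(stat_points(team[name], league[name], rule) for name, rule in rules.items())


# Placeholder rules and numbers, for illustration only.
RULES = {
    "CF%": StatRule(weight=3.0, higher_is_better=True, elite_margin=0.05, elite_bonus=10.0),
    "GA": StatRule(weight=2.0, higher_is_better=False, elite_margin=0.10, elite_bonus=10.0),
}
LEAGUE = {"CF%": 50.0, "GA": 220.0}

strong = rating({"CF%": 54.1, "GA": 200.0}, LEAGUE, RULES)
barely = rating({"CF%": 50.1, "GA": 219.0}, LEAGUE, RULES)
weak = rating({"CF%": 47.0, "GA": 250.0}, LEAGUE, RULES)
print(strong > barely > 0 > weak)  # True
```

Using the relative distance from the league average puts stats with very different scales (a percentage versus a goal count) on comparable footing. The sketch scores GF and GA separately rather than goal differential, because a relative distance from a league-average differential of zero is undefined.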
While stats can tell a lot about a team or a specific player, they can’t tell the whole story. There’s no stat that measures how hard players work in the corners or whether players like William Nylander try when they don’t have the puck (I really can never tell), so there will always be flaws with programs like this. While the below-average Arizona Coyotes rank 22nd in a league with 31 teams, they post a JRB rating of +470 and rank in the top 10 when sorted by this system. Why? Well, they’re scoring more goals than they’re letting in, they boast a top 5 penalty kill, and Darcy Kuemper (their goalie) is ranked 7th in the league in SV%. Although for the most part my rating system seems to work, it will take many more years of analytics and trial and error to perfect the JRB rating. Hopefully one day the NHL might use the JRB rating as its version of Next Gen Stats, providing insight into which teams are playing their most elite hockey and who has the best chance of winning the Stanley Cup.
The following is the final list of every team in the NHL with its corresponding JRB rating. These ratings will be updated as the system becomes more accurate over the years.