Hockey

The Stanley Cup Formula: An investigation through machine learning by Guest User

Crosby Cup.jpeg

By Owen Kewell

NHL seasons follow a formulaic plotline.

Entering training camp, teams share a common goal: win the Stanley Cup. The gruelling 82-game regular season separates those with legitimate title hopes from those whose rosters are insufficient, leaving only the sixteen most eligible teams. The attrition of playoff hockey gradually whittles down this number until a single champion emerges victorious, battle-tested from the path they took to win hockey’s top prize. Two months off, then we do it all again.

Teams that have won the Stanley Cup share certain traits. Anecdotally, it’s been helpful to have a dominant 1st line centre akin to Sidney Crosby, Jonathan Toews or Anze Kopitar. Elite puck-moving defensemen don’t hurt either, nor does a hot goalie. Delving deeper, though, what do championship teams have in common?

I decided to answer this question systematically with the help of some machine learning.

Some background on classification

Classification is a popular branch of supervised machine learning where one attempts to create a model capable of making predictions on new data points. We do this by building up, or ‘training’, the model using historical data, explicitly telling the model whether each past data point achieved the target class that we’re trying to predict. In the context of hockey, this data point could be some number of team statistics produced by the 2015 Chicago Blackhawks. The target here would be whether they won the Stanley Cup, which they did.

Sufficiently robust classification models can identify a number of statistical trends that underpin the phenomenon that they’re observing. The models can then learn from these trends to make reasonably intelligent predictions on the outcome of future data points by comparing them to the data that the classifier has already seen.

Building a hockey classifier

We can apply these techniques to hockey. We have the tools to train a model to learn which team statistics are most predictive of playoff success. To do this, we must first decide which stats to include in our dataset. To create the most intelligent classifier, we decided to include as many meaningful team statistics as possible. Here’s what we came up with:

team stats.jpg

It’s worth noting that we engineered the ‘Div Avg Point’ feature by calculating the average number of points earned by all teams in a given team’s division. The remaining statistics were sourced from Corsica and Natural Stat Trick. An explanation of each of these stats can be found in the glossaries of the two websites.
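As a rough illustration of how a feature like ‘Div Avg Point’ could be derived, here is a minimal Python sketch. The team names, divisions, and point totals are made up, and the function name is hypothetical; the article does not describe the code actually used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (team, division, regular-season points).
# The numbers are invented for illustration only.
standings = [
    ("Team A", "Atlantic", 100),
    ("Team B", "Atlantic", 90),
    ("Team C", "Central", 110),
    ("Team D", "Central", 80),
    ("Team E", "Central", 92),
]

def division_average_points(rows):
    """Map each team to the mean points of all teams in its division."""
    points_by_division = defaultdict(list)
    for _team, division, points in rows:
        points_by_division[division].append(points)
    division_mean = {d: mean(p) for d, p in points_by_division.items()}
    return {team: division_mean[division] for team, division, _points in rows}

div_avg_point = division_average_points(standings)
for team, value in div_avg_point.items():
    print(team, value)
```

Every team in a division receives the same value, so the feature describes the strength of a team's surroundings rather than the team itself.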

Our dataset included 210 data points: 30 teams per season over the 7 seasons between 2010-11 and 2016-17. Each data point included team name, the above 53 team stats, and a binary variable to indicate whether the team in question won the Cup. Using this data, we trained nine different models to recognize the statistical commonalities between the 7 teams whose seasons ended with a Stanley Cup championship. The best-performing model was a Logistic Regression model trained on even-strength data, and so all further analysis was conducted using this model.

Results: Team stats that matter most to Stanley Cup winners

To evaluate which team stats were most strongly linked to winning a Cup, we created a z-score standardized version of our team data. We then calculated the estimated coefficients that our logistic regression model assigned to each team stat. Because every stat is on the same standardized scale, the size of these coefficients indicates the relative importance of different team stats in predicting Stanley Cup champions. The five highest-ranking team stats can be seen below:
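The procedure described here (standardize, fit, compare coefficient sizes) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the eight season lines are invented, only two stats are used, and the fit uses simple gradient descent with a small L2 penalty as a stand-in, since the article does not say how its logistic regression was trained.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    scales = [(mean(col), pstdev(col)) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, scales)] for row in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000, l2=0.1):
    """Fit weights and intercept by batch gradient descent on L2-regularized log loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(k):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up season lines: [GF/60, GA/60], plus 1 if the team won the Cup.
feature_names = ["GF/60", "GA/60"]
raw_stats = [[2.9, 2.4], [2.8, 2.6], [2.5, 2.3], [2.4, 2.5],
             [2.3, 2.6], [2.2, 2.4], [2.6, 2.7], [2.1, 2.5]]
won_cup = [1, 1, 0, 0, 0, 0, 0, 0]

X = zscore_columns(raw_stats)
weights, intercept = fit_logistic(X, won_cup)

# Rank stats by the absolute size of their standardized coefficient.
ranking = sorted(zip(feature_names, weights), key=lambda fw: abs(fw[1]), reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:+.3f}")
```

Without the standardization step, a stat measured in large units would get a small coefficient regardless of how predictive it is, so the ranking would not be meaningful.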

top 5 team stats.jpg

Of all team statistics, ‘Goals For Per 60 Minutes’, or GF/60, is most predictive of winning a Stanley Cup. Of the 7 champions in the dataset, 4 ranked within the top 5 league-wide in GF/60 in their respective seasons, with 2016-17 Pittsburgh most notably leading the league in the statistic. Strong results in ‘High Danger Chances For’ and ‘Team Wins’ also correlate strongly with playoff success, while ‘Scoring Chance For Percentage’ and ‘Shots on Goal For Percentage’ round out the top 5.

What does it mean?

Generating a list of commonalities among past champions allows us to comment on what factors impact a team’s likelihood of going all the way. Most apparent is the importance of offense. It is more important to generate goals and high-danger chances than it is to prevent them: GA/60 and HDCA rank 36th and 13th among all statistics, respectively, while their offensive counterparts, GF/60 and HDCF, rank 1st and 2nd. In the playoffs, the best team offense tends to trump the best team defense, which we saw anecdotally in last year’s Pittsburgh v Nashville Final. If you want to win a Stanley Cup, the best defense is a good offense.

offense vs defense.jpg

We can see that a team’s ability to generate scoring chances, both high-danger and otherwise, is more predictive of playoff success than its ability to generate shots. Although hockey analytics pioneers championed the use of shot metrics as a proxy for puck possession, recent industry sentiment has shifted towards the belief that shot quality matters more than shot volume. The thinking here, which is supported by the above results, is that not all shots have an equal chance of beating a goalie, and so it is more important to generate a shot with a high chance of going in than it is to generate a shot of any kind. Between a team that can consistently out-chance opponents and a team that can consistently out-shoot opponents, the former is more likely to win a hockey game, and therefore a playoff series.

Application: The 2017-18 season

A predictive model isn’t very helpful unless it can make predictions. So let’s make some predictions.

By feeding our model the team stats produced by the recently-completed 2017-18 regular season, we can output predictions of each team’s likelihood of winning the 2018 Stanley Cup. Since this is the fun part, let’s get right to the probability estimates for all 31 NHL teams:

probability estimates.jpg

The rankings above essentially indicate how similar each team’s season was to the regular seasons of teams that went on to win it all. In doing so, they aim to identify the teams most likely to replicate this success. The model favours the Boston Bruins to win the 2018 Stanley Cup, predicting a victory over the Nashville Predators in the Final.
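To make the step from fitted model to probability ranking concrete, here is a small Python sketch. The coefficients, intercept, and team stat lines are all invented for illustration, and the function name is hypothetical; none of the values come from the article's model.

```python
import math

# Hypothetical fitted coefficients for standardized stats (assumed values).
coefficients = {"GF/60": 1.2, "HDCF": 0.9, "Wins": 0.8}
intercept = -3.0

# Hypothetical z-scored team stats for a new season.
teams = {
    "Team A": {"GF/60": 1.5, "HDCF": 1.0, "Wins": 1.2},
    "Team B": {"GF/60": 0.2, "HDCF": -0.3, "Wins": 0.5},
    "Team C": {"GF/60": -1.0, "HDCF": -0.8, "Wins": -1.1},
}

def cup_probability(stats):
    """Logistic regression prediction: sigmoid of the weighted sum of stats."""
    z = intercept + sum(coefficients[name] * stats[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

# Sort teams from most to least similar to past champions.
ranking = sorted(teams, key=lambda team: cup_probability(teams[team]), reverse=True)
for team in ranking:
    print(f"{team}: {cup_probability(teams[team]):.3f}")
```

Each team is scored independently, so the estimates are not forced to sum to 1 across the league; they are best read as a relative ordering.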

The above data highlights a few curiosities. Notably, we can see that some non-playoff teams had 5-on-5 numbers that were relatively comparable to past Cup champions. Specifically, the Blues, Stars, and Flames played 5-on-5 hockey well enough this season to qualify for the playoffs. The Blues and Flames can attribute their disappointingly long off-seasons to the 30th and 29th-ranked power plays, respectively. The Stars’ implosion is more of a statistical anomaly, and while conducting an autopsy would be interesting it would be better served as a subject for another article.

The lowest-ranked teams to have made the playoffs in the real world are the New Jersey Devils and the Washington Capitals. While their offensive star power might have been enough to get these squads to the dance, the model predicts a quick exit for them both.

A computer-generated bracket:

2018bracket.jpg

For fun, I’ve filled out the above bracket using the class probability rankings generated by our model. Of the 8 teams who have won or are winning their first-round playoff series, the model picked 7 as the winner, with Philadelphia being the exception. While it’s far too early to comment on the model’s accuracy, as only a single playoff series has been completed, it’s an encouraging start.
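Filling out a bracket from probability rankings amounts to advancing the higher-rated team in every matchup. A minimal Python sketch of that idea follows, assuming a simple fixed bracket with a power-of-two number of teams; the team labels, probabilities, and function name are hypothetical.

```python
def fill_bracket(bracket_order, probability):
    """Advance the higher-probability team from each pairing until one remains.

    `bracket_order` lists teams so that neighbours meet in round one.
    Returns the list of winners for each round.
    """
    rounds = []
    current = list(bracket_order)
    while len(current) > 1:
        pairings = zip(current[::2], current[1::2])
        winners = [max(pair, key=probability.get) for pair in pairings]
        rounds.append(winners)
        current = winners
    return rounds

# Invented probabilities for a four-team example.
probability = {"A": 0.30, "B": 0.10, "C": 0.20, "D": 0.25}
rounds = fill_bracket(["A", "B", "C", "D"], probability)
print(rounds)
```

Because the same number decides every series, a bracket built this way can never predict an upset relative to the model's own ranking.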

Limitations of the analysis

The above results must be considered in the appropriate context. The model was trained and tested using only 5-on-5 data, which would explain the lack of love for teams with strong special teams like Pittsburgh and Toronto. The model is also blind to the NHL’s playoff format. Due to the NHL’s decision to have teams play against their divisional foes during the first two playoff rounds, teams in strong divisions have a much harder road to winning a Cup. Consider that Minnesota’s path to the conference final would likely involve Winnipeg and Nashville in the first two rounds, who finished 2nd and 1st in NHL standings in the regular season. Divisional difficulty is not reflected in the probabilities listed above, though incorporating divisional difficulty either probabilistically or through a strength of schedule modifier could be areas of further analysis.

A final limitation of the model is that it is trained using only 7 champions. In an ideal world, we would have access to dozens or hundreds of Stanley Cup positive instances, but due to the nature of the game there can only be one champion per year. We considered extending the dataset backwards past 2011 but ultimately decided against doing so. The NHL is different today than it was in the past. Training a model on a champion from 2000 tells us little about what it takes to have success in 2018. Using 2010-11 onwards represented a happy medium in the trade-off between data relevance and quantity.

What next?

Winning a Stanley Cup remains an inexact science. While it’s valuable to identify trends among past winners, there is no guarantee that what’s worked before will work again. It’s a game of educated guesses.

I believe that the most legitimate way to build a Stanley Cup winner is a combination of the past and the future. Analyzing historical data to identify team traits that are predictive of a championship is half the battle. The rest is anticipating what the future of the NHL will look like. The champions of the next few years will be led by managers who are best able to identify what it’ll take to win in the modern NHL. While the above framework approaches the first half in a systematic way, the latter remains much harder to crystallize.

In the meantime, let’s turn to what’s in front of our eyes. The playoffs have been tremendously entertaining thus far, and that’ll only pick up as teams are threatened by elimination. Let’s enjoy some playoff hockey. Let’s see which playing styles, tactics, and matchups seem to work. Let’s learn.

Even if your team gets eliminated, remember that this season’s playoffs are just a couple of months away from being data points to train next season’s model.

Then we do it all again.

Cover photo credited to Christopher Hanewincke — USA Today Sports

What's a Corsi Anyway?: An Intro to Hockey Analytics by Guest User

By Owen Kewell, Scott Schiffner, Adam Sigesmund (@Ziggy_14), Anthony Turgelis (@AnthonyTurgelis)

Advanced statistics have picked up steam and shifted into the mainstream of hockey over the past decade. Many NHL teams now employ full-time analytics staff dedicated to breaking down the numbers behind the game. So, what makes analytics such a powerful tool? Aside from helping you dominate your next fantasy hockey pool, advanced statistics provide potent insights into what is really causing teams to win or lose.

Hockey is a sport that has long been misunderstood. Its gameplay is fundamentally volatile, spontaneous and difficult to follow. There are countless different factors that contribute to a team’s chances of scoring a goal or winning a game on a nightly basis. While many in Canada would beg to differ, ice hockey still firmly occupies last place in terms of revenue and fan support amongst the big four major North American sport leagues (NFL, MLB, NBA, & NHL). As such, hockey is on the whole overlooked and is often the last to implement certain changes that come about in professional sports. The idea of a set of advanced statistics that would offer better insights into the game arose as other major sports leagues, starting with Major League Baseball, began looking beyond superficial characteristics and searching for the underlying numbers influencing outcomes. Coaches, players, and fans alike have all been subjected over the years to an epidemic failure to truly understand what is happening out there on the ice. This is the motivation behind the hockey analytics movement: to use data analysis to enhance and develop our knowledge of ice hockey and inform decision-making for the benefit of all who wish to understand the sport better.

Another barrier to progress in the field of hockey analytics is the hesitance of the sport to embrace modern statistics. Most casual fans are familiar with basic stats such as goals, assists, PIM, and plus/minus. But do these stats really tell the full story? In fact, most of these are actually detrimental to the uninformed fan’s understanding of the game. For starters, there is usually no distinction between first and second assists in traditional stat-keeping. A player could have touched the puck thirty seconds earlier in the play or made an unbelievable pass to set up a goal, and either way it still counts as a single assist on the scoresheet. Looking only at goals and assists can be deceiving; we need more reliable, repeatable metrics to determine which players are most valuable to their teams. Advanced stats are all about looking beyond the surface and identifying what’s actually driving the play.

So, what are these so-called “advanced stats”? Let’s start with the basics.

PDO: PDO (it doesn't stand for anything) is defined as a team’s save percentage (usually 5v5) plus its shooting percentage, with a league average of 1.000 (also commonly written as 100). If you only learn one concept, it's this one. It is usually regarded as a measure of a team or player’s luck, and can be a useful indicator of whether a player is under- or over-performing and whether a regression to the mean (back towards 1.000) is likely. This will not happen in every situation, of course, but watch for teams that have astronomical PDOs to hit a reality check sooner rather than later. Team PDO stats can be found on corsica.hockey’s team stats page.

Without trying to scare anyone, the Toronto Maple Leafs currently boast the 4th highest PDO at 101.85. To help ease your mind a bit, the Tampa Bay Lightning, who are considered the team to beat in the East, have the highest PDO at 102.35, with a decent gap between them and second place. They too could be getting better results than their true level of play warrants; time will tell.
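For readers who want to compute PDO themselves, here is a minimal Python sketch of the definition above, using the 100 scale. The goal and shot totals are invented, the function name is hypothetical, and "shots" here means shots on goal.

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """Return PDO on the 100 scale: shooting percentage plus save percentage."""
    shooting_pct = 100.0 * goals_for / shots_for
    save_pct = 100.0 * (1 - goals_against / shots_against)
    return shooting_pct + save_pct

# Made-up 5v5 totals: 9.0% shooting and a 92.0% save percentage.
example = pdo(goals_for=180, shots_for=2000, goals_against=160, shots_against=2000)
print(round(example, 2))
```

Since every goal one team scores is a goal another team allows, shooting and save percentages across the league balance out, which is why the league-wide average sits at 100.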

Corsi: You may have heard terms like Corsi and/or Fenwick being thrown around before. These are core concepts that are fundamental to understanding what drives the play during a game. Basically, Corsi approximates puck possession by counting the total shot attempts for and against your team. It can also be tracked for the time a specific player is on the ice.

A shot attempt is defined as any time the puck is directed at the goal, including shots on net, missed shots, and blocked shots. A Corsi For percentage (CF%) above 50% is generally seen as positive, as you are generating more shot attempts than you are allowing.

Corsi stats are typically kept in the following ways: Corsi For (CF), Corsi Against (CA), +/-, and CF%. An example of how CF% can be useful is when evaluating offensive defensemen. Sometimes, these players are overvalued because of their noticeable offensive production, while failing to consider that their shaky defensive game offsets the offensive value they provide. 
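The four Corsi figures listed above follow directly from shot-attempt counts. Here is a minimal Python sketch; the function name and the sample counts are hypothetical, and "blocked_for" means the team's own attempts that an opponent blocked.

```python
def corsi_summary(on_net_for, missed_for, blocked_for,
                  on_net_against, missed_against, blocked_against):
    """Return CF, CA, the +/- differential, and CF% from shot-attempt counts."""
    cf = on_net_for + missed_for + blocked_for
    ca = on_net_against + missed_against + blocked_against
    return {"CF": cf, "CA": ca, "+/-": cf - ca, "CF%": 100.0 * cf / (cf + ca)}

# Invented totals for one player's time on the ice.
line = corsi_summary(30, 12, 13, 25, 10, 10)
print(line)
```

CF% is usually the most convenient of the four for comparing players, because it does not grow simply because a player logs more ice time.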

Fenwick: Fenwick is similar to Corsi, but excludes shot attempts that are blocked. Of course, with both of these stats, one should also take into account that a player’s possession score is influenced by both their linemates as well as the quality of competition (QoC). These stats can always be adjusted to reflect different game scenarios, like whether the team was up or down by a goal at the time, etc.
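The difference between the two metrics is easiest to see on a small event log. In this Python sketch the log, the outcome labels, and the function name are all made up for illustration.

```python
# Hypothetical event log: (team that took the attempt, outcome of the attempt).
attempts = [
    ("HOME", "on_net"), ("HOME", "missed"), ("HOME", "blocked"), ("HOME", "on_net"),
    ("AWAY", "on_net"), ("AWAY", "blocked"),
]

def for_percentage(attempts, team, include_blocked):
    """Share of shot attempts taken by `team`, optionally dropping blocked ones."""
    kept = [t for t, outcome in attempts if include_blocked or outcome != "blocked"]
    return 100.0 * kept.count(team) / len(kept)

corsi_for_pct = for_percentage(attempts, "HOME", include_blocked=True)     # all attempts
fenwick_for_pct = for_percentage(attempts, "HOME", include_blocked=False)  # unblocked only
print(round(corsi_for_pct, 1), round(fenwick_for_pct, 1))
```

The same filtering pattern could be extended to keep only events from a particular score state, which is one way the situational adjustments mentioned above might be approached.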

Measuring puck possession in hockey makes sense, because the team that has the puck on their stick more often controls the play. Granted, Corsi/Fenwick are far from perfect, and the team with the better possession metrics doesn’t always come out ahead. But at the very least, including all shot attempts offers a much larger sample size of data than traditional stats, and provides a solid foundation for further analysis.

Zone Starts (ZS%): this measures the proportion of the time that a player starts a shift in each area of the ice (offensive zone vs. defensive zone). A ZS% of greater than 50% tells us that the player is deployed in offensive situations more frequently than defensive situations. This is important because it gives us insight into a player’s usage, or in what scenarios he is normally deployed by his team’s coach. It also provides context for interpreting a player’s Corsi/Fenwick. Players who are more skilled offensively will tend to have a higher ZS% because they give the team a better chance to take advantage of the offensive zone faceoff and generate scoring opportunities. At the very least, ZS% can be used to get a glimpse at how a coach favors a player’s skillset.
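As a quick worked example, ZS% can be computed from a player's shift-start counts. This Python sketch assumes neutral-zone starts are simply left out of the calculation; the function name and the counts are hypothetical.

```python
def zone_start_pct(offensive_zone_starts, defensive_zone_starts):
    """Percentage of non-neutral shift starts that began in the offensive zone."""
    total = offensive_zone_starts + defensive_zone_starts
    return 100.0 * offensive_zone_starts / total

# Invented usage profiles for two players.
sheltered_scorer = zone_start_pct(330, 220)
shutdown_defender = zone_start_pct(180, 320)
print(sheltered_scorer, shutdown_defender)
```

A value well above 50 suggests offence-first deployment, which is worth keeping in mind before crediting the player alone for a strong Corsi.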

Intro on 5v5 Isolated Stats and Repeatability

Oftentimes, you will see those who work with hockey analytics cite a player's stats solely at even strength, or 5v5. Why? There are a few reasons.

First, 5v5 obviously takes up most of the hockey game. If a player is valuable to his team at 5v5, he will be valuable to a team for more time throughout the game, and this should be seen as a large positive. A player's power play contributions are certainly valuable to a team, but often over-valued. Next, the game is played very differently in different game states. It would be wildly unfair to penalty killers to have their penalty kill stats included in their overall line, as more goals against are scored on the penalty kill, even for the best penalty killers. Separating these statistics helps provide a more complete picture of the player's skillset and the value that they have contributed to their team. Finally, 5v5 stats are generally regarded as the most repeatable, partially due to the larger sample. While players' PP and PK stats can vary widely from year to year, 5v5 stats typically remain relatively stable (read more at PPP here if you like).

In addition, primary points (goals and first assists) have been regarded as relatively repeatable stats, so be on the lookout for players who have many secondary assists to possibly have their point totals regress in the future (read more on this here).
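Separating primary points from secondary assists is simple arithmetic. Here is a minimal Python sketch; the players and totals are invented, the class name is hypothetical, and the 40% cutoff used to flag regression candidates is an arbitrary assumption rather than an established threshold.

```python
from dataclasses import dataclass

@dataclass
class ScoringLine:
    name: str
    goals: int
    first_assists: int
    second_assists: int

    @property
    def points(self):
        return self.goals + self.first_assists + self.second_assists

    @property
    def primary_points(self):
        return self.goals + self.first_assists

    @property
    def secondary_share(self):
        """Fraction of total points that came from second assists."""
        return self.second_assists / self.points if self.points else 0.0

players = [
    ScoringLine("Player A", goals=30, first_assists=25, second_assists=10),
    ScoringLine("Player B", goals=15, first_assists=15, second_assists=30),
]

# Flag players whose totals lean heavily on secondary assists (assumed cutoff).
regression_watch = [p.name for p in players if p.secondary_share > 0.4]
print(regression_watch)
```

Player A and Player B finish within five points of each other, yet only Player B's total depends mostly on the less repeatable kind of point.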

Intro to Comparison Tools

One of the areas that has most benefited from hockey analytics is the domain of player comparison. One of the best and most intuitive tools is the HERO chart, as pioneered by Domenic Galamini Jr (@MiminoHero). The HERO chart is a quick comparison of how players stack up across ice time, goal scoring, primary assists, shot generation and shot suppression. At a single glance, we can get a sense of the strengths and contributions of different players. Here we compare Sidney Crosby to Connor McDavid:

hero.png

We can see that Crosby is better at goal-scoring and shot generation, while McDavid is better at primary assists and shot suppression.

To compare any two players of your choice, or to compare a player to a positional archetype like First-Line Centre or Second-Pair Defender, you can use Galamini’s website: http://ownthepuck.blogspot.ca/. These comparisons can be used to enhance understanding of a player’s skill set, inform debates, and evaluate moves made by NHL teams, among other uses.

All-3-Zone Data Visualizations:

While a HERO chart is an all-encompassing snapshot of a player's contributions on the ice, the All-Three-Zones visuals are concerned with more specific aspects of the game. CJ Turtoro (@CJTDevil) created two sets of visuals using data from Corey Sznajder’s (@ShutdownLine) massive tracking project.

You can find both sets of visuals at the links below:

  1. https://public.tableau.com/profile/christopher.turtoro#!/vizhome/ZoneTransitionsper60/5v5Entries

  2. https://public.tableau.com/profile/christopher.turtoro#!/vizhome/2-yearA3ZPlayerComps/ComparisonDashboard

In the first set of visuals, you will find 4 leaderboards. Players are ranked in the 5v5 stats listed below.

  • 5v5 Entries -- How often players enter the offensive zone by making a clean pass to a teammate (Entry passes/60) or by carrying the puck across the blue line themselves (Carry-ins/60).

Other notes: The best way to enter the zone is to enter with possession of the puck (Entry passes + Carry-ins, as discussed above). These types of entries are called Possession Entries. Although other types of attempts are included in the leaderboard as well, players are automatically sorted by Possession Entries/60 because these alternative attempts are less than ideal. If you decide to change this, use the “Sort By (Entries)” filter to rank the players in other ways.

  • 5v5 Exits -- This is the same as 5v5 entries, except at the blue line separating the defensive zone from the neutral zone. Players are ranked based on how often they transition the puck from the defensive zone into the neutral zone either by carrying it (Carries/60) or by passing it to a teammate (Exit Passes/60).

Other notes: Like 5v5 entries, the best ways to exit the defensive zone are classified as Possession Exits. This is why players are automatically sorted by Possession Exits/60. Again, the “Sort By (Exits)” filter will let you change how the leaderboard is sorted.

  • 5v5 Entries per Target (5v5 Entry Def %) -- This stat measures defence at the blue line. It answers the question: When a defender is in proximity to an attempted zone entry, how often does he stop the attempt?

Other Notes: It is important to note that a “defender” is any player on the team playing defence (i.e. the team without the puck). Forwards are included in this definition of defender, but the best way to use this leaderboard is to judge defensemen only. This is why forwards are automatically filtered out of the leaderboard, but you can always change this using the filter if you wish.

  • 5v5 Shots and Passes -- Players are ranked based on how often they contribute to shots. A player contributes to a shot by being the shooter or by making one of the three passes immediately before it, in the same way that a player earns a point by scoring a goal or by making one of the two passes immediately before the goal.

If you want a closer look at certain groups of players, the filters allow you to look at players who play certain positions (forwards/defencemen) and players who play on certain teams. In the screenshot below, for example, I filtered the 5v5 Entries leaderboard to see what it looks like for forwards on the Oilers:

entries:60.png

You can use these leaderboards to judge offence (5v5 entries, 5v5 shot contributions), and defence (5v5 exits, 5v5 Entry Def %). Ultimately, these four leaderboards will help you identify the best and worst players in these areas.
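The "/60" rates used throughout these leaderboards scale raw counts to a common 60 minutes of ice time. A small Python sketch of a Possession Entries/60 leaderboard follows; the players, minutes, and counts are invented, and the field names are hypothetical rather than taken from the tracking data.

```python
def per_60(count, minutes):
    """Convert a raw count into a rate per 60 minutes of ice time."""
    return 60.0 * count / minutes

# Made-up tracked 5v5 totals for three forwards.
tracked = [
    {"player": "Forward A", "minutes": 300.0, "entry_passes": 40, "carry_ins": 110},
    {"player": "Forward B", "minutes": 250.0, "entry_passes": 30, "carry_ins": 45},
    {"player": "Forward C", "minutes": 400.0, "entry_passes": 60, "carry_ins": 100},
]

# Possession entries are entry passes plus carry-ins.
for row in tracked:
    possession_entries = row["entry_passes"] + row["carry_ins"]
    row["possession_entries_per_60"] = per_60(possession_entries, row["minutes"])

leaderboard = sorted(tracked, key=lambda r: r["possession_entries_per_60"], reverse=True)
for row in leaderboard:
    print(row["player"], row["possession_entries_per_60"])
```

Forward C has the most raw possession entries but ranks second, which shows why rates are preferred over counts when ice time differs.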

In order to focus on one or two players, you should use the second set of visuals: The A3Z Player Comparison Tool. While HERO charts allow for player comparisons in stats collected by the NHL, this visualization was designed to help you judge players based on their performance in several stats from the tracking project. Instead of standard deviations, however, the measurement of choice in this comparison tool is percentiles. So keep in mind that “100” means the result is better than 100% of the other results. You can view a player's results in two 1-year windows and one 2-year window, covering the 2016-17 season and the 2017-18 season. Here’s a two-year snapshot of how Erik Karlsson and Sidney Crosby rank in some of these key stats:
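One simple way to turn a raw stat into a percentile of this kind is to count how many of the other results it beats. The Python sketch below is an assumption about the general idea only; the tool's exact percentile method is not stated here, and the values and function name are made up.

```python
def percentile_rank(value, all_values):
    """Percent of the other results that `value` beats, from 0 to 100.

    Assumes `value` appears in `all_values`.
    """
    others = list(all_values)
    others.remove(value)  # compare against everyone else
    if not others:
        return 100.0
    beaten = sum(1 for v in others if v < value)
    return 100.0 * beaten / len(others)

# Invented Possession Entries/60 values for five players.
entries_per_60 = [12.0, 18.0, 24.0, 30.0, 21.0]
for v in entries_per_60:
    print(v, percentile_rank(v, entries_per_60))
```

Under this definition the best result in the group scores 100 and the worst scores 0, matching the reading of “100” given above.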

a3z.png

You probably noticed that the stats for forwards and defencemen are slightly different. The only difference is that defencemen have three extra categories, which measure their ability (or lack thereof) to defend their own blue line (i.e. their 5v5 Entries per Target, as discussed in the previous section). You may have also noticed some useful information hidden beneath each player's name, including the numbers of games and minutes that have been tracked for the player. Although the numbers in the screenshot above are from two seasons, another thing to keep in mind is that you can also compare a player's development over two seasons by looking at their stats in one-year windows. To see what I mean, take a look at Nikita Zaitsev’s numbers in two consecutive seasons:

zaitsev.png

Visualizing the dramatic fall of Nikita Zaitsev in this way is an excellent starting point for further analysis. Likewise, you can also compare two different players in the same season or over two seasons. This is, after all, a Player Comparison Tool. Other common uses for both sets of A3Z visualizations are to identify strengths and weaknesses of certain players, to evaluate potential acquisitions, to design the optimal lineup for your favourite team, and many more.

Of course, there are countless other useful terms and concepts to consider in analytics, like relative stats, shot quality, and expected goals (xG), which we’ll be touching upon more in-depth in future articles. If you’re interested in advanced stats and would like to learn more, we’ll be putting out more content on exciting topics in hockey analytics over the coming months, so stay tuned.


Keep up to date with the Queen's Sports Analytics Organization. Like us on Facebook. Follow us on Twitter. For any questions or if you want to get in contact with us, email qsao@clubs.queensu.ca, or send us a message on Facebook.