
Adjusting Historical Player Performance Using Talent Distribution

Goal rates via hockey-reference.com
Historical player data from hockeyabstract.com


**Part 2 Includes New Formula For Adjusted Points and Initial HOF Tableau viz**
http://threepointgames.blogspot.com/2018/01/era-adjustments-continued_2.html

**Part 3 Includes HOF likelihood score and Tableau with all 100 seasons and more accurate data**
http://threepointgames.blogspot.com/2018/01/era-adjustments-part-3.html



One of the biggest questions hockey fans struggle to answer is how to compare players and teams across different eras. The style of hockey in the high-scoring 80s is certainly different from that of the dead puck era in the 90s, and both are remarkably different from the current NHL of the salary cap era. Over these periods, drastic changes have taken place around the league, from new rules to expansion, and as a result, we struggle when attempting to compare old and new achievements and records.

Many explanations have been given for the changing scoring rates by season, and one of the most viable is the expansion effect. As the league grew, talent became much more dispersed, significantly widening the gap between the best and the worst players around the league. This raised scoring and subsequently gave top talents a platform to dominate. Over time, rule changes and the increased popularity of hockey have provided a talent pool deep enough that more teams can be sustained, and the talent gap has decreased. This explanation makes logical sense and can be backed up by big-picture analysis, but there have been no efforts to quantify these changes.

In this article, I attempt to answer how talent distribution has changed in the league, and through this how to adjust player performances so that we can better compare feats such as Wayne Gretzky’s 200+ point record-setting seasons to the dominant 100+ point seasons we have seen from Sidney Crosby.

To measure talent distribution, I decided to look to economics, where a similar question is frequently asked: how is wealth distributed? In economics, one of the most popular ways to answer this question is with a Lorenz curve. As shown in the image to the right, a Lorenz curve measures wealth distribution by plotting the cumulative percentile of people, ordered by wealth, against the cumulative share of income earned by that group. A Lorenz curve is always scaled from 0% to 100% on both the x-axis and the y-axis. In a system of perfect equality (everyone possesses an equal share of wealth), the curve is a straight 45-degree line, known as the line of equality. As inequality increases, the line becomes bowed inward. In this example, we see that each of the four quartiles holds a different share of national income. This shape is similar to that of all countries (especially the United States), where the bottom 25% of people hold nowhere near as much wealth as the top 25% of people.

By applying this curve to seasonal player-level point production, we can examine how scoring is distributed amongst players. Using data from hockeyabstract.com from the 1967-68 season to the 2015-16 season, we can see how scoring distribution has changed over time. All players who played over 40 games in a given season were used in the distribution, and the two lockout-shortened seasons were removed. When examining individual seasons on their own, it is difficult to make out any trends over time. Overall, the NHL, as expected, has an unequal distribution of points. Generally, the bottom 50% of players are responsible for around 25% of points, the next 25% of players for around 25% of points, and the top 25% of players for around 50% of points. This does fluctuate over seasons, and by plotting all seasons together we can begin to see how the league has changed.
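As a rough illustration of how a Lorenz curve can be built from a season's point totals, here is a minimal Python sketch. The function name `lorenz_curve` and the sample point totals are hypothetical placeholders, not the code or data used for this post, and the 40-game filter is assumed to have been applied already.

```python
def lorenz_curve(points):
    """Return (player_share, point_share) lists for a Lorenz curve.

    Both lists start at 0.0 and end at 1.0. `points` is a list of
    season point totals, one entry per player.
    """
    ordered = sorted(points)          # lowest scorers first
    total = sum(ordered)
    n = len(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need at least one player and a positive point total")

    player_share = [0.0]
    point_share = [0.0]
    running = 0
    for i, pts in enumerate(ordered, start=1):
        running += pts
        player_share.append(i / n)            # cumulative share of players
        point_share.append(running / total)   # cumulative share of points
    return player_share, point_share


# Made-up point totals for five players (illustration only).
season_points = [5, 10, 15, 20, 50]
xs, ys = lorenz_curve(season_points)
for x, y in zip(xs, ys):
    print(f"bottom {x:.0%} of players -> {y:.0%} of points")
```

Plotting `ys` against `xs` for each season, alongside the 45-degree line of equality, gives the kind of chart described above.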

In this next graph, it is still difficult to make out trends, but the distribution of points does demonstrate more inequality in later seasons. In fact, if you look closely, the 80s and 90s are responsible for the widest distributions, and in the 2000s the curve actually straightens back towards the shape seen in the original expansion years. These overall trends coincide with general hockey history, and expansion and related changes do appear to have an effect on talent distribution. To look closer at this, we can plot the distributions together to view how they change over time.

By plotting multiple seasons together to see change over time, we get a much better indication of historical trends. During the 80s and 90s, we can see a clear increase in the curvature, and during the early 2000s the Lorenz curves begin to contract. Interestingly, the style of play during a given period does not seem to reduce inequality. The start of the dead puck era does not indicate a decline in contributions from elite players and instead suggests the opposite. Overall, these Lorenz curves provide a great visualization of the changing eras and seem to confirm theories that during the initial expansion years, the increased league size and related changes had a noticeable effect on the distribution of players.



Another feature of the Lorenz curve is a measure known as the Gini coefficient (or Gini ratio). This provides a single number that summarizes the inequality for a given curve. It is calculated as the area of A (between the Lorenz curve and the line of equality) divided by the area of A + B (the total area between the line of equality and the axes). In a system with a perfectly equal distribution, the Lorenz curve would be the same as the line of equality and the area of A would be 0, thus yielding a Gini coefficient of 0. In a perfectly unequal distribution in which one person has all the wealth, the Gini coefficient would be 1. Although this measure has its drawbacks, such as the fact that differently shaped curves can yield the same area, it is an effective overall measure. To see the Gini coefficient for different countries, check out this list from the CIA World Factbook. For a few examples, the US is estimated most recently at 0.45, Canada at 0.32, and Sweden at 0.25.
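To make the area definition concrete, here is a small Python sketch that estimates the Gini coefficient from a list of point totals. It assumes the area under the Lorenz curve (B) is approximated with trapezoids, and uses the fact that A + B is always 0.5, so A / (A + B) = 1 - 2B. The function name `gini` and the sample inputs are illustrative assumptions rather than the original calculation.

```python
def gini(points):
    """Estimate the Gini coefficient of a list of point totals.

    Uses the trapezoid rule for the area B under the Lorenz curve.
    Because A + B = 0.5, the ratio A / (A + B) equals 1 - 2 * B.
    """
    ordered = sorted(points)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need at least one player and a positive point total")

    area_b = 0.0
    prev_share = 0.0
    running = 0
    for pts in ordered:
        running += pts
        share = running / total
        # Each player adds a trapezoid of width 1/n under the curve.
        area_b += (prev_share + share) / (2 * n)
        prev_share = share
    return 1 - 2 * area_b


# Made-up examples: an even split and a single dominant scorer.
print(gini([10, 10, 10, 10]))
print(gini([0, 0, 0, 10]))
```

Note that with a finite number of players the most unequal case (one player holding every point) gives 1 - 1/n rather than exactly 1; the value approaches 1 as the number of players grows.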

Using the Gini coefficient with the previously calculated Lorenz curves, we can make several observations. The first view is of the Gini coefficient for each season, accompanied by a horizontal line at the average value. From 1967 to 1976 the league grows from 6 to 18 teams, and the coefficient shows modest increases but overall remains essentially unchanged. This goes against the original conclusions about the immediate effects of expansion. During the next era, from 1976 to 1995, we observe the greatest increases. This era is also when the talent gap is generally described as being the most prominent. The changing distribution of talent is arguably still a result of expansion; however, this graph suggests the effect was not as immediate as assumed. Moving on, from the dead puck era to the present day, the Gini coefficient begins its decline, which reflects the general consensus amongst fans and writers that the talent gap is shrinking.


I originally believed that the distribution of talent could be explained by goal scoring rates, but interestingly, there is little relation between seasonal goals per game and the Gini coefficient for a season (R^2 of 0.07). In fact, the two measures are slightly inversely related, suggesting as goal scoring increases the distribution of talent becomes more even, but once again this is a very weak relationship.
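For readers who want to run the same kind of check, the sketch below computes the Pearson correlation and R^2 between two per-season series. For a simple one-variable linear fit, R^2 is the square of the correlation coefficient. The function name and the numbers are made up for illustration and are not the actual league values.

```python
def correlation_and_r_squared(x, y):
    """Return (r, r_squared) for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("each sequence must vary")
    r = sxy / (sxx * syy) ** 0.5
    return r, r * r


# Hypothetical goals-per-game and Gini values for five seasons.
goals_per_game = [6.0, 7.5, 8.0, 5.5, 5.4]
gini_by_season = [0.36, 0.35, 0.38, 0.37, 0.34]
r, r2 = correlation_and_r_squared(goals_per_game, gini_by_season)
print(f"r = {r:.3f}, R^2 = {r2:.3f}")
```

A negative `r` paired with a small `r2` would correspond to the weak inverse relationship described above.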



===


Now that we have observed that talent distribution is, in fact, a changing variable throughout NHL history, we will take a new approach to compare players across eras. Using this knowledge we can try to better answer: How would Wayne Gretzky’s performance relative to his peers look if he played in the NHL today?

This question will be answered by adjusting players’ point totals by both the goal scoring rate of the season and the distribution of scoring in the season. Since we have observed that these two measures are almost independent, the adjustment accounts for the two factors separately. The scoring adjustment uses the goal scoring rate of each season, scaled to the most recent season in the data, 2015-16. This helps frame the numbers in a context recognizable to fans today. The distribution adjustment was done differently. For each season, I plotted the proportion of goals scored by each percentile, as shown in this example. Since the percentile groups were of different sizes, I smoothed the proportions using LOESS regression. This also dampened extreme performances so Wayne Gretzky wouldn’t be penalized for his outlier seasons.
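The scoring-rate part of the adjustment can be sketched in a few lines of Python. This assumes the multiplier is simply the base season's goals per game divided by the given season's goals per game; the function names and the goals-per-game figures are placeholders, not real league values or the exact code behind this post.

```python
def scoring_multiplier(season_goals_per_game, base_goals_per_game):
    """Scale factor that restates a season's scoring in base-season terms."""
    if season_goals_per_game <= 0:
        raise ValueError("goals per game must be positive")
    return base_goals_per_game / season_goals_per_game


def scoring_adjusted_points(points, season_goals_per_game, base_goals_per_game):
    """Apply the scoring-rate multiplier to a player's point total."""
    return points * scoring_multiplier(season_goals_per_game, base_goals_per_game)


# Placeholder rates: a high-scoring season restated against a lower-scoring base.
high_scoring_season = 4.0   # hypothetical goals per game per team
base_season = 3.0           # hypothetical base-season value
print(scoring_adjusted_points(200, high_scoring_season, base_season))
```

A season with more scoring than the base year gets a multiplier below 1, and a lower-scoring season gets a multiplier above 1.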

Once each season was calculated, I computed a multiplier for each percentile by comparing the base year to the year in question, simply dividing the base curve by the given year's curve. In this example, we can see the 1967-68 distribution (first season) compared to the 2015-16 distribution (most recent season). When the 1967-68 curve is above the base season curve (2015-16), the multiplier is less than one, signifying that the specific percentile of players contributed more points than that percentile would in the base season. When the multiplier is greater than one, it indicates that a player performing at that percentile contributed fewer points than if they were playing in the base season. When this multiplier is applied to every player in a given season, the overall distribution of points for that season will resemble the distribution of points in the base year. Since it is done at the percentile level, the adjustments take into account the performance of multiple players, so a player isn’t penalized as much for their own performance.
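Here is a minimal Python sketch of the percentile multiplier step, assuming each season has already been reduced to a list of smoothed point shares, one per percentile bin and ordered from lowest to highest scorers. The four-bin shares, function names, and player values below are invented for illustration; the real calculation used finer percentiles and LOESS-smoothed curves.

```python
def percentile_multipliers(base_shares, season_shares):
    """Divide the base-season share by the given season's share, bin by bin."""
    if len(base_shares) != len(season_shares):
        raise ValueError("both seasons must use the same percentile bins")
    if any(s <= 0 for s in season_shares):
        raise ValueError("season shares must be positive")
    return [b / s for b, s in zip(base_shares, season_shares)]


def distribution_adjusted_points(points, bin_index, multipliers):
    """Scale a player's points by the multiplier for their percentile bin."""
    return points * multipliers[bin_index]


# Invented shares of total points for four bins (each list sums to 1).
base_shares = [0.10, 0.20, 0.30, 0.40]     # more even base season
season_shares = [0.05, 0.15, 0.30, 0.50]   # more top-heavy season
multipliers = percentile_multipliers(base_shares, season_shares)
print(multipliers)

# A top-bin player in the top-heavy season is scaled down.
print(distribution_adjusted_points(150, 3, multipliers))
```

In this toy case the top bin held a larger share in the given season than in the base season, so its multiplier falls below 1, while the bottom bins are scaled up.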



When graphing just these findings, we learn that the era adjustment is not too extreme in its magnitude, but still has a strong effect. Since the 2015-16 season had a rather even distribution, most players in the historic sample have a multiplier of less than 1, meaning their adjusted PTS were less than their actual PTS. Players during the 70s were usually the recipients of the increases. Overall, the era adjustment had a maximum multiplier of approximately 3 and a minimum of approximately 0.7. These extremes were only present in players in the first few percentiles, where a 1 point season was changed to 3 points and vice versa. Amongst prominent players, the adjustment stayed within +/- 40 points, with multipliers between 0.8 and 1.3.


When combined with the typical adjustment for goal scoring rates, we can begin to get more realistic comparisons between eras. Overall, the largest increase went to Martin St. Louis, with 33 points added to his 03-04 total, bringing it from 94 to 127. On the other hand, the subtractions were much more significant. Gretzky and Lemieux lost 82 points each in their 81-82 and 88-89 seasons respectively, bringing their totals to 130 and 117. These point totals seem much more achievable, yet still reflect the dominance these players showed in their careers.



Overall, this combined adjustment does a great job of improving our comparisons between players from different eras, although the multipliers are not strong enough to avoid “favoring” historic contributions by high scoring players of the '80s and '90s (or the top players today are just relatively worse). The adjustment definitely has room to be improved, but I believe this helps our understanding of how players performed in a given season compared to the current NHL.

In future renditions, there are several improvements I hope to make. For one, applying the goal scoring multiplier to points assumes that points scale evenly with goals, which I am not too confident in. Finding the points/GP rate and applying that could help for specifically studying PTS. I also hastily applied the same approach to goals as I did for points, and found interesting results. In the future perhaps I’ll use this methodology to see who the best goal-scorer of all time is. I also hope to refine the method for calculating the multipliers, as I would like to make sure each percentile is reflective of the league distribution rather than being too affected by an individual player. The NHL also recently released updated data that is more accurate and goes further into the past. Using this dataset will improve the calculation accuracy and allow for the full 100-year history of the league to be analyzed, rather than the last 50. I’d also like to add the 16-17 season.

Below are top 20 lists for PPG, career PTS and individual season PTS. Note that goaldif + PTS.dif does not equal netdif, as they are the results of separate multipliers. For example, a 100 point season with multipliers of 0.5 and 0.5 would have two differences of -50 points each, but the net difference would be -75 (total 25 pts.), not -100 (total 0). Also note that Malkin, Iginla, and Thornton were left off of the NHL top 100.
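The arithmetic in that example can be checked with a short Python sketch. The function and variable names here are hypothetical; they only mirror the idea that each difference is computed against the raw total while the net result applies both multipliers together.

```python
def combined_adjustment(points, goal_multiplier, distribution_multiplier):
    """Return (goal_dif, dist_dif, adjusted_points, net_dif) for one season."""
    goal_dif = points * goal_multiplier - points          # scoring-rate effect alone
    dist_dif = points * distribution_multiplier - points  # distribution effect alone
    adjusted = points * goal_multiplier * distribution_multiplier
    net_dif = adjusted - points
    return goal_dif, dist_dif, adjusted, net_dif


# The 100 point season with two multipliers of 0.5 from the text.
print(combined_adjustment(100, 0.5, 0.5))
# -> (-50.0, -50.0, 25.0, -75.0)
```

Because the multipliers compound, the two separate differences sum to -100 while the net difference is only -75.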

If you have any questions or comments, would like to see the complete data, my code, or any additional visuals, please email me at jflancer@wharton.upenn.edu or comment below!


Data from 1967-68 to 2015-16 Seasons, Ignoring Lockout Seasons ('94-95, '12-13)

