While browsing through various websites that keep NHL player stats, I realized that the league does a terrible job of keeping player positions up to date. I’m not exactly sure how or where these sites get their data from, but it is quite inaccurate. All sites do distinguish between forwards and defensemen, which is enough for most analysis, but I still think more specific player positions hold value, especially when looking at team depth and related areas. In an attempt to solve this problem, I decided to use k-means clustering on location information within play-by-play data (thanks to Emmanuel Perry and Corsica Hockey for making this cleaned data available to the public).
Clustering has been used pretty frequently in hockey analysis, most recently (I believe) by Alex Novet to identify different styles of goal scorers. It has also been used by Ryan Stimson to identify team and player styles with data collected from his passing project, and similarly on DTM About Heart’s old blog. More relevant to this post, Conor Tompkins used time on ice and primary points to distinguish between forwards and defensemen. There is also a Sloan Sports Analytics Conference research paper that uses location information in clustering, but with proprietary SPORTLOGIQ data.
Part 1: Even Strength
The first step in this analysis is to prepare the data. I’ll be using only 5v5 data, as this is the only state in which traditional positions are used by both teams. I then reduce the data to four variables per player to be used in the clustering. (In the conclusion I add faceoff attempts as a fifth.) The first pair is the mean x- and y-coordinates of all Fenwick events (shots on goal, missed shots, and goals) the player was responsible for. I chose to exclude blocked shots because of issues with their coordinates. The second pair is the mean x- and y-coordinates of a group of defensive zone events: hits, takeaways, and penalties taken. While certainly not all-inclusive, I think these three events reflect quite well where a player is positioned in their defensive zone, and this can be seen in the results.
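As a rough illustration of the four variables described above, here is a minimal Python sketch. It is not the code behind this post. The event codes, the field names (`player`, `type`, `x`, `y`, `zone`), and the idea that coordinates are already normalized so every team attacks in the same direction are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed event codes; the real play-by-play labels may differ.
FENWICK_EVENTS = {"SHOT", "MISS", "GOAL"}    # blocked shots are left out
DEFENSIVE_EVENTS = {"HIT", "TAKE", "PENL"}   # hits, takeaways, penalties taken


def mean_xy(points):
    """Average a list of (x, y) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def player_features(events):
    """Return {player: (shot_x, shot_y, def_x, def_y)}.

    Each event is a dict with the hypothetical keys
    'player', 'type', 'x', 'y' and 'zone'.
    """
    shots = defaultdict(list)
    defense = defaultdict(list)
    for ev in events:
        if ev["type"] in FENWICK_EVENTS:
            shots[ev["player"]].append((ev["x"], ev["y"]))
        elif ev["type"] in DEFENSIVE_EVENTS and ev["zone"] == "Def":
            # Only events recorded in the player's defensive zone count.
            defense[ev["player"]].append((ev["x"], ev["y"]))

    # Keep only players who have at least one event of each kind.
    return {
        player: mean_xy(shots[player]) + mean_xy(defense[player])
        for player in shots.keys() & defense.keys()
    }
```

In practice a minimum number of events per player would probably be needed as well, since a mean built from a handful of shots is very noisy.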
Now that the variables are set, k-means clustering can be used. The first step in k-means is to determine the number of clusters. I chose to use the elbow method, which plots the within-cluster sum of squares against the number of clusters and looks for the point where the curve forms a bend. I think 3 clusters is probably the best choice, but 5 clusters is certainly a reasonable choice, and since there are 5 traditional skater positions that is what I’ll be using.
We can now view the results of the five clusters. Each cluster was assigned a position: “LD”, “RD”, “LW”, “RW”, or “C”. The counts for each cluster are as follows:
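To make the elbow method concrete, here is a small self-contained Python sketch of k-means (Lloyd's algorithm) that reports the within-cluster sum of squares for each candidate number of clusters. It is an illustration rather than the code used for this post. In particular, the deterministic farthest-point initialization is my own simplification to keep the example reproducible; standard implementations usually use random or k-means++ starts with several restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Farthest-point initialization (an assumption, chosen for determinism)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point that is farthest from its nearest chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Return (centroids, labels, wcss) for a list of numeric tuples."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1..max_k."""
    return [(k, kmeans(points, k)[2]) for k in range(1, max_k + 1)]
```

Plotting the second element of each pair against k gives the elbow plot. Variables measured on very different scales would normally be standardized first, because k-means relies on raw Euclidean distance.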
Cluster   RW    LD    C     RD    LW
Players   109   130   166   118   129
I’ll go into more detail later, but the main problem with these results is that some wingers are misclassified as centers, which shows in the counts above: the C cluster holds 166 players, well more than either wing cluster. There are also a good number of wingers who appear to be shooting from their off side, which moves them into the cluster for the opposite wing from the one they are typically regarded as playing.
We can now view how the players clustered in both the defensive and Fenwick coordinate sets. Thanks to Prashanth Iyer for making available R code to draw a rink. Using only shot data does a terrific job of naturally creating three clusters in the form of left defensemen, right defensemen, and forwards. This is actually pretty much what k-means would have yielded had I chosen to use three clusters instead of five.
The additional separation among forwards is made much clearer using the defensive variables. With these coordinates, we can draw a clear distinction between wingers and centers. Overall, these results are surprisingly accurate. I couldn’t find an accurate dataset to test my findings on, but I went through Rotoworld’s depth charts and, for the most part, they match up.
Using code from Alex Novet’s aforementioned goal scorer clustering piece, we can also use hierarchical clustering (which yields the same groupings) to find the most representative players in each cluster. These are the players that are closest to the average for each cluster.
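The idea of "closest to the average for each cluster" can be sketched directly, without the hierarchical step. The Python below is a simplified stand-in and not Alex Novet's code: it assumes a hypothetical `features` mapping from player to a tuple of variables and a `labels` mapping from player to cluster name, computes each cluster's mean, and ranks players by distance to it.

```python
from collections import defaultdict


def most_representative(features, labels, top_n=3):
    """Return {cluster: [players closest to the cluster mean]}.

    features: {player: tuple of numeric variables}  (hypothetical input)
    labels:   {player: cluster name}                (hypothetical input)
    """
    groups = defaultdict(list)
    for player, label in labels.items():
        groups[label].append(player)

    result = {}
    for label, players in groups.items():
        # Cluster mean, one value per variable.
        center = tuple(
            sum(dim) / len(players) for dim in zip(*(features[p] for p in players))
        )
        # Rank members by squared Euclidean distance to that mean.
        ranked = sorted(
            players,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(features[p], center)),
        )
        result[label] = ranked[:top_n]
    return result
```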
Using the position data, I decided to look at how different lineups performed, similar to what Ryan Stimson did with his playing styles. Here are unadjusted results, showing only lineups with 150+ Fenwick events and including only players that were used in the clustering. As expected, the typical C-LW-RW-LD-RD lineup was by far the most frequent, showing teams do a pretty good job keeping players at the same position. This lineup also ranked as one of the top lineups, possibly showing that a portion of teams' struggles come when they are playing mid-change or without their pre-set lines. Otherwise, I think this is interesting to look at, but I’m not sure if any overarching conclusions can be drawn.
I also looked specifically at d-pairings. Teams do perform better when sticking with a traditional left and right defenseman. This makes sense logically and reinforces prior research by Dom Galamini showing the importance of handedness for defensemen.
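The lineup tally described above could be put together roughly as follows. This is a hedged Python sketch, not the code behind the results: the event format (a tuple of on-ice skaters plus a for/against flag) and the `positions` lookup are hypothetical, and the Fenwick share is unadjusted.

```python
from collections import defaultdict


def lineup_fenwick(events, positions, min_events=150):
    """Return {lineup: (total_events, fenwick_for_share)}.

    events:    iterable of (on_ice_skaters, is_for) pairs, one per Fenwick
               event, from one team's point of view (hypothetical format)
    positions: {player: clustered position such as "C" or "LD"}
    """
    totals = defaultdict(lambda: [0, 0])  # lineup -> [for, against]
    for skaters, is_for in events:
        # Skip events involving a skater who was not part of the clustering.
        if any(p not in positions for p in skaters):
            continue
        # Sort the labels so the same mix of positions always yields one key.
        key = "-".join(sorted(positions[p] for p in skaters))
        totals[key][0 if is_for else 1] += 1

    return {
        key: (f + a, f / (f + a))
        for key, (f, a) in totals.items()
        if f + a >= min_events
    }
```

With this keying the traditional lineup appears as "C-LD-LW-RD-RW", which is the same five positions as C-LW-RW-LD-RD in alphabetical order.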
Part 2: Power Play Spacing
I also used the same methodology to cluster on the power play (5v4). Since the positioning of interest is only in the offensive zone, only shot locations were used, and as a result, these findings are not as clear-cut as the even-strength ones. First, I chose to use four clusters. Two or three are similarly effective, but four allows the players to be grouped into categories that are more easily explained. I named the clusters “Point”, “Left Side”, “Right Side”, and “Net Front”. The shot locations for all categories are rather similar, and I think using passing data or a similar dataset would significantly improve this analysis. Nonetheless, I still think the results are at least interesting enough to share.
Unlike with the most representative players in the last section, here I decided to size the points by the quantity of shots taken and highlight the players in each cluster that had the most attempts. These players are pretty much as expected for each group. Ovechkin standing out on the left side is definitely reassuring.
Like at even strength, I also looked at how different lineups performed. This has much more promising results than the former, and I plan to look into this further in the future. With this set, I used the adjusted Fenwick values provided in the play-by-play data, which I believe adjust only for venue, though I’m not completely sure. Additionally, Corsi/Fenwick may not be the best indicator of power play results, but it is all that is available. The average among all groups was ~86%, which appears to be slightly lower than the league average using Corsica data. This is likely because only thirteen setups qualify (100+ attempts). The most typical setup was a point man, a player on the left boards, a player on the right boards, and two players in front. This seems to line up with a conventional power play setup. This setup does perform below the average among all groups, suggesting that this strategy may not be optimal. Three of the four lineups with two point players graded out among the lowest, which reinforces the advantage of using four forwards and one defenseman. Otherwise, it is pretty difficult to come to any other conclusions, but as I mentioned earlier, richer location data should significantly improve these results.
Conclusions
To summarize, even-strength clustering appears to provide a much better indicator of player position than most publicly available datasets. Since Corsica has made play-by-play data available for the past ten seasons, I was also able to go back and apply the same methodology to those years. To better analyze lineup combinations, I plan to combine all seasons in the future to get a bigger sample. All of the data is available here, as well as my code, although it is a slightly older version without many comments or much formatting.
As I highlighted earlier, hopefully more tracking/location data will lead to more helpful and indicative measures on the power play. With the even-strength data, one of the biggest problems I encountered was wingers being misclassified as centers. The shot charts showed no distinction among forward types, and considering the small samples, variance, and differing defensive responsibilities, wingers could very easily have had a mean closer to the center of the ice. To address this issue, I ran essentially the same clustering with the added variable of faceoff attempts. In total there were 91 discrepancies between the two clusterings (out of 652 players). All but one were a switch from center to wing or vice versa. At this point, it seems like a judgment call as to which one is better, and I’d appreciate ANY feedback in this regard; here’s a direct link. Also note that these players were originally misjudged, so they are likely towards the center in both zones and could very easily fall into any of the three forward positions.
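Counting the discrepancies between the two clusterings is straightforward once both sets of labels exist. A minimal Python sketch, assuming two hypothetical dictionaries that map each player to a position (one from the original four-variable clustering, one from the version with faceoff attempts added):

```python
from collections import Counter


def compare_assignments(base, with_faceoffs):
    """Return (number of changed players, Counter of (old, new) switches).

    base, with_faceoffs: {player: position} from the two clusterings.
    Only players present in both are compared.
    """
    switches = Counter()
    for player in base.keys() & with_faceoffs.keys():
        old, new = base[player], with_faceoffs[player]
        if old != new:
            switches[(old, new)] += 1
    return sum(switches.values()), switches
```

The counter makes it easy to see which kinds of switches dominate, for example how many players moved from "C" to "LW" versus from "RW" to "C". Since faceoff counts sit on a very different scale from rink coordinates, the variables would likely need to be standardized before clustering so that one of them does not dominate the distances.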
Otherwise, I’m very happy with my results and hope you enjoyed reading.