While browsing various websites that keep NHL player stats, I realized that the league does a terrible job of keeping player positions up to date. I’m not exactly sure how or where they get their data, but it is quite inaccurate. All sites do distinguish between forwards and defensemen, which is enough for most analysis, but I still think more specific player positions hold value, especially when looking at team depth and related areas. In an attempt to solve this problem, I decided to use k-means clustering on location information within play-by-play data (thanks to Emmanuel Perry and Corsica Hockey for making this cleaned data available to the public).
 
Clustering has been used pretty frequently in hockey analysis, most recently (I believe) by Alex Novet to identify different styles of goal scorers. Ryan Stimson has also used it to identify team and player styles with data collected from his passing project, and similar work appeared on DTM About Heart’s old blog. More relevant to this post, Conor Tompkins used time on ice and primary points to distinguish between forwards and defensemen. There is also a Sloan Sports Analytics Conference research paper that uses location information in clustering, but with proprietary SPORTLOGIQ data.
Part 1: Even Strength
The first step in this analysis is to prepare the data. I’ll be using only 5v5 data, as this is the only state in which traditional positions are used by both teams. I then refine the data to get four variables for each individual player to be used in the clustering. (In the conclusion I add faceoff attempts.) The first pairing is the mean x and y-coordinates of all Fenwick events (shots on goal, missed shots, and goals) the player was responsible for. I chose to exclude blocked shots because of issues with their coordinates. The second pairing is the mean x and y-coordinates of a grouping of defensive zone events: hits, takeaways, and penalties taken. While certainly not all-inclusive, I think these three events reflect quite well where a player is positioned in their defensive zone, and this can be seen in the results.
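The feature preparation described above can be sketched roughly as follows. This is a Python illustration and not the code behind this post: the event-type codes, the zone labels, and the dictionary keys are all hypothetical, and the events are assumed to be filtered to 5v5 already.

```python
from collections import defaultdict

# Hypothetical event codes; a real play-by-play feed may name these differently.
FENWICK_EVENTS = {"SHOT", "MISS", "GOAL"}    # blocked shots deliberately excluded
DEFENSIVE_EVENTS = {"HIT", "TAKE", "PENL"}   # hits, takeaways, penalties taken


def player_features(events):
    """Return {player: (fenwick_x, fenwick_y, defensive_x, defensive_y)}.

    Each event is a dict with the (assumed) keys "player", "type", "x", "y"
    and "zone". Players missing either group of events are skipped.
    """
    # Per player: [fenwick x sum, y sum, count, defensive x sum, y sum, count]
    sums = defaultdict(lambda: [0.0, 0.0, 0, 0.0, 0.0, 0])
    for ev in events:
        totals = sums[ev["player"]]
        if ev["type"] in FENWICK_EVENTS:
            totals[0] += ev["x"]
            totals[1] += ev["y"]
            totals[2] += 1
        elif ev["type"] in DEFENSIVE_EVENTS and ev["zone"] == "DEF":
            totals[3] += ev["x"]
            totals[4] += ev["y"]
            totals[5] += 1

    features = {}
    for player, (fx, fy, fn, dx, dy, dn) in sums.items():
        if fn and dn:
            features[player] = (fx / fn, fy / fn, dx / dn, dy / dn)
    return features


# Made-up events for two hypothetical players.
sample_events = [
    {"player": "A", "type": "SHOT", "x": 60, "y": -20, "zone": "OFF"},
    {"player": "A", "type": "GOAL", "x": 80, "y": -10, "zone": "OFF"},
    {"player": "A", "type": "HIT", "x": -70, "y": -30, "zone": "DEF"},
    {"player": "A", "type": "HIT", "x": 75, "y": 35, "zone": "OFF"},    # ignored: not in the defensive zone
    {"player": "A", "type": "BLOCK", "x": 50, "y": 0, "zone": "OFF"},   # ignored: blocked shot
    {"player": "B", "type": "SHOT", "x": 55, "y": 15, "zone": "OFF"},   # B has no defensive events
]
features = player_features(sample_events)
```

Player B is dropped here because there is nothing to average on the defensive side; how players with very few events should be handled is a separate judgment call that this sketch does not address.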
We can now view the results of the five clusters. Each cluster was assigned a position: “LD”, “RD”, “LW”, “RW”, or “C”. The counts for each cluster are as follows:
RW  LD   C  RD  LW 
109 130 166 118 129
I’ll go into more detail later, but the main problem with these results is that wingers are misclassified as centers, which shows up above as a center cluster that is larger than either wing cluster. There are also a good number of wingers who appear to be shooting on their off side, which moves them into the opposite wing cluster from the one they are typically associated with.
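For readers who want to see the mechanics, here is a minimal sketch of the clustering step using plain Lloyd's algorithm in Python. It is not the code used for the results above: the farthest-first seeding is a swapped-in choice to keep the example deterministic, the data rows are made up, and the toy example uses three clusters rather than five to stay short.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then keep adding the
    point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so stop
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up (fenwick_x, fenwick_y, defensive_x, defensive_y) rows.
points = [
    (40, -25, -75, -20), (42, -23, -77, -22),   # shoots and defends on one side
    (41, 24, -76, 21), (43, 26, -74, 19),       # shoots and defends on the other side
    (70, 0, -60, 0), (72, 2, -62, -1),          # shoots from in close, defends up high
]
labels, centroids = kmeans(points, k=3)
cluster_sizes = Counter(labels)
```

The cluster numbers that come out are arbitrary, so naming them “LD”, “RD”, and so on still has to be done by looking at where each centroid sits on the ice.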
We can now view how the players clustered in both the defensive and Fenwick coordinate sets. Thanks to Prashanth Iyer for making available R code to draw a rink. Using only shot data does a terrific job of naturally creating three clusters in the form of left defensemen, right defensemen, and forwards. This is actually pretty much what k-means would have yielded had I chosen to use three clusters instead of five.
The additional separation amongst forwards is made much clearer by the defensive variables. With these coordinates, we can draw a clear distinction between wingers and centers. Overall, these results are surprisingly accurate. I couldn’t find an accurate dataset to test my findings against, but I went through Rotoworld’s depth charts and, for the most part, they match up.
Using code from Alex Novet’s aforementioned goal scorer clustering piece, we can also use hierarchical clustering (which yields the same groupings) to find the most representative players in each cluster. These are the players that are closest to the average for each cluster.
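The idea of a "most representative" player can be illustrated with a short Python sketch that ranks each cluster's members by distance to the cluster mean. This is not the hierarchical clustering code referenced above, only a direct reading of "closest to the average for each cluster"; the function name, player names, and coordinates are hypothetical.

```python
import math


def most_representative(features, labels, top_n=1):
    """Return {cluster: [players closest to that cluster's mean]}.

    features: {player: tuple of coordinates}
    labels:   {player: cluster name}
    """
    # Group players by their assigned cluster.
    groups = {}
    for player, cluster in labels.items():
        groups.setdefault(cluster, []).append(player)

    result = {}
    for cluster, players in groups.items():
        vectors = [features[p] for p in players]
        # Cluster average, one value per coordinate.
        centroid = tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
        ranked = sorted(players, key=lambda p: math.dist(features[p], centroid))
        result[cluster] = ranked[:top_n]
    return result


# Made-up two-dimensional features for four hypothetical players.
toy_features = {
    "Player A": (0.0, 0.0),
    "Player B": (4.0, 0.0),
    "Player E": (5.0, 0.0),
    "Player D": (9.0, 9.0),
}
toy_labels = {"Player A": "C", "Player B": "C", "Player E": "C", "Player D": "LW"}
representatives = most_representative(toy_features, toy_labels)
```

In the toy data the "C" cluster averages to (3, 0), so Player B, one unit away, is its most representative member.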
Part 2: Powerplay Spacing
Conclusions
To summarize, even strength clustering appears to provide a much better indicator of player position than most publicly available datasets. Since Corsica has made play-by-play data available for the past ten seasons, I was also able to go back and apply the same methodology to those years. To better analyze lineup combinations, I plan to combine all seasons in the future to get a bigger sample. All of the data is available here, as well as my code, although it is a slightly older version without many comments or much formatting.
As I highlighted earlier, more tracking/location data will hopefully lead to more helpful and indicative measures on the powerplay. With the even strength data, one of the biggest problems I encountered was wingers being misclassified as centers. The shot charts showed no distinction among forward types, and considering the small samples, variance, and differing defensive responsibilities, wingers could very easily have had a mean closer to the center of the ice. To address this issue, I ran essentially the same clustering with the added variable of faceoff attempts. In total there were 91 discrepancies between the two clusterings (out of 652 players). All but one were a switch from center to wing or vice versa. At this point, it seems like a judgment call as to which one is better, and I’d appreciate ANY feedback in this regard; here’s a direct link. Also note that these players were originally misjudged, so they are likely towards the center in both zones and could very easily fall into any of the three forward positions.
Otherwise, I’m very happy with my results and hope you enjoyed reading.