Glossary:
Play-in teams are designated in italics.
These predictions are based on the team ratings, team-specific home court advantages, and a time series model that predicts future performance based on previous games.
Welcome to EvanMiya College Basketball Analytics! The main objective of our work is to assess college basketball team and player strength. We have created an advanced statistical metric, Bayesian Performance Rating (BPR), which quantifies how effective a team or player is, using advanced box-score metrics and play-by-play data. This metric is predictive in nature, which means that each rating is fine-tuned to predict performance in future games.
There are several pages of analysis:
Now for some more detail into how we get these numbers:
We have box score data available for every game played in each college basketball season, along with play-by-play data, which includes substitutions. The possession-by-possession data is the main component used to drive our analysis.
One key step that we take to get the best predictions from our data is to only look at possessions in a game that “mattered”. Analyzing possessions when the game is already well out of hand isn't as valuable to us as analyzing possessions when the winner hasn't been decided yet. Through Luke Benz's R package ncaahoopR, we use the in-game naive win probability (which assumes that teams are equally matched) to assess when a game is out of hand. Once a team has a win probability of at least 99%, we start downweighting the possessions until the win probability is greater than 99.99%, at which point we discard possessions entirely. In the rare situation where the losing team mounts a comeback and the win probability of the leading team sinks below 99%, we start giving each possession full weight again.
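The weighting rule described above can be sketched in a few lines of Python. The text gives only the two thresholds (99% and 99.99%), not the shape of the downweighting between them, so the linear ramp below is purely an assumption for illustration, and the function name `possession_weight` is hypothetical.

```python
def possession_weight(win_prob, start=0.99, cutoff=0.9999):
    """Weight for one possession, given one team's naive win probability.

    ASSUMPTION: the weight falls linearly from 1 to 0 between `start` and
    `cutoff`. The source states only the two thresholds, not the curve.
    """
    # Use the leading team's probability, whichever side that is.
    leader_prob = max(win_prob, 1.0 - win_prob)
    if leader_prob < start:
        return 1.0  # game still in doubt: full weight
    if leader_prob > cutoff:
        return 0.0  # game decided: possession is discarded
    return (cutoff - leader_prob) / (cutoff - start)


for p in (0.50, 0.95, 0.995, 0.99999):
    print(p, round(possession_weight(p), 3))
```

Because the weight depends only on the current win probability, a comeback that pulls the leader back under 99% automatically restores full weight, which matches the behavior described above.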
From a coach's perspective, every possession matters, even when your team has seemingly won or lost with minutes to spare. However, for predictive purposes, we can't properly assess the strength of a team when both teams aren't putting their normal lineups in or aren't playing as hard as they might if the outcome of the game were still in question.
The purpose behind the Bayesian Performance Rating (BPR) at a team level is to give each team a true offensive and a true defensive rating that best explains all of the real game results that we observed during the season. These can be used, along with the ratings of the opposing team, to estimate each team's expected offensive and defensive efficiency (points scored per 100 possessions) in a game. Taking the possession-by-possession results from each game, and adjusting for home court advantage (more on that in a moment), we run a Bayesian regression to find the offensive (OBPR) and defensive (DBPR) coefficients for each team. These coefficients are designed to have 0 as the national average. Thus, very good teams will have higher positive offensive and defensive ratings. A team's overall BPR is just the sum of its OBPR and DBPR.
For example, from the 2019-20 season, 4th ranked Baylor's calculated OBPR was 30.2, and their DBPR was 35.9. On the other hand, 319th ranked Idaho had an OBPR of -23.0 and a DBPR of -13.5.
In a neutral court setting, the expected efficiencies for the home and away teams can be calculated using the OBPR and DBPR as follows: \[ E[H_{OffEff}] = (H_O - A_D) / 2 + 100\] \[ E[A_{OffEff}] = (A_O - H_D) / 2 + 100\]
In the above formulas, \(H_O\) and \(H_D\) are the home team's Offensive BPR and Defensive BPR respectively, and \(A_O\) and \(A_D\) are the away team's OBPR and DBPR. To calculate the expected home team offensive efficiency, we take the home team's offensive rating, subtract the away team's defensive rating, then divide by 2 and add 100. (If you are unfamiliar with the notation, \(E[]\) just means “Expected”).
For example, if Michigan State is playing Kansas on a neutral court, and Michigan State has an Offensive BPR of 40 and a Defensive BPR of 30, and Kansas has an offensive rating of 30 and a defensive rating of 50, then we can calculate Michigan State's expected offensive efficiency as
\[ E[\textrm{MSU}_{OffEff}] = (40 - 50) / 2 + 100 = 95\]
and Kansas's expected offensive efficiency is
\[ E[\textrm{Kansas}_{OffEff}] = (30 - 30)/ 2 + 100 = 100\]
If Michigan State and Kansas get 70 offensive possessions each in a game, then the predicted score for Michigan State would be 95 * (70 / 100) = 66.5, and the predicted score for Kansas would be 100 * (70 / 100) = 70. So Kansas would be predicted to win by 3.5 points on a neutral court.
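As a quick check, the neutral-court arithmetic above can be reproduced with a short Python sketch. The function names are illustrative choices, not part of any published EvanMiya code.

```python
def expected_off_eff(off_bpr, opp_def_bpr):
    """Expected points per 100 possessions on a neutral court."""
    return (off_bpr - opp_def_bpr) / 2 + 100


def predicted_points(off_eff, possessions):
    """Convert an efficiency (per 100 possessions) into a point total."""
    return off_eff * possessions / 100


# Ratings from the Michigan State vs. Kansas example in the text.
msu_eff = expected_off_eff(off_bpr=40, opp_def_bpr=50)
kansas_eff = expected_off_eff(off_bpr=30, opp_def_bpr=30)

msu_pts = predicted_points(msu_eff, possessions=70)
kansas_pts = predicted_points(kansas_eff, possessions=70)

print(msu_eff, kansas_eff)   # 95.0 100.0
print(msu_pts, kansas_pts)   # 66.5 70.0
print(kansas_pts - msu_pts)  # 3.5
```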
On average, teams playing on their home court score about 3.3 points per 100 possessions better on both sides of the ball than if they were playing on a neutral court. So, as a starting point, we can automatically assume that a home team will have a performance boost of about 3.3 points per 100 possessions in both their offensive and defensive efficiencies. In a game with 68 possessions for each team, which is near the national average, this equates to a home court advantage worth about 4.5 points \(((3.3 + 3.3) * 68/100)\). This home court advantage estimate is slightly higher than other common estimates, because ours is based only on possessions that aren't in garbage time.
Some teams perform better at home than others, so we can find team-specific home court advantages using a Bayesian model with a prior mean of 3.3. We sometimes utilize these team-specific home court advantages when computing the team ratings.
Note: During the 2020-2021 season, this home court advantage is likely to be reduced due to lack of fans. As of right now, we are treating the home court advantage at 40% of its normal value.
When predicting game scores, we also want to adjust for the pace of the game instead of assuming there will be 70 possessions for each team. Similar to our model for calculating expected team efficiencies, we want to calculate the expected number of possessions in a game as follows:
\[ E[\textrm{Possessions}] = (H_T + A_T)/2\]
\(H_T\) is the “True Tempo” of the home team and \(A_T\) is the True Tempo of the away team. We simply take the average of the home team true tempo and the away team true tempo to predict the number of possessions each team will have in the game.
Let's tie all of these concepts together to predict the score of Dayton vs. Gonzaga in the 2019-2020 season, played on Dayton's home court. Note: our actual game prediction algorithm has a bit more complexity under the hood, but using the method below will get you pretty close to our prediction:
First, we will start by predicting Dayton's offensive efficiency in this game. Dayton has an OBPR of 43.1, and Gonzaga has a DBPR of 21.6. On a neutral court, we would expect Dayton's offensive efficiency (points per 100 possessions) in this game to be
\[ E[\textrm{Dayton}_{OffEff}] = (43.1 - 21.6)/2 + 100 = 110.8\]
The Zags have an OBPR of 56.2 and the Flyers have a DBPR of 19.7, which leads to
\[ E[\textrm{Gonzaga}_{OffEff}] = (56.2 - 19.7)/2 + 100 = 118.3\]
To adjust for Dayton's home court advantage, we add 3.3 points per 100 possessions to Dayton's offensive efficiency and subtract 3.3 from Gonzaga's, which gives us
\[ E[\textrm{Dayton}_{OffEff}] = 110.8 + 3.3 = 114.1\] \[ E[\textrm{Gonzaga}_{OffEff}] = 118.3 - 3.3 = 115.0\]
Now we need to predict how many offensive possessions each team will have. Dayton's True Tempo is 68.1 and Gonzaga's is 75.0. We take the average of these to get our expected possession count:
\[ E[\textrm{Possessions}] = (68.1 + 75.0)/2 = 71.6\]
Now we can finally predict the score by multiplying each team's expected offensive efficiency by the expected number of possessions we just calculated, divided by 100:
\[ E[\textrm{Dayton}_{Score}] = 114.1 * \left(71.6/100\right) = 81.7\] \[ E[\textrm{Gonzaga}_{Score}] = 115.0 * \left(71.6/100\right) = 82.3\]
Gonzaga is predicted to beat Dayton by 0.6 points in a nailbiter.
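The whole walkthrough can be collected into one small Python sketch. This is a simplified illustration of the steps above, not the actual prediction algorithm (which, as noted, has more complexity under the hood). The `Team` class and `predict_game` function are hypothetical names, and the default home court advantage of 3.3 is the value used in the equations above.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    obpr: float   # Offensive BPR
    dbpr: float   # Defensive BPR
    tempo: float  # True Tempo


def predict_game(home, away, hca=3.3):
    """Return (home_points, away_points) using the simplified method."""
    possessions = (home.tempo + away.tempo) / 2
    # Neutral-court efficiencies, then shift each side by the home court edge.
    home_eff = (home.obpr - away.dbpr) / 2 + 100 + hca
    away_eff = (away.obpr - home.dbpr) / 2 + 100 - hca
    return home_eff * possessions / 100, away_eff * possessions / 100


dayton = Team("Dayton", obpr=43.1, dbpr=19.7, tempo=68.1)
gonzaga = Team("Gonzaga", obpr=56.2, dbpr=21.6, tempo=75.0)

home_pts, away_pts = predict_game(dayton, gonzaga)
print(f"{dayton.name} {home_pts:.1f}, {gonzaga.name} {away_pts:.1f}")
print(f"Margin: {away_pts - home_pts:.1f}")
```

The code carries full precision through every step, so its totals can differ by about a tenth of a point from the hand calculation, which rounds each intermediate value. The predicted margin still comes out at roughly 0.6 points in Gonzaga's favor.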
In the Bayesian Performance Rating for players, each player has an Offensive BPR and a Defensive BPR, which are added together to make the player's overall BPR. Player BPR has two components: player impact and player efficiency.
The player impact part of BPR attempts to quantify a player's value to his team by looking at how efficiently his team performed on offense and defense for every possession he played. In addition, we want to adjust for the strength of his teammates on the court with him, along with the strength of opposing players for each possession he was on the court. There are some good existing advanced metrics that attempt to do this, such as Adjusted Plus-Minus. This type of metric focuses on the idea that a player's contribution to his team's margin of victory matters most. APM does not use any individual player statistics, but instead utilizes the score outcome of each possession to determine what players are better than others at positively affecting the outcome of the game, in the form of offensive and defensive efficiency. Our player impact ratings are created in a similar fashion, but we make a few adjustments to negate some of the weaknesses of this type of model, which we will explain later on.
Similar to the BPR team ratings, we want to assign a “true” offensive and defensive rating to each player, which indicates his value to his team when he is on the court. If we have five home players and five away players on the floor, and each player has an Offensive BPR and Defensive BPR, then we will define \(H_{1O}\) and \(H_{1D}\) as the OBPR and DBPR for home team player 1, \(H_{2O}\) and \(H_{2D}\) as the ratings for home team player 2, and so on. The same goes for away team players, as \(A_{1O}\) and \(A_{1D}\) are the ratings for away team player 1. For any 10 players on the court for a given possession, we can calculate the expected team efficiencies (points per 100 possessions) with those players on the court as follows:
\[ E[H_{OffEff}] = \frac{(H_{1O} + H_{2O} + H_{3O} + H_{4O} + H_{5O}) - (A_{1D} + A_{2D} + A_{3D} + A_{4D} + A_{5D})}{10} + 100\]
and
\[ E[A_{OffEff}] = \frac{(A_{1O} + A_{2O} + A_{3O} + A_{4O} + A_{5O}) - (H_{1D} + H_{2D} + H_{3D} + H_{4D} + H_{5D})}{10} + 100\]
In this formula, all five offensive and defensive players contribute equally to the expected outcome of a possession, based on each player's OBPR and DBPR. Using this model, we want to find an offensive and defensive rating for each player that best explains the results of every possession played during the season. Using the possession-by-possession results from each game, along with our information about who was on the court for each possession, a Bayesian regression finds the offensive and defensive coefficients for each player. Very good players will have higher positive offensive and defensive ratings, with the average D1 player OBPR and DBPR being set at 0.
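The lineup formula above translates directly into code. In the Python sketch below, the function name and the ten player ratings are made up purely for illustration; they are not real BPR values.

```python
def lineup_off_eff(offense_obprs, defense_dbprs):
    """Expected points per 100 possessions for a five-man offensive lineup
    facing a five-man defensive lineup."""
    if len(offense_obprs) != 5 or len(defense_dbprs) != 5:
        raise ValueError("each lineup must contain exactly five players")
    return (sum(offense_obprs) - sum(defense_dbprs)) / 10 + 100


# Hypothetical ratings, for illustration only.
home_obprs = [4.0, 2.5, 1.0, 0.5, -1.0]   # sums to 7.0
away_dbprs = [3.0, 1.0, 0.0, -0.5, -1.5]  # sums to 2.0

print(lineup_off_eff(home_obprs, away_dbprs))  # (7.0 - 2.0) / 10 + 100 = 100.5
```

The away team's expected efficiency is found the same way, by passing the away players' OBPRs and the home players' DBPRs.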
The main draw of this type of model is that we not only assess the value of a player to his team, but also account for the strength of the other teammates he shares the court with, along with the strength of the opponent players he faces. If we were to look at a more crude measure of player impact, like plus-minus or basic team efficiency when he is on the floor, it can be helpful, but doesn't answer questions such as “did he play with good teammates or bad teammates?” and “Did he play so well because he only played in garbage time against inferior opponents?”. By using a model that adjusts for the strength of all players on the court, we can more accurately assess the value that a player brings to his team when he is on the court.
There are a few shortcomings to this model as it currently stands. One issue is that there is a lot of “noise” in this data. Due to the randomness of basketball possessions, it can be difficult to know whether a player rating estimate reflects the truth about that player's ability or is due to random chance. The model can “overfit” the data, leading to conclusions about players that just don't make sense when compared to the eye test. For example, a deep-bench player who happened to be on the court for a handful of minutes when his team outscored the opponent 20-0 could be given an incredibly high rating, because it appears that his presence was what made the difference for his team. To account for this, we use a Bayesian approach, setting a prior distribution for each player's OBPR and DBPR centered at 0, so that players who don't play many minutes will have ratings near 0, while those who have more substantial playing time can have their ratings move away from 0 as more information about their impact accrues throughout the season. The informativeness of the prior distribution was decided using cross-validation.
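To see why a prior centered at 0 tames small samples, here is a deliberately simplified Python sketch. It is not the regression described above: it treats a single player in isolation with a textbook normal-normal update, and the prior and noise standard deviations are arbitrary assumptions chosen only to make the effect visible.

```python
def posterior_mean(observed_mean, n_possessions,
                   prior_mean=0.0, prior_sd=3.0, noise_sd=100.0):
    """Posterior mean of a rating under a normal prior and normal noise.

    ASSUMPTIONS: prior_sd and noise_sd are illustrative values, and this
    one-player update ignores teammates and opponents entirely.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n_possessions / noise_sd ** 2
    weighted_sum = prior_mean * prior_precision + observed_mean * data_precision
    return weighted_sum / (prior_precision + data_precision)


# A bench player with a huge raw number over very few possessions...
bench = posterior_mean(observed_mean=20.0, n_possessions=40)
# ...versus a starter with a modest raw number over a full season.
starter = posterior_mean(observed_mean=5.0, n_possessions=2000)

print(round(bench, 2), round(starter, 2))
```

With these made-up inputs, the bench player's estimate is pulled almost all the way back to 0, while the starter keeps most of his observed value, which is the qualitative behavior the prior is meant to produce.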
Another issue with the player impact model is that it relies heavily on the assumption that a roster of players will frequently rotate in and out of the game, so that we benefit from seeing lots of different lineup combinations and can distinguish each individual's impact on his team from that of his teammates. These player ratings become less reliable when there are pairs of teammates who are almost always on the court together, or who rarely ever share possessions. In situations where players A and B are on the court together 95% of the time, it is difficult to distinguish which teammate is having the larger impact on his team.
This is where the player efficiency portion of BPR comes into play. We want to combine the information about player impact at a play-by-play level with individual box score statistics, in the form of player efficiency metrics, to come up with the best predictive ratings possible. A widely acknowledged advanced efficiency metric used in basketball is Player Efficiency Rating (PER), created by John Hollinger. PER uses all of a player's individual statistics in a season to come up with a single number that best represents his contribution. PER isn't perfect, but it is easy to calculate and gives us a good starting point for evaluating a player's statistical worth. We don't want to use PER as the final representation of a player's performance, but we can still use it to help guide our final Bayesian Performance Ratings by creating an informative prior distribution on a player's rating based on his PER. By using data from the past several years, we can see how well PER functions as a predictor for BPR. Then, instead of each player's prior distribution for OBPR and DBPR being centered at 0, we can center it at a value predicted by that player's PER. The graph below shows the relationship between PER and player impact rating for the 2018-2019 season:
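One way to picture a PER-informed prior is to fit a line to past (PER, impact rating) pairs and use the fitted value as the center of a new player's prior. The Python sketch below does this with ordinary least squares. The linear form is an assumption (the text does not say what kind of relationship is fitted), the historical numbers are invented for illustration, and it handles one rating rather than separate OBPR and DBPR priors.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# HYPOTHETICAL historical data: past players' PER and impact ratings.
past_per = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
past_impact = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

slope, intercept = fit_line(past_per, past_impact)


def prior_center(per):
    """Center of a player's prior, predicted from his PER (assumed linear)."""
    return slope * per + intercept


print(round(prior_center(15.0), 2))  # a middling PER maps to a prior near 0
print(round(prior_center(37.3), 2))  # a very high PER maps to a positive prior
```

A player with little playing time would then be pulled toward this PER-based center instead of toward 0.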
This technique has turned out to be incredibly beneficial at generating player ratings that more accurately represent both the value and skill of each player at the offensive and defensive ends. An example of this is 2018-2019 Brandon Clarke, who had a tremendous season for Gonzaga before becoming a first-round draft pick. In the player impact ratings, he is ranked 10th best in the country for 2018-2019. However, once we use his high PER of 37.3 to inform his prior distribution for offensive and defensive ratings, he finishes 2nd in the country in our final BPR, behind Zion Williamson. His teammate Zach Norvell sees his ranking drop from 7th to 13th once we incorporate his PER for the year, which was only 21.6.
Using PER to influence our ratings doesn't change the fact that we can still easily detect good performances from players who otherwise may not fill up the stat sheet. A prime example of an underrated “intangibles” guy is 2019-2020 Alabama forward Herbert Jones, who had the highest DBPR and third-highest overall BPR that year, despite having a PER of only 15.2. The degree to which he elevated his team's performance when he was on the court was astronomical compared to Alabama's numbers without him.
The Team Breakdown tool is used to gain detailed insights into the performance of a team, broken down player by player. This is especially helpful when trying to explain the offensive and defensive ratings assigned to each player.
Here is the recommended approach for using the Team Breakdown:
There are several factors that we are assessing currently or in the near future:
My name is Evan Miyakawa. I have a master's degree in statistics and am currently working on my doctoral dissertation in statistics at Baylor University. I earned my bachelor's degree from Taylor University. You can find out more on my LinkedIn page.
Note that this project is not a part of my dissertation work but is something I’ve been putting together in my free time.
Feel free to email me at evanmiyakawa@gmail.com with any questions. You can also find me on Twitter.