
A Treatise on Strength of Schedule

I hate the NCAA's strength of schedule metric. I hate even more that it is supposed to influence the selection of teams for at-large bids to the national tournament. Yes, they have done better the last couple years (except for maybe ONU this year), but I cannot respect any decision-making process that would allow for Illinois College in 2011 - the 97th rated team in the country and 3rd best team in the 21st rated conference in the country - to receive an at-large bid. That year, Illinois College had a 0.347 SoS in my system (essentially the average true winning percentage of their opponents), and 0.493 by the NCAA's metric, good for the 170th and 127th toughest schedules in the country, respectively. The Blueboys only played two teams that season who were rated above 0.500 (average), but somehow had a respectably average strength of schedule according to the NCAA.

The method that the NCAA uses to determine a team's SoS has several flaws, which I'm going to point out below, but first I'll give you a quick rundown of how the NCAA calculates a team's strength of schedule.

A team's strength of schedule is determined using two factors: opponents' winning percentage (OWP) and opponents' opponents' winning percentage (OOWP). According to d3football.com, the equation used by the NCAA is:

-- Opponents’ average winning percentage (OWP), weighted 2/3.

-- Opponents’ opponents’ average winning percentage (OOWP), weighted 1/3

...but that's not exactly right.

First, opponents' winning percentages aren't averaged. OWP is the sum of opponents' wins divided by the sum of total games played. Also, in calculating OWP, the results of games versus the team being analyzed are excluded. A quick hypothetical to help you understand:

Suppose the University of Okoboji goes 10-0 on the year, and every opponent they played finished with a record of 5-5. At first glance, you would expect their OWP to be 0.5000, but it would actually be 0.5556, because their opponents' 10 cumulative losses to the University of Okoboji are excluded. Instead of calculating (50 wins)/(50 wins + 50 losses), the equation is now (50 wins)/(50 wins + 40 losses). Suppose those 10 opponents all have an OWP of 0.5000, making Okoboji's OOWP an even 0.5000. Their final SoS would be (2/3)*0.5556 + (1/3)*0.5000 = 0.5370.
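If you want to follow along with the arithmetic, here's a minimal sketch of that calculation in Python. It assumes my reading of the formula above (pooled opponent wins over pooled opponent games, with the games against Okoboji stripped out, then the 2/3 - 1/3 weighting); the numbers are the hypothetical ones, not real data.

```python
# Quick numeric check of the Okoboji hypothetical (made-up numbers from above).
opp_wins   = 10 * 5          # ten opponents with 5 wins apiece
opp_losses = 10 * 5 - 10     # 5 losses apiece, minus the 10 losses to Okoboji
owp  = opp_wins / (opp_wins + opp_losses)   # 50 / 90 = 0.5556
oowp = 0.5000                               # stipulated in the hypothetical
print(f"{(2/3) * owp + (1/3) * oowp:.4f}")  # 0.5370
```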

Now here's why this method sucks:

 

1. Bias is introduced in the calculation of OOWP

I said above that the results of the team being analyzed are excluded from OWP, but for some reason, they are not excluded from the calculation of OOWP. Let's go back to the University of Okoboji example to understand what I mean.

All of Okoboji's opponents have a record of 5-5 and an OWP of 0.5000. Lurking inside each and every one of those OWPs is Okoboji at 10-0 (effectively 9-0, since the game against that opponent is excluded), meaning each opponent's other nine opponents had a cumulative winning percentage below 0.5000.

By removing the results of games against Okoboji from the calculation of OWP, the NCAA was obviously trying to isolate their opponents' quality from Okoboji's own record, but by not excluding Okoboji's record from the calculation of OOWP, they allowed Okoboji's record to impact its own strength of schedule.

Suppose for a second that, all other things remaining equal, Okoboji's entire starting line-up was suspended for the entire season and they finished the year 0-10. Now instead of a SoS rating of 0.5370, they would have a SoS of 0.5037. Did their schedule change? No. Did their opponents get worse because they had to play against Okoboji's backups? No. Why then did their SoS change from what would have been a Top 50 rating to a rating that would barely crack the Top 100?
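To see where that 0.5037 comes from, here's the same arithmetic again. OWP doesn't move, because games against Okoboji are excluded from it; the whole swing comes from OOWP, which now sees an 0-10 Okoboji in every opponent's OWP instead of a 10-0 one (the 0.5037 figure implies OOWP falls from 0.5000 to roughly 0.4000 under these assumptions).

```python
# Sensitivity of the hypothetical SoS to Okoboji's own record.
owp = 50 / 90                                    # unchanged: games vs Okoboji excluded
for oowp in (0.5000, 0.4000):                    # 10-0 Okoboji vs 0-10 Okoboji
    print(f"{(2/3) * owp + (1/3) * oowp:.4f}")   # 0.5370, then 0.5037
```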

This is obviously an oversimplified and extreme example of this bias. In reality, the OOWP metric would be much more resilient to changes in Okoboji's record. Most of Okoboji's opponents would be conference opponents, and most wins/losses by Okoboji would be offset by an equivalent loss/win by a different team. The biggest differences would be for independent teams or for teams in a small conference, where there would be fewer mutual opponents. Incidentally, these are probably the teams for whom the SoS metric is most important, because they're not eligible for automatic (Pool A) bids.

 

2. Bias is introduced by teams playing different numbers of games

You may have noticed that I'm essentially assuming that every team in this hypothetical situation plays ten games. For the most part, this is true in DIII, but many teams out West only play nine games, and non-conference games outside of the division (which aren't included by the NCAA or my system) are common. Heck, for the four seasons they were an independent, Wesley never played more than seven regular season games against Division III opponents, and in 2014 they only played five.

This introduces bias into the NCAA's SoS results because the calculation pools opponents' cumulative wins and games played rather than averaging each opponent's winning percentage. So in 2014, when Wesley made the semifinals of the NCAA tournament, their opponents' OWP ratings only included four games from Wesley, minimizing their impact. Wesley's contribution towards OWP was 56% less meaningful than that of a team that played ten games.

To show the impact this can have on a team's final SoS, let's look at Louisiana College in 2014, who played Wesley. Their OWP on the season was 0.6652 - best in the country. Had Wesley played a full ten games that season and maintained their perfect Division III record, Louisiana College's SoS would have increased to 0.7414. This is a difference of 0.0862, which is roughly equivalent to the difference between 50th ranked Dubuque and 187th ranked Western New England.
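Here's a toy illustration of the pooling problem, with made-up records rather than the real 2014 data: one opponent goes undefeated but only has four countable games, while the other nine go 4-5 over nine countable games each. Pooling dilutes the short-schedule opponent; averaging weights every opponent equally.

```python
# Pooled vs. averaged opponents' winning percentage (hypothetical records).
opp_records = [(4, 0)] + [(4, 5)] * 9   # (wins, losses) after exclusions

pooled   = sum(w for w, l in opp_records) / sum(w + l for w, l in opp_records)
averaged = sum(w / (w + l) for w, l in opp_records) / len(opp_records)

print(f"{pooled:.4f}")    # 0.4706 -- the 4-0 opponent barely moves the needle
print(f"{averaged:.4f}")  # 0.5000 -- each opponent counts the same
```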

 

3. Larger conferences automatically have SoS's closer to the mean

Excluding the PAC and MWC, every conference plays a round-robin style regular season schedule. If we were to calculate a conference-only SoS using the NCAA's system, the conference's average SoS would always be 0.500 (because it's a closed loop; there has to be one victory for every loss). But not all conferences are created equal. The NCAA obviously recognizes this, and uses the OOWP metric to try to get some indication of a team's schedule relative to the nation, but the limited number of non-conference games for some teams again skews their results.
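If you don't believe the closed-loop claim, here's a toy simulation with randomly generated results (no real teams or games): in a ten-team round robin, a conference-only, NCAA-style SoS as sketched here comes out at 0.500 for every team, no matter who beats whom.

```python
import itertools
import random

teams = list(range(10))
# Randomly decide every round-robin matchup; store results as (winner, loser).
games = [(a, b) if random.random() < 0.5 else (b, a)
         for a, b in itertools.combinations(teams, 2)]

def record(team, exclude):
    """Wins and losses for `team`, skipping its game against `exclude`."""
    wins   = sum(1 for w, l in games if w == team and l != exclude)
    losses = sum(1 for w, l in games if l == team and w != exclude)
    return wins, losses

def owp(team):
    """Pooled winning pct of `team`'s opponents, excluding games vs `team`."""
    wins = played = 0
    for opp in (t for t in teams if t != team):
        w, l = record(opp, exclude=team)
        wins, played = wins + w, played + w + l
    return wins / played

def sos(team):
    opps = [t for t in teams if t != team]
    oowp = sum(owp(o) for o in opps) / len(opps)
    return (2/3) * owp(team) + (1/3) * oowp

print({round(sos(t), 4) for t in teams})   # {0.5} -- identical for all ten teams
```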

The UMAC does not field very competitive football teams on the national scale (they regularly rank last or second-to-last in my system and in d3football.com's conference rankings). The UMAC also plays a nine-game conference schedule, meaning each team only plays one non-conference game per year. Here's a list of their teams' SoS as calculated by the NCAA in 2015:

For contrast, below are their SoS numbers according to my system. It's important to note that my system produces a SoS value on the same scale as the NCAA.

In case you needed further proof that the NCAA's method is bogus, I've also included the 2015 SoS results for the Ohio Athletic Conference, another ten-team league, below. I don't think anyone who knows anything about Division III football would confuse the competitive quality of these two conferences, but for some reason the NCAA does. When you're looking at the table below, ask yourself, "If I wanted to play an easy schedule to win more games, would I rather play Ohio Northern's 2015 schedule (NCAA: 0.487), or St. Scholastica's (NCAA: 0.499)?"

 

4. The relative quality of opponents at the time the games were played is not considered

The quality of a team varies throughout a season. Some teams get hit by the injury bug, some find new ways to utilize their personnel, and some might just lose their motivation. I used to think the changes in a team's true quality were rather minimal, and that preseason expectations were almost completely bogus, but my analysis seems to imply that a team's rating can change considerably, and that maybe preseason rankings aren't all that bad.

My ratings are actually rather bullish in the preseason. I originally had regressed each team's rating to the national average at the start of the season to avoid over-confidence, but after running some simulations, I removed any regression to the national mean. Instead I only use a team's own long-term trends, which produced much better predictions, as shown below:

My model produces a score prediction, from which a point spread can be determined. In the graph above, the standard error is the standard deviation of the difference between my predictions and actual outcomes of games. As you can see, my model makes its best predictions in Weeks 1, 2, and 9. My worst predictions are BY FAR in the playoffs (mostly due to small sample sizes). If the quality of a team were constant throughout a season, each subsequent week's predictions should be more accurate than the last. Instead, some of my best predictions are made early in the season, when we should know the least about a team's quality.
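For clarity, here's a minimal sketch of how that week-by-week standard error can be computed. The data below is made up purely for illustration; the real inputs would be my predicted point spreads and the actual game margins.

```python
import statistics
from collections import defaultdict

# (week, predicted margin, actual margin) -- hypothetical numbers, not real games
games = [
    (1, 10.5, 14), (1, -3.0, -7), (1, 6.5, 3),
    (2, 21.0, 28), (2, -1.5, 6), (2, 4.0, 10),
]

# Group the prediction errors by week, then take the standard deviation.
errors_by_week = defaultdict(list)
for week, predicted, actual in games:
    errors_by_week[week].append(predicted - actual)

for week in sorted(errors_by_week):
    print(week, round(statistics.stdev(errors_by_week[week]), 2))
```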

By using the ratings of a team's opponents at the time they played, I believe I'm giving the best indication of how difficult a team's complete schedule was. It may be asking too much of the NCAA to do the same, but it's most definitely not too much to ask them to remove bias from their formula through exclusion of a team's own record from their SoS calculation and by using a metric that places equal weight on each team played. If they wanted to take it a step further, they could even use some additional metric (OOOWP?) that gives a more true depiction of a conference's relative strength.

