## Tuesday, July 3, 2012

### Probability of upsets

A team has an average strength or skill level, which is how well we expect it to debate in a typical round. Think of this as the team's average over a single tournament rather than over a whole season, since teams probably improve during the course of a season. But a team's strength is also variable: in any given round, it might debate better or worse than its average. I assume this variability follows a normal distribution. When two teams debate, either might debate above or below its average. How should we model this?

Picture a plot in which the horizontal axis shows possible performances of team 1, based on a normal distribution centered at 0 (indicating an exactly average performance for team 1 based on its average strength). The vertical axis shows possible performances of team 2, again a normal distribution around 0, the average-strength performance.

Let's say that team 1 is significantly stronger than team 2. In order for team 2 to win, its performance relative to its own average must beat team 1's by more than the gap in strength. Typically that means team 2 has a much better than average performance while team 1 has a much worse than average one. In other words, only some of the possible results, mostly in quadrant 2 (upper left), would result in a team 2 win, like so:

The red cases highlight the upsets. Rare indeed, because team 1 must underperform and team 2 must overperform. As an alternative, consider the scenario that team 1 and team 2 are evenly matched. In this world, team 2 wins about 50% of the time:
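The two scenarios can be checked with a quick simulation. This is a minimal sketch, not the analysis used in this post: it assumes each team's performance is its average strength plus an independent normal deviation, and the function name `upset_rate`, the `spread` of 1.0, and the strength gaps of 0 and 2 are all made-up values for illustration.

```python
import random

def upset_rate(strength_gap, spread=1.0, trials=100_000, seed=0):
    """Estimate how often team 2 beats team 1.

    strength_gap: team 1's average strength minus team 2's (hypothetical units).
    spread: standard deviation of each team's round-to-round performance.
    """
    rng = random.Random(seed)
    upsets = 0
    for _ in range(trials):
        perf1 = rng.gauss(0.0, spread)  # team 1's deviation from its average
        perf2 = rng.gauss(0.0, spread)  # team 2's deviation from its average
        # Team 2 wins when its relative performance outweighs the strength gap.
        if perf2 - perf1 > strength_gap:
            upsets += 1
    return upsets / trials

even = upset_rate(0.0)      # evenly matched teams
lopsided = upset_rate(2.0)  # team 1 is much stronger
print(f"evenly matched: team 2 wins {even:.1%} of rounds")
print(f"team 1 much stronger: team 2 wins {lopsided:.1%} of rounds")
```

With evenly matched teams the estimate should land near 50%, and it should shrink quickly as the gap grows relative to the spread.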

Mathematically, it is simple to model this with a logistic function. If difference = team 1 strength - team 2 strength, then the formula for the probability of team 1 winning is

$$P(\text{team 1 wins}) = \frac{1}{1 + e^{-k \cdot \text{difference}}}$$

where k depends on the units in which strength is measured and just how variable the teams' performances are. The value of k is an empirical research question that could change from season to season. The logistic function looks like this:

As the difference gets larger, team 1 is stronger and more likely to win, approaching 100%. As the difference turns negative, team 1 is weaker and less likely to win, approaching 0%. And at a difference of 0, the teams are even, and the odds are 50-50.
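Here is a small sketch of that behavior, assuming the standard logistic form P = 1 / (1 + exp(-k * difference)). The function name `win_probability` is my own, and the value of k used below is only a placeholder so the numbers have something to work with.

```python
import math

def win_probability(difference, k):
    """Probability that team 1 wins, given difference = strength 1 - strength 2."""
    return 1.0 / (1.0 + math.exp(-k * difference))

k = 6.5  # placeholder; the right value depends on the data and the rating units
for difference in (-0.5, -0.2, 0.0, 0.2, 0.5):
    print(f"difference {difference:+.1f}: team 1 wins {win_probability(difference, k):.1%}")
```

The curve is symmetric: the probability for a difference of d and the probability for -d always add up to 1, which is just the statement that one of the two teams has to win.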

I analyzed CEDA/NDT data for open/varsity policy debate in the 2010-2011 season. I measured each team's strength using the easy-to-understand measure of weighted wins, expressed as an expected win percentage for a season (so, 62% means that a team is expected to win 62% of its rounds in an entire season, adjusted slightly from its actual win percentage by schedule strength). Then I classified every round, grouped by the difference in the two teams' strengths, as either a win (for the higher rated team) or an upset (for the lower rated team).

I found that about 20% of rounds were upsets. This is close to football's 25% or so. But of course, most of the upsets occur when the teams are fairly close in rating. Here are the results:

So, for example, when the difference in the ratings was greater than 0.5 but less than 0.55, the higher rated team won 97.3% of the time. This is obviously a significant difference in the teams' strengths: a team rated at 82% weighted wins versus a team rated at 30% weighted wins! It is hardly surprising that this is such a lock. At the other extreme, when the difference in the ratings was greater than 0.1 but less than 0.15, the higher rated team won only about 59% of the time. These are close rounds, nearly toss-ups. A difference of 0.2 seems to be the tipping point: above this, there are few upsets.
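For anyone who wants to repeat this kind of tabulation, here is a rough sketch of the binning step. The record layout `(rating_1, rating_2, team_1_won)`, the function name `favorite_win_rates`, and the four sample rounds are all invented for illustration; they are not the CEDA/NDT data.

```python
import math
from collections import defaultdict

def favorite_win_rates(rounds, bin_width=0.05):
    """Group rounds by rating difference and return the favorite's win rate per bin.

    rounds: iterable of (rating_1, rating_2, team_1_won), ratings as fractions.
    Returns {lower edge of bin: share of rounds won by the higher rated team}.
    """
    tallies = defaultdict(lambda: [0, 0])  # bin index -> [favorite wins, rounds]
    for rating_1, rating_2, team_1_won in rounds:
        diff = rating_1 - rating_2
        if diff == 0:
            continue  # equal ratings: there is no favorite to upset
        favorite_won = team_1_won if diff > 0 else not team_1_won
        # Rounding first guards against floating-point noise at bin edges.
        index = math.floor(round(abs(diff) / bin_width, 9))
        tallies[index][0] += int(favorite_won)
        tallies[index][1] += 1
    return {round(i * bin_width, 10): wins / total
            for i, (wins, total) in sorted(tallies.items())}

# Hypothetical rounds, made up for this example.
sample_rounds = [
    (0.82, 0.30, True),   # favorite (team 1) wins, difference 0.52
    (0.30, 0.82, False),  # favorite (team 2) wins, difference 0.52
    (0.62, 0.50, True),   # favorite (team 1) wins, difference 0.12
    (0.50, 0.62, True),   # upset: the lower rated team 1 wins
]
rates = favorite_win_rates(sample_rounds)
for lower_edge, rate in rates.items():
    print(f"difference {lower_edge:.2f} to {lower_edge + 0.05:.2f}: favorite wins {rate:.0%}")
```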

Here is the same data in graph form:

The graph includes a best-fit curve. Fitting the logistic formula above to the data, my best guess is that k is about 6.5.
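One way to estimate k from binned win rates is a least-squares fit. This is only a sketch of one possible approach, not necessarily how the estimate above was produced: it assumes the standard logistic form, uses a simple ternary search over k, and checks itself against synthetic points generated from a known k rather than against real results. The names `logistic` and `fit_k` are my own.

```python
import math

def logistic(difference, k):
    return 1.0 / (1.0 + math.exp(-k * difference))

def fit_k(points, k_low=0.0, k_high=30.0, tolerance=1e-6):
    """Least-squares estimate of k from (difference, favorite win rate) pairs.

    Uses ternary search, which assumes the squared error has a single
    minimum between k_low and k_high.
    """
    def squared_error(k):
        return sum((logistic(d, k) - rate) ** 2 for d, rate in points)

    while k_high - k_low > tolerance:
        third = (k_high - k_low) / 3.0
        left, right = k_low + third, k_high - third
        if squared_error(left) < squared_error(right):
            k_high = right  # the minimum lies to the left of `right`
        else:
            k_low = left    # the minimum lies to the right of `left`
    return (k_low + k_high) / 2.0

# Synthetic check: build exact points from a known k at the midpoints of
# 0.05-wide bins, then confirm the fit recovers that k.
true_k = 6.5
midpoints = [0.025 + 0.05 * i for i in range(11)]
synthetic_points = [(d, logistic(d, true_k)) for d in midpoints]
estimated_k = fit_k(synthetic_points)
print(f"recovered k = {estimated_k:.3f}")
```

With real data, the bins hold very different numbers of rounds, so weighting each bin by its round count (or fitting round by round with maximum likelihood) would probably be a better choice than this unweighted version.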