A major (180+ teams) 2008 national debate tournament was the point of comparison. The actual results of the seven preliminary rounds were available online. For the hypothetical tournament, round 1 used the same pre-set matches and results as the actual tournament. Round 2 and every subsequent even round used the strength-of-schedule power matching method: teams who had faced strong opponents so far met the weaker teams in their bracket, and teams who had faced weak opponents so far met the stronger ones. Round 3 and every subsequent odd round used the traditional high/low power matching method in TRPC. Once the pairing for a round was set, a "ballot" was entered for every debate. If the teams had met at the actual tournament, the real result was entered. If the teams had not met, each side received the speaker points it had earned in that round at the actual tournament (against a different opponent), and the winner was determined by the final rankings from the actual tournament (the higher-ranked team won). Then I power-matched the next round. Thus, the hypothetical tournament was meant to replicate the actual tournament as closely as possible, without using any information that would not have been available to a tab director during the tournament.
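The ballot-entry rule can be sketched in a few lines of Java. Everything here — the `Ballot` record, the field names, the lookup keyed by team IDs — is invented for illustration; only the decision logic follows the description above.

```java
import java.util.*;

public class HypotheticalBallot {
    record Ballot(int winner, double affPoints, double negPoints) {}

    // realHeadToHead: results of debates that actually happened, keyed "affId-negId".
    // finalRank[team]: final ranking at the actual tournament (lower = better).
    // actualPoints[team][round]: speaker points that team earned in that round
    // at the actual tournament.
    static Ballot enterBallot(int aff, int neg, int round,
                              Map<String, Ballot> realHeadToHead,
                              int[] finalRank, double[][] actualPoints) {
        Ballot real = realHeadToHead.get(aff + "-" + neg);
        if (real != null) return real;              // they really met: use the real result
        // Otherwise: winner is the team with the higher final ranking,
        // and each side keeps the points it really earned that round.
        int winner = finalRank[aff] < finalRank[neg] ? aff : neg;
        return new Ballot(winner,
                          actualPoints[aff][round],
                          actualPoints[neg][round]);
    }
}
```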

And the results are...

First, the strength-of-schedule power matching method generated only the correct, minimum number of pull-ups. Teams pulled up were always those with the weakest opposition record in their bracket. Otherwise, all teams were correctly paired within their brackets.

Second, the results from the hypothetical tournament closely matched those of the actual tournament. Of the 32 teams that made it into elimination rounds at the actual tournament, 30 of them would have made it into elimination rounds at the hypothetical tournament. (At the actual tournament, there was a four-way speaker point tie for 31st place, broken on opponent wins. The 30th and 32nd place teams at the actual tournament had different opponents — and lower opponent wins — at the hypothetical tournament and dropped below the threshold.)

Third, and most important, in every bracket except the 7-0s, the hypothetical tournament using the strength-of-schedule power matching method had narrower ranges for opponent wins and smaller standard deviations for opponent wins:

(With so few 7-0s, the addition of one outlier made a large impact, so this is not the most revealing statistic.) The range of opponent wins for 6-1s was nine at the actual tournament but four at the hypothetical tournament, and the standard deviation was nearly halved. The range for 5-2s decreased from 15 to seven. The range for 4-3s decreased from 15 to ten.

Another telling statistic is the comparison between the top 32 teams at the actual tournament and the top 32 teams at the hypothetical tournament:

The ranges were the same, but the standard deviation was far lower at the hypothetical tournament, meaning that more of the data points were closer to the mean opponent wins. The middle 94% of the top 32 teams, eliminating the highest and lowest outliers, is even more telling. The range for the actual tournament, even eliminating the outliers, was still 12; the range for the hypothetical tournament dropped to eight. You can see that the standard deviation is nearly one opponent win lower.
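For the record, here is how the two statistics quoted throughout — range and standard deviation of opponent wins — are computed. The opponent-win figures in `main` are made up for illustration, not data from either tournament.

```java
import java.util.Arrays;

public class OppWinStats {
    // Range: spread between the best- and worst-scheduled team in a bracket.
    static int range(int[] oppWins) {
        return Arrays.stream(oppWins).max().getAsInt()
             - Arrays.stream(oppWins).min().getAsInt();
    }

    // Population standard deviation of opponent wins: how far teams'
    // schedules sit from the bracket's mean schedule strength.
    static double stdDev(int[] oppWins) {
        double mean = Arrays.stream(oppWins).average().getAsDouble();
        double ss = 0;
        for (int w : oppWins) ss += (w - mean) * (w - mean);
        return Math.sqrt(ss / oppWins.length);
    }

    public static void main(String[] args) {
        int[] oppWins = { 18, 20, 21, 22, 22, 23, 25 }; // illustrative bracket
        System.out.println("range = " + range(oppWins));      // prints 7
        System.out.printf("std dev = %.2f%n", stdDev(oppWins)); // prints 2.06
    }
}
```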

The bottom line is, it worked. There was nothing in the process that required human judgment or that would be difficult to do in a computer program:

- Retrieve the relevant statistics from the tabulation program.
- Assign byes.
- Calculate the z-scores of a team's strength and its opponents' strength.
- Populate an optimization matrix.
- Solve the optimization matrix using the Hungarian algorithm.
- Feed the solution back into the tabulation program as the next round’s pairings.
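The middle three steps might look like the sketch below. Two loud caveats: the cost formula is my own guess at a plausible objective (pair a team that has had a hard schedule with a weak opponent, and vice versa) rather than the paper's formula, and for brevity it solves the matching by brute force over pairings instead of the Hungarian algorithm — for a real bracket you would substitute the linked Hungarian implementation.

```java
import java.util.*;

public class SosPairing {
    // strengthZ[i]: z-score of team i's own strength so far (assumed input).
    // sosZ[i]:      z-score of the combined strength of team i's past opponents.

    // Illustrative cost: near zero when a strong-schedule team meets a weak
    // team (and vice versa), pushing every team's final SOS toward the mean.
    static double cost(double[] strengthZ, double[] sosZ, int i, int j) {
        return Math.abs(sosZ[i] + strengthZ[j]) + Math.abs(sosZ[j] + strengthZ[i]);
    }

    // Minimum-cost perfect matching by brute force (stand-in for the Hungarian
    // algorithm; assumes an even bracket, i.e. pull-ups already assigned).
    static int[] bestPairing(double[] strengthZ, double[] sosZ) {
        int n = strengthZ.length;
        int[] partner = new int[n];
        int[] best = new int[n];
        double[] bestCost = { Double.MAX_VALUE };
        Arrays.fill(partner, -1);
        search(strengthZ, sosZ, partner, 0.0, best, bestCost);
        return best; // best[i] = index of team i's opponent
    }

    static void search(double[] s, double[] o, int[] partner, double acc,
                       int[] best, double[] bestCost) {
        int i = 0;
        while (i < partner.length && partner[i] != -1) i++; // first unpaired team
        if (i == partner.length) {                          // complete pairing
            if (acc < bestCost[0]) {
                bestCost[0] = acc;
                System.arraycopy(partner, 0, best, 0, partner.length);
            }
            return;
        }
        for (int j = i + 1; j < partner.length; j++) {
            if (partner[j] != -1) continue;
            partner[i] = j; partner[j] = i;
            search(s, o, partner, acc + cost(s, o, i, j), best, bestCost);
            partner[i] = -1; partner[j] = -1;
        }
    }

    public static void main(String[] args) {
        // Four-team bracket: team 0 has faced strong opposition so far,
        // team 3 weak opposition; teams 1 and 2 are near the mean.
        double[] strengthZ = { 1.0, 0.5, -0.5, -1.0 };
        double[] sosZ      = { 1.2, 0.1, -0.2, -1.1 };
        int[] pairing = bestPairing(strengthZ, sosZ);
        for (int i = 0; i < pairing.length; i++)
            if (i < pairing[i])
                System.out.println("Team " + i + " vs Team " + pairing[i]);
        // Team 0 (hard schedule) draws weak Team 3; Teams 1 and 2 meet.
    }
}
```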

Read the full paper for a description of the method I used:

And here's the Java code for the Hungarian algorithm I used: http://www.mediafire.com/?y3nqjtmnim2

I think it's possible that using hypothetical ballots based on the actual final ranking may be seeding your results with data that makes SOS pairing easier.

In a real tournament, teams debate above and below their actual ability. This can skew SOS results when, for instance, a team loses its first three rounds but wins thereafter.

Expecting each team to perform at their average ability each time might make pairing teams based on performance after round i more likely to reflect fair pairings after round i+k.

To test this, you might rerun the tournament using the old pairing algorithm (with different random presets), but with only hypothetical ballots. If I am right that this artificially smooths the brackets, the average tournament should have more balanced SOSs at the end of the prelims.

I'm not sure that trying it out for just one tournament is good enough, though, since that tournament might have achieved very high quality SOS equalization or very low quality SOS equalization just by chance. This is where using large datasets like those provided by Jon Brushke can come in handy.

By the way, if you want to see some REALLY unbalanced SOS, check out the novice breakout at CEDA nats 2008-2009: the 2nd place team has 34 opp wins, while the first place team has 16!

Upon reviewing my own work here, I now realize that there is a way to increase the power of the strength-of-schedule algorithm so that it works even more efficiently. When I get the time, I'm going to re-run the simulation using A Numbers Game's suggestion as my control. I expect it to yield more dramatic results than these.
