A major (180+ teams) 2008 national debate tournament was the point of comparison. The actual results of the seven preliminary rounds were available online. For the hypothetical tournament, round 1 used the same pre-set matches and results as the actual tournament. Round 2 and every subsequent even round used the strength-of-schedule power matching method: teams that had faced strong opponents so far met the weaker teams in their bracket, and teams that had faced weak opponents so far met the stronger ones. Round 3 and every subsequent odd round used the traditional high/low power matching method in TRPC. Once the pairing for a round was set, a "ballot" was entered for every debate. If the teams had met at the actual tournament, the real result was entered. If the teams had not met, each side received the speaker points it had earned in its respective debate for that round (against a different opponent) at the actual tournament, and the winner was determined from the final rankings at the actual tournament (the higher-ranked team won). Then I power-matched the next round. Thus, the hypothetical tournament was meant to replicate the actual tournament as closely as possible, without using any information that would not have been available to a tab director during the tournament.
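The ballot-entry rule is a small decision procedure, and it can be sketched in Java (the language of the linked code). The class, method, and map names here are my own invention for illustration, not anything from TRPC:

```java
import java.util.Map;

public class HypoBallot {
    record Ballot(String winner, double ptsA, double ptsB) {}

    // Ballot for a hypothetical debate between teams a and b in one round.
    // actualWinner is non-null only when the teams really met that round at
    // the actual tournament; roundPoints maps team -> speaker points earned
    // in that round at the actual tournament; finalRank maps team -> final
    // ranking at the actual tournament (1 = best).
    static Ballot enter(String a, String b, String actualWinner,
                        Map<String, Double> roundPoints,
                        Map<String, Integer> finalRank) {
        if (actualWinner != null) {
            // The teams met at the actual tournament: use the real result.
            return new Ballot(actualWinner, roundPoints.get(a), roundPoints.get(b));
        }
        // They never met: each side keeps the speaker points it earned that
        // round against its real opponent, and the team with the higher
        // final ranking wins.
        String winner = finalRank.get(a) <= finalRank.get(b) ? a : b;
        return new Ballot(winner, roundPoints.get(a), roundPoints.get(b));
    }
}
```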
And the results are...
First, the strength-of-schedule power matching method generated only the correct, minimum number of pull-ups. Teams pulled up were always those with the weakest opposition record in their bracket. Otherwise, all teams were correctly paired within their brackets.
Second, the results from the hypothetical tournament closely matched those of the actual tournament. Of the 32 teams that made it into elimination rounds at the actual tournament, 30 of them would have made it into elimination rounds at the hypothetical tournament. (At the actual tournament, there was a four-way speaker point tie for 31st place, broken on opponent wins. The 30th and 32nd place teams at the actual tournament had different opponents — and lower opponent wins — at the hypothetical tournament and dropped below the threshold.)
Third, and most important, in every bracket except the 7-0s, the hypothetical tournament using the strength-of-schedule power matching method had narrower ranges for opponent wins and smaller standard deviations for opponent wins:
(With so few 7-0s, the addition of one outlier made a large impact, so this is not the most revealing statistic.) The range of opponent wins for 6-1s at the actual tournament was nine; at the hypothetical tournament it was four, and the standard deviation was nearly halved. The range for 5-2s decreased from 15 to seven, and the range for 4-3s from 15 to ten.
Another telling statistic is the comparison for the top 32 teams at the actual and the top 32 teams at the hypothetical tournament:
The ranges were the same, but the standard deviation was far lower at the hypothetical tournament, meaning that more of the data points were closer to the mean opponent wins. The middle 94% of the top 32 teams (eliminating the highest and lowest outliers) is even more telling. Even without the outliers, the range for the actual tournament was still 12; the range for the hypothetical tournament dropped to eight. You can see that the standard deviation is nearly one full opponent win lower.
The bottom line is, it worked. There was nothing in the process that required human judgment or that would be difficult to do in a computer program:
- Retrieve the relevant statistics from the tabulation program.
- Assign byes.
- Calculate the z-scores of a team's strength and its opponents' strength.
- Populate an optimization matrix.
- Solve the optimization matrix using the Hungarian algorithm.
- Feed the solution back into the tabulation program as the next round’s pairings.
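The middle three steps can be sketched in Java. To be clear, this is an illustrative sketch, not the code I used: the cost function shown (penalize pairings whose combined opponent-strength and team-strength z-scores stray from zero) is one plausible choice, it treats rows and columns as two halves of a bracket to sidestep the self-pairing wrinkle, and for brevity it solves the tiny matrix by brute-force permutation search rather than the Hungarian algorithm, which is what you would want for a full bracket:

```java
import java.util.Arrays;

public class SosPairing {
    // Step 3: z-scores of a statistic over one bracket.
    static double[] zScores(double[] xs) {
        double mean = Arrays.stream(xs).average().orElse(0);
        double var = Arrays.stream(xs)
                .map(x -> (x - mean) * (x - mean)).average().orElse(0);
        double sd = Math.sqrt(var);
        double[] z = new double[xs.length];
        for (int i = 0; i < xs.length; i++)
            z[i] = sd == 0 ? 0 : (xs[i] - mean) / sd;
        return z;
    }

    // Step 4: cost of sending team j (own-strength z-score) against the
    // schedule of team i (cumulative opponent-strength z-score). Pairing
    // the toughest schedule with the weakest opponent drives each sum
    // toward zero, which is what we minimize. (An invented cost for
    // illustration; the paper describes the real one.)
    static double[][] costMatrix(double[] oppStrengthZ, double[] teamStrengthZ) {
        int n = oppStrengthZ.length;
        double[][] cost = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                cost[i][j] = Math.abs(oppStrengthZ[i] + teamStrengthZ[j]);
        return cost;
    }

    // Step 5: minimum-cost assignment. Brute force over permutations is
    // fine for a toy bracket; use the Hungarian algorithm in practice.
    static int[] solveAssignment(double[][] cost) {
        int n = cost.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        int[][] best = { perm.clone() };
        double[] bestCost = { total(cost, perm) };
        permute(cost, perm, 0, best, bestCost);
        return best[0];
    }

    static double total(double[][] cost, int[] perm) {
        double s = 0;
        for (int i = 0; i < perm.length; i++) s += cost[i][perm[i]];
        return s;
    }

    static void permute(double[][] cost, int[] perm, int k,
                        int[][] best, double[] bestCost) {
        if (k == perm.length) {
            double c = total(cost, perm);
            if (c < bestCost[0]) { bestCost[0] = c; best[0] = perm.clone(); }
            return;
        }
        for (int i = k; i < perm.length; i++) {
            int t = perm[k]; perm[k] = perm[i]; perm[i] = t;
            permute(cost, perm, k + 1, best, bestCost);
            t = perm[k]; perm[k] = perm[i]; perm[i] = t;
        }
    }
}
```

On a three-team example where the hardest-schedule slot has opponent-strength z of +1 and the strongest team has strength z of +1, the minimum-cost assignment sends the hardest schedule against the weakest team, exactly the strength-of-schedule behavior described above.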
Read the full paper for a description of the method I used:
And here's the Java code for the Hungarian algorithm I used: http://www.mediafire.com/?y3nqjtmnim2