With that caveat stated, here's the first problem with win/loss record as a measure of team strength: no team debates a representative sample of the teams at the tournament. Every team debates six opponents out of n teams at the tournament. Except for round robins and very small tournaments, the proportion isn't very large. At a normal-sized tournament, it might be under 10%.
Recognizing this, tournaments do not use randomly selected opponents. Brackets select a subset of opponents for teams to debate, and the results are more informative than they would be if opponent selection were random. While there are many pathways through the tournament (e.g., WWWLLL versus WLWLWL), usually the key is what caliber of opponent a team beats and by what caliber of opponent a team is beaten. For example, a team that beats a 2-4 and loses to a 4-2 will, because of the way the brackets work, most often end up with a 3-3 record, revealing that this team is in the middle third (between the 34th and 66th percentile) of the tournament. Of course, sometimes the brackets don't work perfectly in this way; for example, a team that beats a 4-2 might end up with a 3-3 record. Clearly, in this case, the overall win/loss record is not an accurate reflection of one or both teams' strength. I began to think about whether another record might be a better measure.
I think the answer is a "transitive" win/loss record. Rather than give a team a win/loss record only for direct wins and losses, instead give every team a win/loss record for both direct and transitive wins and losses. What is a transitive win? If team a beats b (a > b) and team b beats c (b > c), then team a gets the direct win against b and a transitive win against c (a > b > c). The same holds true for losses. If team a loses to d (a < d) and d loses to e (d < e), then team a gets a direct loss against d and a transitive loss against e (a < d < e). The process is repeated until every transitive win or loss by any number of steps is recorded. Now, before complaining that this isn't fair and team a didn't actually beat c or lose to e, bear in mind my caveat: this isn't a method for final rankings. It's a method for estimating a team's strength. The question is not whether this is fair, but whether it is accurate, and the answer is that a team's transitive win/loss record is a more accurate measure of a team's true strength than its traditional, direct win/loss record. Transitive win/loss record gives a better sense of where a team fits compared to the entire pool.
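To make the definition concrete, here is a minimal sketch that follows chains of results outward from a team. The function names and the list of (winner, loser) pairs are my own illustration, not part of any tabulation software, and the five-team example simply encodes the a, b, c, d, e chain described above.

```python
from collections import deque

def reachable(graph, start):
    """Return every team reachable from `start` by following edges in `graph`."""
    seen = set()
    queue = deque([start])
    while queue:
        team = queue.popleft()
        for nxt in graph.get(team, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def transitive_record(results):
    """Map each team to (teams beaten, teams lost to), direct and transitive."""
    beat = {}      # winner -> set of teams it beat directly
    lost_to = {}   # loser -> set of teams it lost to directly
    for winner, loser in results:
        beat.setdefault(winner, set()).add(loser)
        lost_to.setdefault(loser, set()).add(winner)
    teams = set(beat) | set(lost_to)
    return {t: (reachable(beat, t), reachable(lost_to, t)) for t in teams}

# Hypothetical results encoding e > d > a > b > c.
results = [("a", "b"), ("b", "c"), ("d", "a"), ("e", "d")]
wins, losses = transitive_record(results)["a"]
print(sorted(wins), sorted(losses))  # ['b', 'c'] ['d', 'e']
```

Team a ends up with one direct and one transitive win (b and c) and one direct and one transitive loss (d and e).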
I looked at a small tournament (33 teams) to start. The top team could be traced to 4 direct and 26 transitive wins and to 0 direct (and therefore 0 transitive) losses. The bottom team could be traced to 0 direct wins and to 4 direct and 21 transitive losses. The median team could be traced to 2 direct and 7 transitive wins and to 2 direct and 7 transitive losses. (There are notes on the algorithm for calculating transitive wins at the end of this post, but for now, I wanted to focus on the results.) I did think at first about using this transitive win/loss record to calculate a percentile like so: (direct and transitive wins) / (direct and transitive wins + direct and transitive losses). The top team would be 100th percentile; the bottom team 0th; and the median team happened to be exactly 50th by this calculation. One nice result is that, by this calculation, every team is ranked above every opponent it beat. If team a beats b, then team a's direct and transitive wins must be greater than b's, since a gets "credit" for all of b's wins, and therefore b's wins are a subset of a's. Likewise for transitive losses: if a beats b, a's losses must be fewer than b's. (This calculation also is a more accurate measure of a team's overall schedule strength than the sum of all opponents' wins, since teams get "credits" only for opponents they beat and actually receive "demerits" for opponents that defeat them. For these two reasons, I think this calculation is a good candidate to replace opponent wins as a tie-breaker in final rankings.) I was almost tempted to use this calculation as the measure by itself, except for the River Gimbel problem.
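As a quick check of the arithmetic, the percentile calculation can be written as a one-line function and applied to the three teams quoted above. The function name is mine, and returning None for a team with no comparisons at all is an assumption the post does not address.

```python
def transitive_percentile(wins, losses):
    """wins and losses each count direct plus transitive results."""
    total = wins + losses
    if total == 0:
        return None  # assumption: no comparisons means no estimate
    return wins / total

top = transitive_percentile(4 + 26, 0)        # top team
bottom = transitive_percentile(0, 4 + 21)     # bottom team
median = transitive_percentile(2 + 7, 2 + 7)  # median team
print(top, bottom, median)  # 1.0 0.0 0.5
```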
River Gimbel was a debater at this tournament. For most of the competitors, direct and transitive wins + direct and transitive losses was in the high teens -- 16 to 21 was the IQR. Given that the tournament was only 33 teams, these competitors could be compared, directly or transitively, to half or more of the entire pool. River Gimbel could be compared to six. River lost to two of the top teams (thus acquiring two direct losses and two transitive losses) and won against two of the bottom teams (two direct wins and zero transitive wins). The calculation placed him in the 33rd percentile. This result hardly seems accurate. At the worst extreme, he was the 3rd worst team (since at a minimum two teams were below him), or 9th percentile. At the best extreme, he was the 5th best team (since at a minimum four teams were above him), or 88th percentile. That's a wide range. The problem is that this calculation does not recognize the real lack of information due to his bizarre schedule. So, an adjustment:

The first line just represents the direct and transitive wins record, scaled to the size of the tournament (total teams) rather than to the "size" of a debater's direct and transitive record (wins + losses).
If the first line is small, as with any debater at the beginning of the tournament or River Gimbel at the end, the second line fills in the gap. The first factor represents the fraction of all the opponents at the tournament against which a team has no direct or transitive comparison. The second factor provides a best guess about a team's strength based on the average of its percentile for speaker points and its percentile for judge variance. Based on this, River Gimbel moved up to 18th place, or 48th percentile, because he had slightly below average speaker points.
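Going only by the verbal description of the two lines, one way to write the adjustment out is sketched below. Treat the exact form as an assumption: in particular, I count the team itself in the first line (wins + 1) and exclude it from the unknown opponents, because that choice reproduces the 9th and 88th percentile extremes worked out for River Gimbel. The function and parameter names are hypothetical, and `prior` stands for the average of the speaker-point percentile and the judge-variance percentile, expressed between 0 and 1.

```python
def adjusted_estimate(wins, losses, n_teams, prior):
    """
    wins, losses: direct plus transitive counts for one team.
    n_teams: total teams at the tournament.
    prior: best guess of strength in [0, 1] from speaker points and judge variance.
    """
    # First line: the wins record scaled to the whole tournament.
    # Assumption: the +1 counts the team itself (rank from the bottom / n_teams).
    first_line = (wins + 1) / n_teams
    # Second line: share of the field with no direct or transitive comparison,
    # weighted by the prior guess.
    unknown = n_teams - 1 - wins - losses
    second_line = (unknown / n_teams) * prior
    return first_line + second_line

# River Gimbel: 2 wins and 4 losses (direct plus transitive) among 33 teams.
low = adjusted_estimate(2, 4, 33, prior=0.0)
high = adjusted_estimate(2, 4, 33, prior=1.0)
print(round(low * 100), round(high * 100))  # 9 88
```

Under this reading, the worst and best possible priors land exactly on the two extremes given by the transitive record, and any prior in between falls inside that range.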
There are three nice features of this calculation. First, even if River Gimbel had the worst speaker points at the tournament, with this calculation he would have received a 9th percentile estimate; and even if he had the best speaker points at the tournament, he would have received an 88th percentile estimate. In other words, this calculation always stays within the range created by the transitive win/loss record. (I leave the algebraic proof of this to the reader.) The transitive win/loss record creates a definite range -- e.g., worse than at least four teams, better than at least two -- and speaker points are used to estimate a position within that range.
Second, it is rare that a team is ranked below an opponent it beat. Basically, it can only happen if it was a low-point win, although not all low-point wins are sufficient to cause this. This seems reasonable: a high-point loser might be the stronger team overall, the loss a result of a dumb mistake rather than an overall lack of skills.
Third, in the first few rounds, the second line is more important, and speaker points are the primary way to assess strength. However, after several rounds, the first line is more important, and transitive win/loss records are the primary way to assess strength. In other words, at the beginning of the tournament, this calculation gives a reasonable guess about a team's strength, but as the tournament progresses, the ranges given by transitive records rapidly become tighter and the estimate of a team's strength converges to the truth.
The problem with evaluating any estimate of a team's strength is that, well, the team's true strength is unknown. Based on what I knew about the teams at this sample tournament, the results were pretty spot on. One could take a larger tournament, calculate estimates of team strength based on all the prelims, and see if this holds up during the elimination rounds -- but elimination rounds are few, and they introduce the confounding variables of new cases and anxiety. I'm open to anyone's suggestions for how to test it out.
Notes on calculating transitive wins and losses:
1. First, create an incidence matrix for the tournament. All the teams are listed along the rows; all the teams are listed along the columns. A score of 1 in (i, j) represents a win by team i over team j. All other cells should be blank.
2. Second, the rows and columns (using the same sorting system) must be sorted to produce a triangular matrix. This is a first stab at ranking the teams in descending order (from top to bottom and from left to right). The diagonal represents a team debating itself, and all these cells are obviously blank. Ideally, all 1s should be in the upper right. Any 1s in the lower left represent a problem with the sorting -- a team has beaten an opponent ranked higher. Ones in the lower left may be easy to resolve by a simple re-ranking of the teams. Alternatively, a 1 might represent part of a cycle (a > b, b > c, a < c) that cannot be resolved by a simple re-ranking. All the 1s must be above the diagonal, which means that all cycles have to be broken. Breaking a cycle means that at least one of the results must be deleted, a 1 turned into a blank cell. Often, speaker points or some other tie-breaker is used to determine which win to override.
3. Third, the triangularized matrix is raised to the second power. The resulting non-empty cells show first-order transitive wins (a > b > c, so a > c). The triangularized matrix is then raised to the third power. The resulting non-empty cells show second-order transitive wins (a > b > c > d, so a > d). The matrix is raised to each successive power until all cells are blank. (This is why cycles must be broken first: with a cycle, the powers never become blank and the process would continue indefinitely.) Once the various matrices are added together -- [A]^1 + [A]^2 + [A]^3... -- then the direct and transitive wins can be found by counting all the non-empty cells in a team's row, and the direct and transitive losses can be found by counting all the non-empty cells in a team's column.
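The three steps above can be sketched in plain Python as follows. This is an illustration under a few assumptions of mine: the teams are passed in already sorted, blank cells are stored as 0, cycles are assumed to have been broken beforehand (the code only detects a leftover cycle and stops), and the function names are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices stored as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_counts(teams, results):
    """Return {team: (wins, losses)}, counting direct plus transitive results."""
    n = len(teams)
    index = {team: i for i, team in enumerate(teams)}
    # Step 1: a 1 in cell (i, j) means team i beat team j; 0 stands for blank.
    a = [[0] * n for _ in range(n)]
    for winner, loser in results:
        a[index[winner]][index[loser]] = 1
    # Step 3: add up A^1 + A^2 + A^3 ... until a power is entirely blank.
    total = [row[:] for row in a]
    power = a
    for _ in range(n):
        power = mat_mul(power, a)
        if not any(map(any, power)):
            break  # all cells blank, so no longer chains exist
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
    else:
        # Without cycles, the n-th power of an n-team matrix is always blank.
        raise ValueError("a cycle remains; break it before summing powers")
    # Non-empty cells in a row are wins; non-empty cells in a column are losses.
    return {
        team: (sum(1 for j in range(n) if total[i][j]),
               sum(1 for k in range(n) if total[k][i]))
        for team, i in index.items()
    }

# Hypothetical five-team example, sorted so that e > d > a > b > c.
teams = ["e", "d", "a", "b", "c"]
results = [("e", "d"), ("d", "a"), ("a", "b"), ("b", "c")]
print(transitive_counts(teams, results)["a"])  # (2, 2)
```

For a real tournament, a matrix library would be faster, but the plain lists keep each step visible.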
























