With those caveats stated, here's the first problem with win/loss record as a measure of team strength: no team debates a representative sample of the teams at the tournament. Every team debates six opponents out of the n teams at the tournament. Except at round robins and very small tournaments, that proportion isn't large; at a normal-sized tournament, it might be under 10%. Recognizing this, tournaments do not use randomly selected opponents. Brackets select a subset of opponents for each team to debate, and the results are more informative than if opponent selection were random. While there are many pathways through the tournament (e.g., WWWLLL versus WLWLWL), the key is usually what caliber of opponent a team beats and what caliber of opponent beats it. For example, a team that beats a 2-4 and loses to a 4-2 will, because of the way the brackets work, most often end up with a 3-3 record, revealing that it is probably in the middle third of the tournament. Of course, sometimes the brackets don't work out this way; for example, a team that beats a 4-2 might end up with a 3-3 record. In that case, the overall win/loss record is not an accurate reflection of one or both teams' strength.

It is possible to create many different, more accurate measures of team strength that account for schedule strength using complicated formulas. But I think a measure needs to be relatively easy to understand and transparent if it's to be adopted. The measure I developed (from a suggestion from Steve Gray) is weighted wins/weighted losses. Let's say team A beats teams B, C, and D, and loses to team G: 3 wins, 1 loss. But what if team B was a 3-1 team, team C was a 2-2 team, team D was an 0-4 team, and team G was a 3-1 team? The win against B ought to count for more than the win against D. The weighted measures would give team A exactly 8 "wins" (3 wins + 3 + 2 + 0) and 2 "losses" (1 loss + 1): it gains an extra "win" for every win of each opponent it beats (B, +3; C, +2; D, +0) and incurs an extra "loss" for every loss of each opponent that defeats it (G, +1). As a first step, this already makes an enormous difference in assessing a team's true strength. Teams that defeat good opponents have more weighted wins than teams that defeat mediocre opponents, even if they have the same win/loss record. (Of course, it's possible that a team that has only been paired against mediocre opponents is actually very good -- which will get sorted out through power-matching!)
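A single weighting pass like the one above is easy to sketch in code. This is a minimal illustration, not the author's implementation; the records and round results are the example from the text, and the data structures are my own.

```python
# Raw records after the rounds in the example: team -> (wins, losses)
records = {"A": (3, 1), "B": (3, 1), "C": (2, 2), "D": (0, 4), "G": (3, 1)}

# The rounds involving team A from the example: (winner, loser) pairs
results = [("A", "B"), ("A", "C"), ("A", "D"), ("G", "A")]

def weighted_record(team, records, results):
    """One weighting pass: a team's raw wins plus the raw wins of every
    opponent it beat, and its raw losses plus the raw losses of every
    opponent that beat it."""
    wins, losses = records[team]
    for winner, loser in results:
        if winner == team:
            wins += records[loser][0]     # credit the beaten opponent's wins
        elif loser == team:
            losses += records[winner][1]  # charge the victor's losses
    return wins, losses

print(weighted_record("A", records, results))  # -> (8, 2), matching the text
```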

The method becomes enormously powerful if it is iterated: team A is now treated as having 8 wins and 2 losses, just as all its opponents are credited with their weighted wins and losses. Let's say B has a weighted record of 5-2; C, 4-4; D, 0-7; and G, 6-2. In the second iteration, team A will have 12 re-weighted wins (3 wins + 5 + 4 + 0) and 3 re-weighted losses (1 loss + 2). The process can be repeated until a reasonable stopping point: say, until some team has at least as many weighted wins (or weighted losses) as there are teams at the tournament. Based on this rule, the process will iterate about log(n) times for an n-team tournament.
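The iteration described above can be sketched as follows. I'm assuming, as in the text's example, that each pass adds opponents' *current* weighted wins and losses to a team's *raw* record, and that the stopping rule is the one just described. The tiny data set (only team A's rounds) is illustrative, not a full tournament.

```python
# Raw records and (winner, loser) results from the text's example
records = {"A": (3, 1), "B": (3, 1), "C": (2, 2), "D": (0, 4), "G": (3, 1)}
results = [("A", "B"), ("A", "C"), ("A", "D"), ("G", "A")]

def iterate_weights(records, results, n_teams):
    """Repeat the weighting pass until some team's weighted wins or
    weighted losses reach the number of teams (the text's stopping rule)."""
    weighted = dict(records)  # start from raw records
    iterations = 0
    while all(w < n_teams and l < n_teams for w, l in weighted.values()):
        new = {}
        for team, (raw_w, raw_l) in records.items():
            w, l = raw_w, raw_l
            for winner, loser in results:
                if winner == team:
                    w += weighted[loser][0]   # opponent's current weighted wins
                elif loser == team:
                    l += weighted[winner][1]  # victor's current weighted losses
            new[team] = (w, l)
        weighted = new
        iterations += 1
    return weighted, iterations

final, n_iter = iterate_weights(records, results, len(records))
print(final["A"], n_iter)
```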

For simplicity's sake, I would turn the weighted wins and weighted losses into one statistic:

where r is the number of rounds at the tournament. The first term will create something like a handicapped win percentage.
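The combined statistic itself does not appear above, so the sketch below is only a guess at its general shape: a "handicapped win percentage" built from the weighted record as the first term, plus a small second term involving r. Both the functional form and the role of r here are my assumptions, not the author's formula.

```python
def combined_score(ww, wl, wins, r):
    """Hypothetical one-number summary (NOT the post's actual formula).

    First term: weighted win share, a rough "handicapped win percentage".
    Second term: raw win percentage (wins / r) as a small tie-breaker --
    purely a guess at how r might enter the statistic."""
    return ww / (ww + wl) + 0.01 * (wins / r)

# Team A from the example: 8 weighted wins, 2 weighted losses,
# 3 raw wins in r = 4 rounds
print(combined_score(8, 2, 3, 4))  # -> 0.8075
```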

I ran this method on a small four-round tournament, which took only three iterations. You can see the results here:

In only one case, highlighted in yellow, did the method rank a team above a team that had beaten it. I highlighted in green three teams that moved up dramatically under this ranking versus a traditional wins-plus-speaker-points method, and in pink four teams that moved down dramatically. Based on their schedule strengths, these results all seem pretty defensible to me.

Some might point out that there's an inherent difficulty created by "upsets," where good teams happen to get knocked off by bad teams. How much is the good team "punished," or pushed down in the rankings, by such a loss? I thought a different kind of data set, where the teams play a much more representative sample of opponents, would show the basic sanity of the weighted wins approach. I used it to rank the 2009 N.F.L. regular season because football has so many "upsets" (about 25%! -- much higher than at debate tournaments, where the rate is about 5%). You can see how well the method I described handles the unusual losses:

As you can see, it does a reasonable job, despite lots of unusual losses. [*Note*: weighted wins is scaled differently here for an unimportant technical reason (I was confounded by how to deal with the multiple games teams play against the same opponents), but the method is the same.]

Here is a second post on weighted wins.
