Thursday, December 31, 2009

A new measure of team strength: weighted wins

I've been thinking about and working for a while on a more accurate method of estimating a team's strength. Bear in mind, I'm not talking about the method for generating final preliminary rankings. The final ranking method is unlikely to ever change, which is not really a bad thing. There's a reason we're all rightfully attached to it: wins and total speaker points may not be the most accurate way to assess a team's strength, but it does seem the most just: those are the wins and points a team earned. So, I'm interested in how to estimate a team's strength only in order to make better power matches during the prelims, not to decide which teams break or don't break. As a second caveat, let me state that there's no way to say objectively that rankings are "correct"; a good ranking method is a good estimate of team strength, which varies anyway from round to round. All you can do is look at whether a ranking method yields some common sense results.

With those caveats stated, here's the first problem with win/loss record as a measure of team strength: no team debates a representative sample of the teams at the tournament. Every team debates six opponents out of n teams at the tournament. Except for round robins and very small tournaments, the proportion isn't very large. At a normal-sized tournament, it might be under 10%. Recognizing this, tournaments do not use randomly selected opponents. Brackets select a subset of opponents for teams to debate, and the results are more informative than if opponent selection is random. While there are many pathways through the tournament (e.g., WWWLLL versus WLWLWL), usually the key is what caliber of opponent a team beats and by what caliber of opponent a team is beaten. For example, a team who beats a 2-4 and loses to a 4-2 will, because of the way the brackets work, most often end up with a 3-3 record, revealing that this team is probably in the middle third of the tournament. Of course, sometimes the brackets don't work perfectly in this way; for example, a team that beats a 4-2 might end up with a 3-3 record. Clearly, in this case, the overall win/loss record is not an accurate reflection of one or both teams' strength.

It is possible to create many different, more accurate measures of team strength that account for schedule strength using complicated formulas. But I think a measure needs to be relatively easy to understand and transparent, if it's to be adopted. The measure I developed (from a suggestion from Steve Gray) is weighted wins/weighted losses. Let's say team A beats teams B, C, and D, and loses to team G: 3 wins, 1 loss. But what if team B was a 3-1 team, team C was a 2-2 team, team D was an 0-4 team, and team G was a 3-1 team? The win against B ought to count for more than the win against D. The weighted measures would give team A exactly 8 "wins" (3 wins + 3 + 2 + 0) and 2 "losses" (1 loss + 1): it gains an extra "win" for every win of each opponent it beats (B, +3; C + 2; and D, + 0) and incurs an extra "loss" for every loss of each opponent that defeats it (G, - 1). As a first step, this already makes an enormous difference in assessing a team's true strength. Teams that defeat good opponents have more weighted wins than teams that defeat mediocre opponents, even if they have the same win/loss record. (Of course, it's possible that a team that has only been paired against mediocre opponents is actually very good -- which will get sorted out through power-matching!)

The method becomes enormously powerful if it re-iterates: for example, team A is now treated as having 8 wins and 2 losses, just as all its opponents are shown with their weighted wins and losses. Let's say B has a weighted record of 5-2; C, 4-4; D, 0-7; and G, 6-2. In the second iteration, team A will have 12 re-weighted wins (3 wins + 5 + 4 + 0) and 3 losses (1 loss + 2). The process can continue to be re-iterated until a reasonable stopping point, say, a team has as many or more weighted wins than there are teams at the tournament (or as many losses)! Based on this rule, the process will re-iterate about log(n) times for an n-team tournament.

For simplicity's sake, I would turn the weighted wins and weighted losses into one statistic:

where r is the number of rounds at the tournament. The first term will create something like a handicapped win percentage.

I ran this method on a small four-round tournament, which took only three iterations. You can see the results here:


In only one case, highlighted yellow, did the method rank a team above someone that it beat. I highlighted in green three teams that dramatically moved up under this ranking and in pink four teams that dramatically moved down. Based on their schedule strengths, these all seem pretty defensible to me.

Some might point out that there's an inherent difficulty created by "upsets," where good teams happen to get knocked off by bad teams. How much is the good team "punished," or pushed down in the rankings, by that loss? I thought a different kind of data set, where the teams play a much more representative sample, would show the basic sanity of the weighted wins approach. I used it to rank the 2009 NFL regular season because there are so many "upsets" in football (about 25%! -- much higher than debate tournaments, where there are about 5%). You can see how well the method I described handles the unusual losses:


As you can see, it does a reasonable job, despite lots of unusual losses.

Tuesday, December 15, 2009

Pascal's triangle (modified)

I started thinking about this problem in a specific debate context (for a specific application), before realizing that it isn't so useful after all. Still, the math is quite interesting.

Here's the original question that got me thinking: Is there a way to assign weighted wins -- round 1 wins count for so much, round 2 wins count for something different, and so on -- such that a tournament can produce a ranking that is consistent with the actual results, such that if team A beats team B, and they finish with the same win-loss record, A must be ranked above B? (Thanks to Steve Gray for coming up with the original idea to use weighted wins this way.) The point values for each round must be decreasing for this to work. If team A beats team B in round 1 for 1 point, then all subsequent point values must be smaller, or else B would have the chance to tie (with an equal number of wins) or surpass A (with an equal number of wins -- of course, B should surpass A if it wins more total rounds than A). I tested out a specific set of weighted win points that meets both of these conditions: {1, 0.9, 0.89, 0.889, 0.8889, 0.88889}. I'll write more on why this particular set works later in this post. The results of the test follows:


Pick any point where you might compare two teams, and you can see that you can always tell who got there by winning the earlier round(s). As you can see from the bottom row, it is possible to distinguish every one of the 64 possible pathways (2^6) that a team could take through the tournament (e.g., WLLWWL), and the results line up nicely in order -- that is, an earlier win always ranks a team higher (e.g., WWLWLL is ranked higher than WLLWWL). The earlier win is ranked higher, because that way, there is no possibility a team is ranked lower than someone they beat. NB: This only works if everyone debates within brackets. It fails to hold as a true assumption for pull-up rounds, so this system would fail utterly for round robins.

However, it is not the pull-up problem that scuppers this system (I think it's somewhat solvable for one-win pull-up rounds with half-points). The problem is that this system produces a ranking that is consistent with the actual win-loss results; it does not necessarily produce the most desirable ranking. True, teams are never ranked below someone they beat, but this system overdoes it: teams are ranked by the order in which they won rounds. What about an excellent 4-2 team with great speaker points who loses the first two random rounds to the top 4-2 teams? To the good, this excellent 4-2 team will be ranked below the top two 4-2 teams (because LLWWWW is worth fewer points than, for example, WLWWWL and LWWWLW); to the excess, this team will be ranked below EVERY 4-2 team who happened to win either of the first two rounds. It's throwing the baby out with the bath water. Furthermore, it makes no sense to use weighted wins as a third or fourth tie-breaker. If speaker points are the first tie-breaker, then it becomes possible to rank a team below someone they have beaten. It's an all-or-nothing solution, and no one would think that the rankings it produces are good.

The math behind this problem turns out to be more interesting than the originally intended application. Some patterns emerge in any set of numbers that work for weighted wins. In the set of numbers {a, b, c, d, e, f}, the terms must be decreasing, so a > b > c > d > e > f is the first condition. However, not just any set of decreasing terms will work. For example, if the weighted win points were {1, 0.9, 0.8, 0.7, 0.6, 0.5}, the system breaks down: for two 2-0 teams, a round 1 and round 6 win totals 1.5 points, but it should beat a round 2 and round 3 win, which totals 1.7 points. Therefore, the second, trickier condition is that the decreases must decrease, that is, while the slope is negative (negative first derivative) the set must have upward concavity (positive second derivative). Going to my original set of numbers {1, 0.9, 0.89, 0.889, 0.8889, 0.88889}, the terms are decreasing, meeting the first condition, and the decrease is decreasing {-0.1, -0.01, -0.001, -0.0001, -0.00001}, meeting the second condition.

Looking at the chart, every unmarked place is one where any set of numbers that meets only the first condition will be true. For example, on the line of round 3, as long as a > b > c, then a+b > a+c > b+c; it's analytically true. Every marked place is one where a set of numbers that meets only the first condition might fail; only sets meeting both conditions must be true. For example, on the line of round 4, a+d must be > b+c only if the second condition is also met. I made up a term for a set that meets the conditions of this problem: an "inequality ordered set." The definition would be: a set of numbers, such that for any two equally-sized subsets created without repetition of terms, the subset with the single largest term always has the largest sum. For example, a+f > b+c and a+e+f > b+c+d.

I wondered whether it is possible to create an infinitely long "inequality ordered set." Two possible ways to create such an infinite set: {1, (1/x), (1/x^2), ...} and {1, 1-(1/x), 1-(1/x)-(1/x^2), ...}. For the first kind, the smallest x that still works is about 2. For the second kind, the smallest x that still works is about 1.25. I'm working on writing a proof of these.