Thursday, December 31, 2009

A new measure of team strength: transitive win/loss record

I've been thinking about and working for a while on a more accurate method of estimating a team's strength. Bear in mind, I'm not talking about the method for generating final preliminary rankings. The final ranking method is unlikely to ever change, which is not really a bad thing. There's a transparency to it that we're all rightfully attached to: wins and total speaker points may not be the most accurate way to assess a team's strength, but it does seem the most just: those are the wins and points a team earned. So, I'm interested in how to estimate a team's strength only in order to make better power matches during the prelims, not to rank teams and decide which ones break or don't break.

With that caveat stated, here's the first problem with win/loss record as a measure of team strength: no team debates a representative sample of the teams at the tournament. Every team debates six opponents out of n teams at the tournament. Except for round robins and very small tournaments, the proportion isn't very large. At a normal-sized tournament, it might be under 10%.

Recognizing this, tournaments do not use randomly selected opponents. Brackets select a subset of opponents for teams to debate, and the results are more informative than if opponent selection is random. While there are many pathways through the tournament (e.g., WWWLLL versus WLWLWL), usually the key is what caliber of opponent a team beats and by what caliber of opponent a team is beaten. For example, a team who beats a 2-4 and losses to a 4-2 will, because of the way the brackets work, most often end up with a 3-3 record, revealing that this team is in the middle third (between the 34th and 66th percentile) of the tournament. Of course, sometimes the brackets don't work perfectly in this way; for example, a team that beats a 4-2 might end up with a 3-3 record. Clearly, in this case, the overall win/loss record is not an accurate reflection of one or both teams' strength. I began to think about whether another record that it is a better measure.

I think the answer is a "transitive" win/loss record. Rather than give a team a win/loss record only for direct wins and losses, instead give every team a win/loss record for both direct and transitive wins and losses. What is a transitive win? If team a beats b (a > b) and team b beats c (b > c), then team a gets the direct win against b and a transitive win against c (a > b > c). The same holds true for losses. If team a losses to d (a < d) and d losses to e (d < e), then team a gets a direct loss against d and a transitive loss against e (a < d < e). The process is repeated until every transitive win or loss by any number of steps is recorded. Now, before complaining that this isn't fair and team a didn't actually beat c or lose to e, bear in mind my caveat: this isn't a method for final rankings. It's a method for estimating a team's strength. The question is not whether this is fair, but whether it is accurate, and the answer is that a team's transitive win/loss record is a more accurate measure of a team's true strength than its direct win/loss record (traditional). Transitive win/loss record gives a better sense of where a team fits compared to the entire pool.

I looked at a small tournament (33 teams) to start. The top team could be traced to 4 direct and 26 transitive wins and to 0 direct (or transitive, therefore) losses. The bottom team could be traced to 0 direct wins and to 4 direct and 21 transitive losses. The median team could be traced to 2 direct and 7 transitive wins and to 2 direct and 7 transitive losses. (There are notes on the algorithm for calculating transitive wins at the end of this post, but for now, I wanted to focus on the results.) I did think at first about using this transitive win/loss record to calculate a percentile like so: (direct and transitive wins) / (direct and transitive wins + direct and transitive losses). The top team would be 100th percentile; the bottom team 0th; and the median team happened to be exactly 50th by this calculation. One nice result is that, by this calculation, every team is ranked above every opponent it beat. If team a beats b, then team a's direct and transitive wins must be greater than b's, since a gets "credit" for all of b's wins, and therefore b's wins are a subset of a's. Likewise for transitive losses: if a beats b, a's losses must be smaller than b's. (This calculation also is a more accurate measure of a team's overall schedule strength than the sum of all opponents wins, since teams get "credits" only for opponents beaten and actually receive "demerits" for opponents that defeat it. For these two reasons, I think this calculation is a good candidate to replace opponent wins as a tie-breaker in final rankings.) I was almost tempted to use this calculation as the measure by itself, except for the River Gimbel problem.

River Gimbel was a debater at this tournament. For most of the competitors, direct and transitive wins + direct and transitive losses was in the high teens -- 16 to 21 was the IQR. Given that the tournament was only 33 teams, these competitors could be compared, directly or transitively, to half or more of the entire pool. River Gimbel could be compared to six. River lost to two of the top teams (thus acquiring two direct losses and two transitive losses) and won against two of the bottom teams (two direct wins and zero transitive wins). The calculation placed him in the 33rd percentile. This result hardly seems accurate. At the worst extreme, he was the 3rd worst team (since at a minimum two teams were below him), or 9th percentile. At the best extreme, he was the 5th best team (since at a minimum four teams were above him), or 88th percentile. That's a wide range. The problem is that this calculation does not recognize the real lack of information due to his bizarre schedule. So, an adjustment:


The first line just represents the direct and transitive wins record, scaled to the size of the tournament (total teams) rather than to the "size" of a debater's direct and transitive record (wins + losses).

If the first line is small, as with any debater at the beginning of the tournament or River Gimbel at the end, the second line fills in the gap. The first factor represents the fraction of all the opponents at the tournament against which a team has no direct or transitive comparison. The second factor provides a best guess about a team's strength based on the average of its percentile for speaker points and its percentile for judge variance. Based on this, River Gimbel moved up to 18th place, or 48th percentile, because he had slightly below average speaker points.

There are three nice features of this calculation. First, even if River Gimbel had the worst speaker points at the tournament, with this calculation he would have received a 9th percentile estimate; and even if he had the best speaker points at the tournament, he would have received a 88th percentile estimate. In other words, this calculation always stays within the range created by the transitive win/loss record. (I leave the algebraic proof of this to the reader.) The transitive win/loss record creates a definite range -- e.g., worse than at least four teams, better than at least two -- and speaker points are used to estimate a position within that range.

Second, it is rare that a team is ranked below an opponent they beat. Basically, it can only happen if it was a low-point win, although not all low-point wins are sufficient to cause this. This seems reasonable: a high-point loser might be the stronger team overall, the loss a result of a dumb mistake rather than an overall lack of skills.

Third, in the first few rounds, the second line is more important, and speaker points are the primary way to assess strength. However, after several rounds, the first line is more important, and transitive win/loss records are the primary way to assess strength. In other words, at the beginning of the tournament, this calculation gives a reasonable guess about a team's strength, but as the tournament progresses, the ranges given by transitive records rapidly become tighter and the estimate of a team's strength converges to the truth.

The problem with evaluating any estimate of a team's strength is that, well, the team's true strength is unknown. Based on what I knew about the teams at this sample tournament, the results were pretty spot on. One could take a larger tournament, calculate estimates of team strength based on all the prelims, and see if this holds up during the elimination rounds -- but elimination rounds are few, and they introduce the confounding variables of new cases and anxiety. I'm up for anyone's suggestion of how to test it out.


Notes on calculating transitive wins and losses:

1. First, create an incidence matrix for the tournament. All the teams are listed along the rows; all the teams are listed along the columns. A score of 1 in (i, j) represents a win by team i over team j. All other cells should be blank.

2. Second, the rows and columns (using the same sorting system) must be sorted to produce a triangular matrix. This is a first stab at ranking the teams in descending order (from top to bottom and from left to right).The diagonal represents a team debating itself, and all these cells are obviously blank. Ideally, all 1s should be in the upper right. Any 1s in the lower left represent a problem with the sorting -- a team has beaten an opponent ranked higher. Ones in the lower left may be easy to resolve by a simple re-ranking of the teams. Alternatively, a 1 might represent part of a cycle (a > b, b > c, a < c) that cannot be resolved by a simple re-ranking. All the 1s must be above the diagonal, which means that all cycles have to be broken. Breaking a cycle means that at least one of the results must be deleted, a 1 turned into a blank cell. Often, speaker points or some other tie-breaker is used to determine which win to override.

3. Third, the triangonlized matrix is raised to the second power. The resulting non-empty cells show first-order transitive wins. The triagonlized matrix is raised to the third. The resulting non-empty cells show second-order transitive wins (a > b > c > d, so a > d). The triagonlized matrix is raised to each successive power, until all cells are blank. (Cycles never terminate, and the process would continue infinitely.) Once the various matrices are added together -- [A]^1 + [A]^2 + [A]^3... -- then the transitive wins can be found by counting all the non-empty cells in a team's row, and the transitive losses can be found by counting all the non-empty cells in a team's column.

Tuesday, December 15, 2009

Pascal's triangle (modified)

I started thinking about this problem in a specific debate context (for a specific application), before realizing that it isn't so useful after all. Still, the math is quite interesting.

Here's the original question that got me thinking: Is there a way to assign weighted wins -- round 1 wins count for so much, round 2 wins count for something different, and so on -- such that a tournament can produce a ranking that is consistent with the actual results, such that if team A beats team B, and they finish with the same win-loss record, A must be ranked above B? (Thanks to Steve Gray for coming up with the original idea to use weighted wins this way.) The point values for each round must be decreasing for this to work. If team A beats team B in round 1 for 1 point, then all subsequent point values must be smaller, or else B would have the chance to tie (with an equal number of wins) or surpass A (with an equal number of wins -- of course, B should surpass A if it wins more total rounds than A). I tested out a specific set of weighted win points that meets both of these conditions: {1, 0.9, 0.89, 0.889, 0.8889, 0.88889}. I'll write more on why this particular set works later in this post. The results of the test follows:


Pick any point where you might compare two teams, and you can see that you can always tell who got there by winning the earlier round(s). As you can see from the bottom row, it is possible to distinguish every one of the 64 possible pathways (2^6) that a team could take through the tournament (e.g., WLLWWL), and the results line up nicely in order -- that is, an earlier win always ranks a team higher (e.g., WWLWLL is ranked higher than WLLWWL). The earlier win is ranked higher, because that way, there is no possibility a team is ranked lower than someone they beat. NB: This only works if everyone debates within brackets. It fails to hold as a true assumption for pull-up rounds, so this system would fail utterly for round robins.

However, it is not the pull-up problem that scuppers this system (I think it's somewhat solvable for one-win pull-up rounds with half-points). The problem is that this system produces a ranking that is consistent with the actual win-loss results; it does not necessarily produce the most desirable ranking. True, teams are never ranked below someone they beat, but this system overdoes it: teams are ranked by the order in which they won rounds. What about an excellent 4-2 team with great speaker points who loses the first two random rounds to the top 4-2 teams? To the good, this excellent 4-2 team will be ranked below the top two 4-2 teams (because LLWWWW is worth fewer points than, for example, WLWWWL and LWWWLW); to the excess, this team will be ranked below EVERY 4-2 team who happened to win either of the first two rounds. It's throwing the baby out with the bath water. Furthermore, it makes no sense to use weighted wins as a third or fourth tie-breaker. If speaker points are the first tie-breaker, then it becomes possible to rank a team below someone they have beaten. It's an all-or-nothing solution, and no one would think that the rankings it produces are good.

The math behind this problem turns out to be more interesting than the originally intended application. Some patterns emerge in any set of numbers that work for weighted wins. In the set of numbers {a, b, c, d, e, f}, the terms must be decreasing, so a > b > c > d > e > f is the first condition. However, not just any set of decreasing terms will work. For example, if the weighted win points were {1, 0.9, 0.8, 0.7, 0.6, 0.5}, the system breaks down: for two 2-0 teams, a round 1 and round 6 win totals 1.5 points, but it should beat a round 2 and round 3 win, which totals 1.7 points. Therefore, the second, trickier condition is that the decreases must decrease, that is, while the slope is negative (negative first derivative) the set must have upward concavity (positive second derivative). Going to my original set of numbers {1, 0.9, 0.89, 0.889, 0.8889, 0.88889}, the terms are decreasing, meeting the first condition, and the decrease is decreasing {-0.1, -0.01, -0.001, -0.0001, -0.00001}, meeting the second condition.

Looking at the chart, every unmarked place is one where any set of numbers that meets only the first condition will be true. For example, on the line of round 3, as long as a > b > c, then a+b > a+c > b+c; it's analytically true. Every marked place is one where a set of numbers that meets only the first condition might fail; only sets meeting both conditions must be true. For example, on the line of round 4, a+d must be > b+c only if the second condition is also met. I made up a term for a set that meets the conditions of this problem: an "inequality ordered set." The definition would be: a set of numbers, such that for any two equally-sized subsets created without repetition of terms, the subset with the single largest term always has the largest sum. For example, a+f > b+c and a+e+f > b+c+d.

I wondered whether it is possible to create an infinitely long "inequality ordered set." Two possible ways to create such an infinite set: {1, (1/x), (1/x^2), ...} and {1, 1-(1/x), 1-(1/x)-(1/x^2), ...}. For the first kind, the smallest x that still works is about 2. For the second kind, the smallest x that still works is about 1.25. I'm working on writing a proof of these.

Sunday, September 13, 2009

Topic side bias

A Numbers Game did some interesting work on the side bias of various college topics, here, specifically, controlling for team strength. In that spirit, I decided to use a different method and see how the results compared.

I used a matched pairs method: for each team, there is a matched pair of results: that team's win percentage on the affirmative, and that team's win percentage on the negative. If the two results show no difference, then the team did equally well (or equally poorly) on both sides of the topic. If the two results do show a difference, there are three possible explanations: (1) the team isn't equally strong on both sides of the topic, e.g., the 2A isn't as good as the 2N; (2) the team hit an unequal set of opponents on the two sides; or (3) there is side bias on the topic. When one looks at all the teams on a topic, (1) is unlikely because the whole point is to control for team strength by assuming that the average team is equally strong on either side, (2) cancels out when one looks at the entire pool, and (3) is left as the most plausible outcome. Although this method still relies on the assumption of invariant strength (that a team has a fixed strength, the same on both sides of the topic, unchanging throughout the year), so does any other method that attempts to control for team strength. With those disclaimers, here are the results:


The third column shows the mean of the matched pairs computation: for each team, I subtracted its negative win percentage from its affirmative win percentage, and I averaged this score over all the teams that year. The fourth column shows a calculated (not the actual) affirmative win rate. They compare closely to the actual rates A Numbers Game already found. The two that are highlighted differ slightly. The affirmative win rate for the China topic my analysis suggests is slightly lower than the actual rate. The affirmative win rate for the courts topic my analysis suggests that the negative had an advantage, while the actual rate showed an affirmative advantage.


Category 1 is roughly the 0-50th percentile (in terms of rounds of competition); category 2 is 50-75th; category 3 is 75-87th; category 4 is 87-94th; and category 5 is 94-100th. You can see the results clearly in both: the less experienced teams had greater success on the negative; the more experienced teams did better (relatively or absolutely) on the affirmative. The reason why the originally calculated affirmative win rate was too low was because there are so many more less experienced teams that bring down the average matched comparison -- but they debate few rounds, so they do not have a big effect on the total ballot count. (I looked at the same tables for other years, but there were no patterns as clear as '05-06 and '06-07.)

One final note: All of the whole years' analyses are significant to at least the 95% level except for '06-07, which is only significant at about the 85% level. I used a 1-sample t-test, since the distributions are more or less normal:



(This is '08-09, but all the distributions look like this.) In fact, the distribution looks a little tighter than the normal curve (given the population's mean and standard deviation), since so many teams have 0% aff-neg win spread. The cumulative frequency graph is even more persuasive:


The blue line represents a normal CDF, given the population's ('08-09) mean and standard deviation. Anything below the line on the left or above the line on the right is tighter than normal. The reason is that the standard deviation is pulled way out by the teams that competed for few rounds (who had very high variability in the spread, from -1 to 1). For '08-09 for example, the standard deviation for the whole population is 0.31; excluding teams with fewer than 9 rounds experience, 0.22.

Based on A Numbers Game's question, I created a Lorenz curve for '08-09:


A few data points in words:

The top 5% of teams debated 19% of all rounds.
The top 10% of teams debated 34% of all rounds.
The top 20% of teams debated 54% of all rounds.
The top half of teams debated 83% of all rounds.

Tuesday, August 25, 2009

Judge side bias -- the whole field

I've been intrigued by A Numbers Game's information about judge side bias. Rather than looking at particular judges, I wanted to look at the distribution of all the judges. I plotted the (prelim) judging records of all the college tournaments in '04-05 (data from Bruschke's Debateresults.com)



(Excel workbook.) The vertical axis shows how many prelim rounds a judge judged; the horizontal axis shows the percentage of aff wins a judge gave. As you can see, the shape of the dot plot shows a clear pattern: as judges see more rounds, they cluster more tightly around the mean [.4840, according to A Numbers Game], and the plot narrows at the top. This is as it should be: the more rounds you judge, the less likely you get lots and lots of strong aff. teams (or weak aff. teams) by chance. But how to quantify this pattern and ask exactly how it compares to what we'd expect to see by chance?

What makes this challenging is that the sample sizes differ for each judge. Some judges judged 80 rounds, some two. Normally, you'd expect to have a standard sample size. So I calculated the probabilities at various sample sizes, specifically, at a six-round increment to make the lines. The green lines mark the 68% confidence interval (roughly 1 standard deviation above and below the mean ); the yellow lines the 95% confidence interval (roughly 2 standard deviations); and the red lines mark the 99.8% confidence interval (roughly 3 s.d.s). This is an unusual method, so let me explain what I think it shows. If every judge saw 36 rounds, then we would expect 68% of those judges voted aff. between 40 and 60% of the time (you can trace to see that the green lines do indeed pass through these points). So far, so good. My somewhat original assumption is that you can add these slices up: since for x rounds judged, 68% of those judges in that horizontal slice should be between the green lines, then for all the judges in the whole vertical column, 68% should be between the green lines.

It turns out a little worse than expected, at 33.5%, but not by much. And, as A Numbers Game has shown, this is about the same for the other years, too.

Friday, August 7, 2009

Another way to visualize strength-of-schedule pairings

I've written about a type of pairing I call a strength-of-schedule pairing. The basic idea is that a team that has had a weak schedule so far (measured by opp. wins, opp. speaks, etc.) gets a strong opponent for the next round. (Of course, all of this is within a bracket, so weak and strong are relative to the average team strength and schedule strength in that bracket.) It sounds simple enough, but it has to work both ways -- both teams get the opponent they deserve in each other.

It hit me how to help people visualize how this pairing method differs from the traditional. First, picture a Cartesian grid, depicting one bracket, like so:


A position along the x-axis shows a team's strength (say, speaker points) above or below the average, 0, of the bracket. (If it helps, you can think of these as standard deviations above or below the mean; or, you can think of these as speaker points above or below 28.) A position along the y-axis shows a difficulty of a team's schedule above or below the average. Quadrant I contains the good teams in the bracket that have had tough schedules; quadrant IV contains the good teams that have had easy schedules.

Ideally, you want debaters from quadrant I (strong teams, strong schedules) to face debaters from quadrant III (weak teams, weak schedules), debaters from quadrant II (weak teams, strong schedules) to face each other, and debaters from quadrant IV (strong teams, weak schedules) to face each other -- to even everything out. Let's look at how the two traditional methods fare. I generated 16 random points on the grid, and based on their scores, power-matched them using these two methods.

First, high-high-pairings:


The best team faces the second best; third, the fourth; on down to the second worst facing the worst. Two debates are matched between quadrants I and IV. These are unfair to quadrant I teams, who are good teams, with tough schedules, facing yet another good team. There are four debates between quadrant II and III. These are unfair to quadrant III teams, who are weak teams, with easy schedules, facing yet another weak team. I'd say there's really only one good match: the quadrant IV team to the other quadrant IV team. All in all, many of these debates are likely to exacerbate the range of schedule strength teams face. Score: 1/8.

Second, let's look at the high-low pairing method, using the same 16 points:


The best team debates the lowest, then the second best debates the second worst, and so on. This is not much of an improvement in terms of equalizing schedule strength. There are two debates between quadrant I to II -- unfair to quadrant II teams, weak teams, tough schedules, facing another good opponent. There are two debates between quadrant III and IV -- unfair to quadrant IV teams, good teams, easy schedules, facing another weak opponent. And there are two that are truly wretched: quadrant II (weak teams, tough schedules) to quadrant IV (strong teams, easy schedules) -- unfair to both quadrant II and IV teams!! There are really only two good matches, between quadrant I and III teams. Score: 2/8.

Here's what a strength-of-schedule pairing looks like, for the same points:


I didn't tweak it! This is what came out of my algorithm. All eight matches accord with the preferences I spelled out: Is debate IIIs, IIs debate IIs, and IVs debate IVs. I added in the dotted line to show that most of them have this rough symmetry, where the x score of one is nearly as possible -y of the other, and vice versa. Given the random distribution of the points, it's pretty darn good. All in all, this will equalize as much as possible the schedule strength faced by each team in this bracket. Tough schedule? Weak opponent. Easy schedule? Strong opponent. Score: 8/8.

Tuesday, August 4, 2009

Network graph theory and debate tournaments

I recently saw A Numbers Game's charts of debate tournaments, which look like modified network graphs, and it inspired me to post about some thoughts I'd had a few years ago about graph theory and debate tournaments. (Network) graph theory is a fascinating branch of mathematics, and it is directly applicable to debate tournaments. Basically, a network graph is anything that maps all the pathways (edges) between some nodes (vertices). A network graph of a debate tournament shows the match-ups (edges) between the teams at the tournament (vertices). An edge without an arrow would just show a match-up between two teams; an edge with an arrow would indicate the winner of it. A network graph can show the results of all six (or eight or whatever) preliminary rounds simultaneously. You can see a lot of interesting patterns that would be hard to notice any other way; I've used network graphs as a way to illustrate how a tournament played out.

You might think, since graphs are a complete mapping of all the preliminary round results, that they could be used to rank all the teams at a tournament from first seed to worst. However, if you tried to use a graph this way, you'd run into three different problems. Graphs are best for illustrative purposes (and one alternative use I'll suggest at the end).

The first problem with trying to use a graph to seed the teams is that you still need a lot of tiebreakers. Here's a simple example:



As indicated by the graph, team A beat B and C, and teams B and C both beat D. The problem is that B and C can't be ranked against each other from this information alone. Given the results, the final rankings must contain the sequences {A, B, D} and {A, C, D}, but both {A, B, C, D} and {A, C, B, D} satisfy this. The problem is that, except for round robins, there will always be teams that didn't meet at the tournament, and thus, pairs of vertices without edges.

Ok, so, you still need tiebreakers, but a network graph as a ranking mechanism presents a second, bigger problem: "contradictions." Let's say we run a really small tournament, and we must force A and B to debate twice. Here's one possible outcome that creates a contradiction:



A (aff.) beat B, but B (aff.) beat A. If we try to use the network graph to rank these two teams, we'd be forced to give two contradictory statements. (It's worth remembering that a team's strength is not invariant: teams can be better on one side than the other, they can have off rounds or lucky rounds, and judges also play a factor.) This is simple enough to resolve: we can use speaker points to rank one team higher than the other. Here's a slightly more complex version of the same problem:



A beat B (rd 1), B beat C (rd 2), and C beat A (rd 3). There's a contradiction here also. It will have to be resolved by "upsetting" one result, that is, a team must be ranked below an opponent whom they beat. For example, a final ranking might be {A, B, C}, which means that C's victory over A was "upset" or "overruled" because A had more overall wins, better speaker points, etc., than B or C. Bear in mind that doing a graph didn't create this problem; it merely exposed it. We're mostly oblivious to it because cume sheets show wins, points, etc., and don't provide a network graph. Look carefully through the results of nearly any tournament, and I'd be willing to beat you can find at least a few teams who are ranked below (based on overall record, speaks, etc.) an opponent they beat. Here's another example:



Any such type of contradiction, no matter how many teams, is known as a simple cyclical graph, if there's only one cyclical pathway. The good news is that, however long a simple cyclical graph, only one result has to be upset: {A, B, C, D} only upsets D's victory over A. (A complex cyclical graph would contain multiple, intersecting cycles and be a much bigger headache, but my hunch is that they are extremely rare.)

The third problem with using a network graph as a ranking mechanism is perhaps its biggest, if not mathematically, then for the debate community to stomach:



Notice the problem? There are no cycles; everything flows in only one direction. The rankings, according to the graph, are unambiguously {A, B, C, D, E, F}. Here it is again:



C beat D, but at any tournament, D would be ranked higher, because C is 1-2 and D is 2-1. In the situation I graphed, C is the better team but had a tougher schedule. (It's also possible to create an example in which D is the better team but just had an off-round or a crazy judge.) There's no cycle in the graph, so there's no need to have an upset, but I believe every debater and coach would say D should be ranked higher. Total wins as the first method of ranking is sacrosanct to the debate community, and with good reason. Total wins is clear and unambiguous; a network graph is abstract and complex. I know I would never want a tournament to start interpreting results with a network graph; it would create too much room for error when deciding breaks.

This last problem is insurmountable. I think the lesson is that we need to take every effort to make schedule strength more equitable, but there's no mathematical way to jigger results to somehow weight or re-interpret an unfair set of pairings. The cat's out of the bag, and we just have to say, "Sorry, tournaments aren't always fair," at that point. The bottom line is that network graphs will never and ought not be used for tournament rankings. But there's is an alternative use for graph theory: as a tiebreaker at round robins.

Let's say a six-team round robin has these final results:

Team Record
A 4-1
B 3-2
C 3-2
D 3-2
E 2-3
F 0-5

There's a three-way tie for second place. My suggestion is that round robins can use an idea from graph theory as the first tiebreaker, before resorting to speaker points. I believe that direct results ought to be respected as much as possible, and in round robins, you have direct results for every head-to-head match up. Who cares that team C spoke prettier than B or D if C lost both of those rounds? Here's the graph of this hypothetical round robin:



Now, you may notice that C beat A, which suggests that C is the best of the three, or you may focus on the fact that D beat both B and C, which suggests that D is the best. Both may catch your eye, but there's a way to quantify an exact result. But first, you need to turn the graph into an incidence matrix, like so:



A column represents a team's wins; a row, its losses. Hence, reading down column A, you see that team A beat B, D, E, and F; reading across column A, you see that team A lost to C. The incidence matrix contains all the information that the graph does; it's merely another representation of the same data. You might also notice that there's an easy way to double-check the matrix:



You can add to make sure you have the right number of wins in the column and losses in the row for each team. (The incidence matrix is easy to do for multi-ballot round robins; just put the ballots won into the correct spot for each team.)

Now, we need to test possible rankings. There are six possibilities created by the three-way tie: {A, B, C, D, E, F}, {A, B, D, C, E, F}, {A, C, B, D, E, F}, {A, C, D, B, E, F}, {A, D, B, C, E, F}, and {A, D, C, B, E, F}. We create an incidence matrix for each possible ranking, a matrix of the hypothetical results that perfectly consistent the ranking order with no upsets, ties, or ambiguity. For example, for the possible ranking {A, B, C, D, E, F}, the matrix is:



The matrix shows results that would be perfectly consistent with the ranking. Now we can calculate a score for how well this ranking corresponds to the actual tournament results:



Subtract the possible ranking matrix [R] from the actual tournament results [T] and count the -1s. (You can count the +1s also; they are symmetrical.) According to this ranking, there were four "upsets": that C actually beat A, that D actually beat B, that D actually beat C, and that E actually beat D. If a tournament accepts the ranking {A, B, C, D, E, F}, it "overturns" four direct results.

How do the other rankings compare? Using the same method for each "prediction," the rankings can be compared:

Rank order "Upsets"
{A, B, C, D, E, F} 4
{A, B, D, C, E, F} 3
{A, C, B, D, E, F} 5
{A, C, D, B, E, F} 4
{A, D, B, C, E, F} 2
{A, D, C, B, E, F} 3

The best ranking is obviously {A, D, B, C, E, F}. It upsets the fewest direct results. In fact, the two that are upset make perfect sense:



that C (3-2) actually beat A (4-1) and that E (2-3) actually beat D (3-2). In a sense, these were unavoidable (or you might even say real) upsets, not artifacts created by a poor ranking. Anyway, the point is, the three-way tie could be broken in this case without resorting to speaker points. This algorithm would be relatively easy to program into a tab program for round robins.

Of course, some ties are cannot be resolved by this method, namely, if the ties are really contradictions created by cycles. There's still a place for speaker points and other tiebreakers.

Saturday, June 20, 2009

Have you ever heard of Plato, Aristotle, Socrates?

Aristotle wrote one of the earliest systemizations of logic, his syllogisms. Unfortunately, he made two tiny oversights. An easy and fun logic lesson with students is to use Venn diagrams to check each syllogism, like so:


A full list of the syllogisms is in this document. Thanks to Keith Devlin for the idea.