Tuesday, August 25, 2009

Judge side bias -- the whole field

I've been intrigued by A Numbers Game's information about judge side bias. Rather than looking at particular judges, I wanted to look at the distribution of all the judges. I plotted the (prelim) judging records of all the college tournaments in '04-05 (data from Bruschke's Debateresults.com)



The vertical axis shows how many prelim rounds a judge judged; the horizontal axis shows the percentage of aff wins a judge gave. As you can see, the shape of the dot plot shows a clear pattern: as judges see more rounds, they cluster more tightly around the mean [.4840, according to A Numbers Game], and the plot narrows at the top. This is as it should be: the more rounds you judge, the less likely you get lots and lots of strong aff. teams (or weak aff. teams) by chance. But how to quantify this pattern and ask exactly how it compares to what we'd expect to see by chance?

What makes this challenging is that the sample sizes differ for each judge. Some judges judged 80 rounds, some two. Normally, you'd expect to have a standard sample size. So I calculated the probabilities at various sample sizes, specifically, at a six-round increment to make the lines. The green lines mark the 68% confidence interval (roughly 1 standard deviation above and below the mean ); the yellow lines the 95% confidence interval (roughly 2 standard deviations); and the red lines mark the 99.8% confidence interval (roughly 3 s.d.s). This is an unusual method, so let me explain what I think it shows. If every judge saw 36 rounds, then we would expect 68% of those judges voted aff. between 40 and 60% of the time (you can trace to see that the green lines do indeed pass through these points). So far, so good. My somewhat original assumption is that you can add these slices up: since for x rounds judged, 68% of those judges in that horizontal slice should be between the green lines, then for all the judges in the whole vertical column, 68% should be between the green lines.

It turns out a little worse than expected, at 33.5%, but not by much. And, as A Numbers Game has shown, this is about the same for the other years, too.

Friday, August 7, 2009

Another way to visualize strength-of-schedule pairings

I've written about a type of pairing I call a strength-of-schedule pairing. The basic idea is that a team that has had a weak schedule so far (measured by opp. wins, opp. speaks, etc.) gets a strong opponent for the next round. (Of course, all of this is within a bracket, so weak and strong are relative to the average team strength and schedule strength in that bracket.) It sounds simple enough, but it has to work both ways -- both teams get the opponent they deserve in each other.

It hit me how to help people visualize how this pairing method differs from the traditional. First, picture a Cartesian grid, depicting one bracket, like so:


A position along the x-axis shows a team's strength (say, speaker points) above or below the average, 0, of the bracket. (If it helps, you can think of these as standard deviations above or below the mean; or, you can think of these as speaker points above or below 28.) A position along the y-axis shows a difficulty of a team's schedule above or below the average. Quadrant I contains the good teams in the bracket that have had tough schedules; quadrant IV contains the good teams that have had easy schedules.

Ideally, you want debaters from quadrant I (strong teams, strong schedules) to face debaters from quadrant III (weak teams, weak schedules), debaters from quadrant II (weak teams, strong schedules) to face each other, and debaters from quadrant IV (strong teams, weak schedules) to face each other -- to even everything out. Let's look at how the two traditional methods fare. I generated 16 random points on the grid, and based on their scores, power-matched them using these two methods.

First, high-high-pairings:


The best team faces the second best; third, the fourth; on down to the second worst facing the worst. Two debates are matched between quadrants I and IV. These are unfair to quadrant I teams, who are good teams, with tough schedules, facing yet another good team. There are four debates between quadrant II and III. These are unfair to quadrant III teams, who are weak teams, with easy schedules, facing yet another weak team. I'd say there's really only one good match: the quadrant IV team to the other quadrant IV team. All in all, many of these debates are likely to exacerbate the range of schedule strength teams face. Score: 1/8.

Second, let's look at the high-low pairing method, using the same 16 points:


The best team debates the lowest, then the second best debates the second worst, and so on. This is not much of an improvement in terms of equalizing schedule strength. There are two debates between quadrant I to II -- unfair to quadrant II teams, weak teams, tough schedules, facing another good opponent. There are two debates between quadrant III and IV -- unfair to quadrant IV teams, good teams, easy schedules, facing another weak opponent. And there are two that are truly wretched: quadrant II (weak teams, tough schedules) to quadrant IV (strong teams, easy schedules) -- unfair to both quadrant II and IV teams!! There are really only two good matches, between quadrant I and III teams. Score: 2/8.

Here's what a strength-of-schedule pairing looks like, for the same points:


I didn't tweak it! This is what came out of my algorithm. All eight matches accord with the preferences I spelled out: Is debate IIIs, IIs debate IIs, and IVs debate IVs. I added in the dotted line to show that most of them have this rough symmetry, where the x score of one is nearly as possible -y of the other, and vice versa. Given the random distribution of the points, it's pretty darn good. All in all, this will equalize as much as possible the schedule strength faced by each team in this bracket. Tough schedule? Weak opponent. Easy schedule? Strong opponent. Score: 8/8.

Tuesday, August 4, 2009

Network graph theory and debate tournaments

I recently saw A Numbers Game's charts of debate tournaments, which look like modified network graphs, and it inspired me to post about some thoughts I'd had a few years ago about graph theory and debate tournaments. (Network) graph theory is a fascinating branch of mathematics, and it is directly applicable to debate tournaments. Basically, a network graph is anything that maps all the pathways (edges) between some nodes (vertices). A network graph of a debate tournament shows the match-ups (edges) between the teams at the tournament (vertices). An edge without an arrow would just show a match-up between two teams; an edge with an arrow would indicate the winner of it. A network graph can show the results of all six (or eight or whatever) preliminary rounds simultaneously. You can see a lot of interesting patterns that would be hard to notice any other way; I've used network graphs as a way to illustrate how a tournament played out.

You might think, since graphs are a complete mapping of all the preliminary round results, that they could be used to rank all the teams at a tournament from first seed to worst. However, if you tried to use a graph this way, you'd run into three different problems. Graphs are best for illustrative purposes (and one alternative use I'll suggest at the end).

The first problem with trying to use a graph to seed the teams is that you still need a lot of tiebreakers. Here's a simple example:



As indicated by the graph, team A beat B and C, and teams B and C both beat D. The problem is that B and C can't be ranked against each other from this information alone. Given the results, the final rankings must contain the sequences {A, B, D} and {A, C, D}, but both {A, B, C, D} and {A, C, B, D} satisfy this. The problem is that, except for round robins, there will always be teams that didn't meet at the tournament, and thus, pairs of vertices without edges.

Ok, so, you still need tiebreakers, but a network graph as a ranking mechanism presents a second, bigger problem: "contradictions." Let's say we run a really small tournament, and we must force A and B to debate twice. Here's one possible outcome that creates a contradiction:



A (aff.) beat B, but B (aff.) beat A. If we try to use the network graph to rank these two teams, we'd be forced to give two contradictory statements. (It's worth remembering that a team's strength is not invariant: teams can be better on one side than the other, they can have off rounds or lucky rounds, and judges also play a factor.) This is simple enough to resolve: we can use speaker points to rank one team higher than the other. Here's a slightly more complex version of the same problem:



A beat B (rd 1), B beat C (rd 2), and C beat A (rd 3). There's a contradiction here also. It will have to be resolved by "upsetting" one result, that is, a team must be ranked below an opponent whom they beat. For example, a final ranking might be {A, B, C}, which means that C's victory over A was "upset" or "overruled" because A had more overall wins, better speaker points, etc., than B or C. Bear in mind that doing a graph didn't create this problem; it merely exposed it. We're mostly oblivious to it because cume sheets show wins, points, etc., and don't provide a network graph. Look carefully through the results of nearly any tournament, and I'd be willing to beat you can find at least a few teams who are ranked below (based on overall record, speaks, etc.) an opponent they beat. Here's another example:



Any such type of contradiction, no matter how many teams, is known as a simple cyclical graph, if there's only one cyclical pathway. The good news is that, however long a simple cyclical graph, only one result has to be upset: {A, B, C, D} only upsets D's victory over A. (A complex cyclical graph would contain multiple, intersecting cycles and be a much bigger headache, but my hunch is that they are extremely rare.)

The third problem with using a network graph as a ranking mechanism is perhaps its biggest, if not mathematically, then for the debate community to stomach:



Notice the problem? There are no cycles; everything flows in only one direction. The rankings, according to the graph, are unambiguously {A, B, C, D, E, F}. Here it is again:



C beat D, but at any tournament, D would be ranked higher, because C is 1-2 and D is 2-1. In the situation I graphed, C is the better team but had a tougher schedule. (It's also possible to create an example in which D is the better team but just had an off-round or a crazy judge.) There's no cycle in the graph, so there's no need to have an upset, but I believe every debater and coach would say D should be ranked higher. Total wins as the first method of ranking is sacrosanct to the debate community, and with good reason. Total wins is clear and unambiguous; a network graph is abstract and complex. I know I would never want a tournament to start interpreting results with a network graph; it would create too much room for error when deciding breaks.

This last problem is insurmountable. I think the lesson is that we need to take every effort to make schedule strength more equitable, but there's no mathematical way to jigger results to somehow weight or re-interpret an unfair set of pairings. The cat's out of the bag, and we just have to say, "Sorry, tournaments aren't always fair," at that point. The bottom line is that network graphs will never and ought not be used for tournament rankings. But there's is an alternative use for graph theory: as a tiebreaker at round robins.

Let's say a six-team round robin has these final results:

Team Record
A 4-1
B 3-2
C 3-2
D 3-2
E 2-3
F 0-5

There's a three-way tie for second place. My suggestion is that round robins can use an idea from graph theory as the first tiebreaker, before resorting to speaker points. I believe that direct results ought to be respected as much as possible, and in round robins, you have direct results for every head-to-head match up. Who cares that team C spoke prettier than B or D if C lost both of those rounds? Here's the graph of this hypothetical round robin:



Now, you may notice that C beat A, which suggests that C is the best of the three, or you may focus on the fact that D beat both B and C, which suggests that D is the best. Both may catch your eye, but there's a way to quantify an exact result. But first, you need to turn the graph into an adjacency matrix, like so:



A column represents a team's wins; a row, its losses. Hence, reading down column A, you see that team A beat B, D, E, and F; reading across column A, you see that team A lost to C. The adjacency matrix contains all the information that the graph does; it's merely another representation of the same data. You might also notice that there's an easy way to double-check the matrix:



You can add to make sure you have the right number of wins in the column and losses in the row for each team. (The adjacency matrix is easy to do for multi-ballot round robins; just put the ballots won into the correct spot for each team.)

Now, we need to test possible rankings. There are six possibilities created by the three-way tie: {A, B, C, D, E, F}, {A, B, D, C, E, F}, {A, C, B, D, E, F}, {A, C, D, B, E, F}, {A, D, B, C, E, F}, and {A, D, C, B, E, F}. We create an adjacency matrix for each possible ranking, a matrix of the hypothetical results that perfectly consistent the ranking order with no upsets, ties, or ambiguity. For example, for the possible ranking {A, B, C, D, E, F}, the matrix is:



The matrix shows results that would be perfectly consistent with the ranking. Now we can calculate a score for how well this ranking corresponds to the actual tournament results:



Subtract the possible ranking matrix [R] from the actual tournament results [T] and count the -1s. (You can count the +1s also; they are symmetrical.) According to this ranking, there were four "upsets": that C actually beat A, that D actually beat B, that D actually beat C, and that E actually beat D. If a tournament accepts the ranking {A, B, C, D, E, F}, it "overturns" four direct results.

How do the other rankings compare? Using the same method for each "prediction," the rankings can be compared:

Rank order "Upsets"
{A, B, C, D, E, F} 4
{A, B, D, C, E, F} 3
{A, C, B, D, E, F} 5
{A, C, D, B, E, F} 4
{A, D, B, C, E, F} 2
{A, D, C, B, E, F} 3

The best ranking is obviously {A, D, B, C, E, F}. It upsets the fewest direct results. In fact, the two that are upset make perfect sense:



that C (3-2) actually beat A (4-1) and that E (2-3) actually beat D (3-2). In a sense, these were unavoidable (or you might even say real) upsets, not artifacts created by a poor ranking. Anyway, the point is, the three-way tie could be broken in this case without resorting to speaker points. This algorithm would be relatively easy to program into a tab program for round robins.

Of course, some ties are cannot be resolved by this method, namely, if the ties are really contradictions created by cycles. There's still a place for speaker points and other tiebreakers.