A Numbers Game did some interesting work here on the side bias of various college debate topics, specifically controlling for team strength. In that spirit, I used a different method to see how the results compared.

I used a matched pairs method: for each team, there is a matched pair of results, namely that team's win percentage on the affirmative and its win percentage on the negative. If the two results show no difference, then the team did equally well (or equally poorly) on both sides of the topic. If they do show a difference, there are three possible explanations: (1) the team isn't equally strong on both sides of the topic, e.g., the 2A isn't as good as the 2N; (2) the team hit an unequal set of opponents on the two sides; or (3) there is side bias on the topic. When one looks at all the teams on a topic, (1) is unlikely because the whole point is to control for team strength by assuming that the average team is equally strong on either side, (2) cancels out across the entire pool, and (3) is left as the most plausible explanation. This method still relies on the assumption of invariant strength (that a team has a fixed strength, the same on both sides of the topic and unchanging throughout the year), but so does any other method that attempts to control for team strength. With those disclaimers, here are the results:

The third column shows the mean of the matched pairs computation: for each team, I subtracted its negative win percentage from its affirmative win percentage, and I averaged this score over all the teams that year. The fourth column shows a calculated (not actual) affirmative win rate. These compare closely to the actual rates A Numbers Game already found. The two highlighted topics differ slightly. For the China topic, my analysis suggests an affirmative win rate slightly lower than the actual rate. For the courts topic, my analysis suggests that the negative had an advantage, while the actual rate showed an affirmative advantage.
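The matched-pairs statistic is simple to compute. The sketch below assumes a hypothetical `TeamRecord` holding each team's wins and rounds on each side, and skips teams that never debated one side (their spread is undefined). The post does not say exactly how the fourth column was calculated; `implied_aff_rate` uses one natural conversion, which is an assumption here: if an average team wins a fraction p of its aff rounds and, by the same bias, 1 - p of its neg rounds, the expected spread is p - (1 - p) = 2p - 1, so p = (1 + d) / 2.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TeamRecord:
    """Hypothetical season record for one team."""
    aff_wins: int
    aff_rounds: int
    neg_wins: int
    neg_rounds: int


def spread(t: TeamRecord) -> float:
    """Aff win percentage minus neg win percentage, between -1 and 1."""
    return t.aff_wins / t.aff_rounds - t.neg_wins / t.neg_rounds


def mean_spread(teams: list[TeamRecord]) -> float:
    """Unweighted mean of per-team spreads; teams missing a side are skipped."""
    return mean(spread(t) for t in teams if t.aff_rounds and t.neg_rounds)


def implied_aff_rate(d: float) -> float:
    """Assumed conversion: a mean spread d = p - (1 - p) gives p = (1 + d) / 2."""
    return (1 + d) / 2


teams = [TeamRecord(3, 4, 2, 4), TeamRecord(1, 2, 1, 2), TeamRecord(5, 8, 4, 8)]
d = mean_spread(teams)
print(d, implied_aff_rate(d))  # 0.125 0.5625
```

A mean spread of 0 maps to a 50% affirmative win rate; a positive mean spread maps to an affirmative advantage.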

Category 1 is roughly the 0-50th percentile (in terms of rounds of competition); category 2 is the 50-75th; category 3 is the 75-87th; category 4 is the 87-94th; and category 5 is the 94-100th. You can see the results clearly in both years: the less experienced teams had greater success on the negative, while the more experienced teams did better (relatively or absolutely) on the affirmative. The originally calculated affirmative win rate was too low because the many less experienced teams bring down the average matched comparison, yet they debate few rounds, so they do not have a big effect on the total ballot count. (I looked at the same tables for other years, but there were no patterns as clear as those for '05-06 and '06-07.)
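The weighting effect described above can be illustrated with a short sketch. The helper names are hypothetical, and treating the percentile as a team's rank by rounds debated is an assumption (ties are broken arbitrarily by sort order). The unweighted mean counts every team once, while weighting by rounds approximates what the total ballot count sees.

```python
# Approximate category boundaries from the post, as fractions of teams.
CUTOFFS = (0.50, 0.75, 0.87, 0.94)


def categorize(rounds: list[int]) -> list[int]:
    """Experience category (1-5) for each team, ranked by rounds debated."""
    order = sorted(range(len(rounds)), key=lambda i: rounds[i])
    cats = [0] * len(rounds)
    for rank, i in enumerate(order):
        pct = rank / len(rounds)  # fraction of teams ranked below this one
        cats[i] = 1 + sum(pct >= c for c in CUTOFFS)
    return cats


def unweighted_and_weighted(spreads: list[float], rounds: list[int]) -> tuple[float, float]:
    """Per-team mean spread vs. mean spread weighted by rounds debated."""
    unweighted = sum(spreads) / len(spreads)
    weighted = sum(s * r for s, r in zip(spreads, rounds)) / sum(rounds)
    return unweighted, weighted


spreads = [-0.5, -0.5, -0.5, 0.2]
rounds = [2, 2, 2, 30]
print(categorize(rounds))
print(unweighted_and_weighted(spreads, rounds))
```

With three two-round teams at -0.5 and one thirty-round team at +0.2, the per-team mean is -0.325 while the round-weighted mean is about +0.08, so the two measures can even disagree on which side is favored.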

One final note: all of the full-year analyses are significant at the 95% level or better except for '06-07, which is only significant at about the 85% level. I used a 1-sample t-test, since the distributions are more or less normal:
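A minimal sketch of the test, assuming `spreads` holds the per-team aff-minus-neg values and the null hypothesis is a mean spread of 0 (no side bias). To stay within the standard library, it approximates the p-value with a normal distribution, which is close to the t distribution when there are hundreds of teams; `scipy.stats.ttest_1samp(spreads, 0.0)` would give the exact t-based p-value. A p-value below 0.05 corresponds to significance at the 95% level.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_t(spreads: list[float], mu0: float = 0.0) -> tuple[float, float]:
    """t statistic and two-sided p-value for H0: mean spread == mu0.

    The p-value uses a normal approximation to the t distribution (an
    assumption that is reasonable only when there are many teams).
    """
    n = len(spreads)
    se = stdev(spreads) / math.sqrt(n)  # standard error of the mean
    t = (mean(spreads) - mu0) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


t, p = one_sample_t([0.4, 0.5, 0.6] * 20)
print(f"t = {t:.1f}, p = {p:.3g}, significant at 95%: {p < 0.05}")
```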

(This is '08-09, but all the distributions look like this.) In fact, the distribution looks a little tighter than the normal curve (given the population's mean and standard deviation), since so many teams have an aff-neg win spread of exactly 0. The cumulative frequency graph is even more persuasive:

The blue line represents a normal CDF, given the population's ('08-09) mean and standard deviation. Anything below the line on the left or above the line on the right is tighter than normal. The reason is that the standard deviation is inflated by the teams that competed for few rounds, who had very high variability in the spread (anywhere from -1 to 1). For '08-09, for example, the standard deviation for the whole population is 0.31; excluding teams with fewer than 9 rounds of experience, it is 0.22.
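Both checks in this paragraph can be sketched with the standard library. The function names are hypothetical; `pstdev` is used because the post describes population statistics, and the empirical CDF at each value counts all teams at or below it, so tied spreads of 0 produce the vertical jump visible in a cumulative frequency graph. "Tighter than normal" then means empirical below normal for values under the mean and empirical above normal for values over it.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def ecdf_vs_normal(spreads: list[float]) -> list[tuple[float, float, float]]:
    """(value, empirical CDF, fitted normal CDF) for each distinct spread."""
    xs = sorted(spreads)
    fitted = NormalDist(mean(xs), pstdev(xs))
    n = len(xs)
    return [(x, bisect_right(xs, x) / n, fitted.cdf(x)) for x in sorted(set(xs))]


def sd_excluding_low_rounds(spreads: list[float], rounds: list[int], min_rounds: int = 9) -> float:
    """Population SD of spreads, keeping only teams with at least min_rounds."""
    kept = [s for s, r in zip(spreads, rounds) if r >= min_rounds]
    return pstdev(kept)


spreads = [-1.0, -0.2, 0.0, 0.0, 0.0, 0.1, 1.0]
rounds = [2, 14, 20, 12, 16, 24, 3]
for x, emp, norm in ecdf_vs_normal(spreads):
    print(f"{x:+.1f}  empirical={emp:.2f}  normal={norm:.2f}")
print(pstdev(spreads), sd_excluding_low_rounds(spreads, rounds))
```

In this made-up example the two extreme spreads belong to low-round teams, so dropping them shrinks the standard deviation sharply, the same direction as the 0.31 to 0.22 drop reported for '08-09.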

Based on A Numbers Game's question, I created a Lorenz curve for '08-09:

A few data points in words:

The top 5% of teams debated 19% of all rounds.

The top 10% of teams debated 34% of all rounds.

The top 20% of teams debated 54% of all rounds.

The top half of teams debated 83% of all rounds.
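These figures read directly off a Lorenz curve of rounds per team. A minimal sketch, assuming `rounds` is a hypothetical list with one entry per team; rounding `top_fraction * len(rounds)` to a whole number of teams is also an assumption.

```python
def top_share(rounds: list[int], top_fraction: float) -> float:
    """Share of all rounds debated by the most active top_fraction of teams."""
    ranked = sorted(rounds, reverse=True)
    k = round(top_fraction * len(ranked))  # number of teams in the top group
    return sum(ranked[:k]) / sum(ranked)


def lorenz_points(rounds: list[int]) -> list[tuple[float, float]]:
    """Lorenz curve: (cumulative share of teams, cumulative share of rounds),
    accumulating from the least active team to the most active."""
    ranked = sorted(rounds)
    total = sum(ranked)
    points = [(0.0, 0.0)]
    running = 0
    for i, r in enumerate(ranked, start=1):
        running += r
        points.append((i / len(ranked), running / total))
    return points


rounds = [1] * 8 + [10, 30]  # hypothetical: eight one-round teams, two busy ones
for frac in (0.1, 0.2, 0.5):
    print(f"top {frac:.0%} of teams debated {top_share(rounds, frac):.0%} of rounds")
```

The "top x% of teams" figures are the complement of the Lorenz curve: the top x% debated 1 - L(1 - x) of all rounds, where L is the cumulative share at 1 - x.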

Because rounds debated by less active teams are weighted so heavily compared to rounds debated by more active teams, I'm not sure how you did your statistical test. Do you expect the aff win percentages to follow a nearly normal distribution? I'm not sure that would be justified, given the number of teams with few rounds.

I used a 1-sample t test. My thought is that the sample has less variability (because of the very experienced teams) than the calculated s.d., which means that the confidence levels are too low, not too high. Or do I have that backwards?

As I understand it, the t test expects the sample to be normally distributed. I don't think it is valid otherwise.

What do you mean by "less variability than the calculated s.d."? What is "variability"? Is it the same as variance? Which calculated s.d.? The sample standard deviation of the side win percentage difference?

The variance of the sample of aff win percentages is probably decreased by the teams with a large number of rounds. I think this doesn't make the t test accurate. It will, I think, make the expected distribution less normal.

I might be wrong, but I don't think you can do a t test as an underestimate of statistical significance of a non-normal statistic.

Thanks for the clarification above. I think I wasn't very clear in expressing my concern. I wasn't convinced that even the expected (rather than the actual) distribution of the statistic "per-team difference of aff and neg win percentages" would be normally distributed.

Whether the expected distribution would be normally distributed or not, the actual distribution is very close to normal, so a t test seems like a pretty good way to check for side bias.

Another question your analysis inspires me to ask is "Who debates many rounds?" Is it only the best teams? What does the Lorenz curve look like? What if you correct for elim depth, or geography, or budget? Do good teams at schools that already have a few better teams get fewer rounds because they don't get to travel as much? Would those teams get more rounds if they went to a school with a weaker program?

Fascinating questions. I went ahead and added a Lorenz curve. I haven't done the other analyses yet, but they are very interesting.
