Saturday, April 7, 2018

Experimental verification of the logit score

One method for ranking teams that I introduced to the debate community is the logit score. Derived from a logistic regression, the logit score combines a team's record, its speaker points, and its opponents' strength into a single number. Because it factors in record and points, it is performance-based, but the record is adjusted for opponent strength, which makes the logit score fairer than record alone: a win against a good team is "worth" more than a win against a weak team. As a rough approximation, take the worst opponent a team beat and the best opponent it lost to, and average those together along with the team's average speaker points; the result is close to the team's logit score. More precisely, because of how it is calculated, the logit score is the team strength most likely to explain the team's results--its record and its points.
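To make the "most likely strength" idea concrete, here is a minimal sketch of one way such a model can be fit--a plain Bradley-Terry-style logistic model over win/loss results only. This is an illustration, not the exact formulation behind the logit score, and it leaves out speaker points for brevity.

```python
# Illustrative sketch only: a logistic model in which each team has one
# strength parameter and P(i beats j) = sigmoid(s_i - s_j). The fitted
# strengths are the team strengths most likely to explain the observed
# results. The real logit score also uses speaker points; they are omitted
# here for brevity.
import numpy as np
from scipy.optimize import minimize

def fit_logit_scores(rounds, n_teams):
    """rounds: list of (winner_index, loser_index) pairs."""
    winners = np.array([w for w, _ in rounds])
    losers = np.array([l for _, l in rounds])

    def neg_log_likelihood(strengths):
        diff = strengths[winners] - strengths[losers]
        # -log sigmoid(diff) = log(1 + exp(-diff)), computed stably,
        # plus a tiny ridge penalty so the solution stays identifiable.
        return np.sum(np.logaddexp(0.0, -diff)) + 1e-3 * np.sum(strengths ** 2)

    result = minimize(neg_log_likelihood, np.zeros(n_teams), method="L-BFGS-B")
    return result.x  # one fitted strength per team

# Toy example: team 0 beats team 1 twice, team 1 beats team 2, team 2 beats team 0.
print(fit_logit_scores([(0, 1), (0, 1), (1, 2), (2, 0)], n_teams=3))
```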

I had previously looked for empirical support for the logit score in a college debate season. I took the real results for the entire season and used them to calculate each team's logit score. I then used those scores to retrodict the winner of every match-up that had actually happened, with the team ranked higher by logit score retrodicted to win the round. The logit score did this better than every other ranking method I tested, slightly edging out median speaker points and beating the win-loss record by a goodly margin. Despite this success, there was a nagging concern: the logit score was being derived from an entire season's worth of information. That empirical support could not show whether the logit score would work for a single tournament.
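For concreteness, here is a small sketch of the retrodiction test; the (winner, loser) round format and the `score` mapping are assumptions about how the data would be represented, not the actual code used.

```python
def retrodiction_accuracy(rounds, score):
    """Fraction of rounds in which the higher-rated team actually won.

    rounds: iterable of (winner, loser) pairs from the real season.
    score:  dict mapping each team to its rating (e.g. its logit score).
    """
    decided = [(w, l) for w, l in rounds if score[w] != score[l]]
    correct = sum(1 for w, l in decided if score[w] > score[l])
    return correct / len(decided) if decided else float("nan")
```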

Therefore I set out to do an experiment. I wrote a program that created a simulated tournament, and I ran and re-ran it hundreds of times. I tested various tournament conditions, from random prelims to a typical method of power-matching to pre-matching (like a round robin). I looked to see whether, under these conditions--using only the information available within a tournament--the logit score still compared favorably to record-based and speaker point-based rankings.
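Here is a simplified sketch of the kind of round simulation involved; the logistic win-probability model and the speaker-point noise shown here are illustrative assumptions, not necessarily the exact models the experiment used.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_round(pairings, true_strength, points_sd=1.0):
    """Simulate one prelim round.

    pairings: list of (team_a, team_b) match-ups for this round.
    true_strength: dict mapping each team to its latent strength.
    Returns a list of (winner, loser, points) tuples, where points holds
    noisy speaker points for both teams.
    """
    results = []
    for a, b in pairings:
        # Logistic win probability driven by the difference in true strength.
        p_a_wins = 1.0 / (1.0 + np.exp(-(true_strength[a] - true_strength[b])))
        winner, loser = (a, b) if rng.random() < p_a_wins else (b, a)
        # Speaker points: true strength plus judge-to-judge noise.
        points = {t: true_strength[t] + rng.normal(0.0, points_sd) for t in (a, b)}
        results.append((winner, loser, points))
    return results
```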

The results are that, under every condition, the logit score is a vast improvement on the win-loss record, though not quite as good as speaker points. It may surprise people to learn that speaker points, even though they vary considerably from judge to judge, are the best information available for ranking teams. A team's median speaker points are not swayed much by any one judge, and speaker points are rich data when you have only six or eight rounds in which to rank a team.

However, I suspect many in the community would prefer not to use speaker points alone. If nothing else, ignoring wins and losses gives teams a perverse incentive to speak pretty and neglect winning key arguments. The logit score is a solid, thoughtful compromise: because it is based on both wins and points, there is no perverse incentive to ignore key arguments--nor any incentive to ignore effective, mellifluous communication. Although the logit score is slightly less accurate for a single tournament than speaker points alone, it is far more accurate than the win-loss record. The logit score is, in other words, a vast improvement on the status quo method--a compromise in name only.
