Faulty ranking systems in AI leaderboards can distort the perceived performance of large language models, according to new research from the University of Michigan. The study examined four ranking methods commonly used in online AI leaderboards such as Chatbot Arena, as well as those found in sports and gaming.
Researchers found that the choice and implementation of a ranking method can lead to different outcomes even when using the same crowdsourced data on model performance. Based on these findings, they developed guidelines for improving how leaderboards represent AI models’ true capabilities.
“Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren’t accurate or well studied?” said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.
“Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them.”
Evaluating generative AI models is challenging because assessments of their outputs can be subjective. While some leaderboards measure accuracy on specific tasks with clear answers, others like Chatbot Arena focus on open-ended output by asking people to compare responses from two random models without knowing which is which. These preferences are then stored in a database for ranking.
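For illustration, each blind comparison can be thought of as a small record of which anonymous response the judge preferred. The Python sketch below is a hypothetical rendering of that idea; the field names are illustrative, not the format Chatbot Arena actually stores.

```python
# A hypothetical sketch of one blind pairwise comparison; field names are
# illustrative, not Chatbot Arena's actual schema.
from dataclasses import dataclass

@dataclass
class Comparison:
    model_a: str   # shown to the judge only as "Model A"
    model_b: str   # shown to the judge only as "Model B"
    winner: str    # "a", "b", or "tie", as chosen by the judge

# Example: the judge preferred the response labeled "Model A".
votes = [Comparison(model_a="model-x", model_b="model-y", winner="a")]
```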
However, leaderboard rankings may depend heavily on how these systems are set up. For example, Chatbot Arena previously used Elo, a rating system popular in chess, which lets operators adjust how strongly each win or loss shifts a rating. But settings that work for human athletes may not suit AI models, since AI models don't change over time the way human competitors do.
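The standard Elo update illustrates the kind of setting involved: a K-factor controls how far a single result moves a rating. The Python sketch below uses textbook Elo with made-up numbers, simply to show why the choice of K matters.

```python
# Textbook Elo update; the K-factor is the adjustable setting discussed above.
# Numbers are illustrative only.
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# The same single win shifts ratings twice as far when K is doubled,
# which is why this setting can reshape a leaderboard.
print(elo_update(1500, 1500, 1.0, k=16))  # (1508.0, 1492.0)
print(elo_update(1500, 1500, 1.0, k=32))  # (1516.0, 1484.0)
```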
“In chess and sport matches, there’s a logical order of games that proceed as the players’ skills change over their careers. But AI models don’t change between releases, and they can instantly and simultaneously play many games,” said Roland Daynauth, U-M doctoral student in computer science and engineering and the study’s first author.
To test each rating system’s effectiveness, researchers analyzed portions of two crowdsourced datasets—one from Chatbot Arena—and checked whether each system’s rankings matched actual win rates in withheld data. They also evaluated sensitivity to user settings and logical consistency across pairwise comparisons: if A beats B and B beats C, then A should outrank C.
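In outline, those two checks can be written as simple functions: one measures how often the final ranking agrees with win rates in held-out comparisons, the other counts transitivity violations. The Python sketch below is a rough rendering of the idea, not the study's code.

```python
# Rough sketches of the two evaluation checks described above; an illustration
# of the idea, not the study's implementation.
from itertools import permutations

def heldout_agreement(ranking, heldout_winrate):
    """ranking: model -> rank (1 = best); heldout_winrate: (a, b) -> share of
    held-out games that a won against b."""
    agree = total = 0
    for (a, b), wr in heldout_winrate.items():
        if wr == 0.5:
            continue  # ties carry no ordering information
        total += 1
        agree += (ranking[a] < ranking[b]) == (wr > 0.5)
    return agree / total if total else float("nan")

def transitivity_violations(ranking, beats):
    """beats: set of (winner, loser) pairs from head-to-head majorities."""
    violations = 0
    for a, b, c in permutations(ranking, 3):
        # If A beats B and B beats C, A should be ranked above C.
        if (a, b) in beats and (b, c) in beats and ranking[c] < ranking[a]:
            violations += 1
    return violations
```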
The Glicko system—used in e-sports—produced the most consistent results when comparisons were uneven among models. Other systems like Bradley-Terry (adopted by Chatbot Arena in December 2023) could also be accurate but only when all models had an equal number of head-to-head matchups; otherwise, they might favor newer entrants unfairly.
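Bradley-Terry assigns each model a strength score so that the estimated chance model i beats model j is s_i / (s_i + s_j), fitted from the observed head-to-head counts. The sketch below uses the classic iterative fit on toy data; the deliberately uneven matchup counts mirror the imbalance the study warns about, and the code is illustrative rather than Chatbot Arena's implementation.

```python
# A minimal Bradley-Terry fit using the classic iterative (MM) update.
# wins[a][b] counts a's wins over b; the toy data is deliberately unbalanced.
def bradley_terry(wins, n_iter=200):
    models = list(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        new = {}
        for i in models:
            total_wins = sum(wins[i].values())
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            new[i] = total_wins / denom if denom else strength[i]
        norm = sum(new.values())
        strength = {m: v / norm for m, v in new.items()}
    return strength  # P(i beats j) is estimated as s_i / (s_i + s_j)

# "c" has played only a handful of games, the situation the study flags.
wins = {"a": {"b": 30, "c": 2}, "b": {"a": 20, "c": 1}, "c": {"a": 1, "b": 1}}
print(bradley_terry(wins))
```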
“Just because a model comes onto the scene and beats a grandmaster doesn’t necessarily mean it’s the best model. You need many, many games to know what the truth is,” said Jason Mars, U-M associate professor of computer science and engineering and a co-corresponding author of the study.
Elo-based rankings—as well as Markov Chains (used by Google for web search)—were highly dependent on user configuration choices. In contrast, Bradley-Terry lacks adjustable settings but works best with balanced datasets where every model has faced every other model equally often.
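A Markov-chain ranking treats the comparison results as a graph in which probability mass flows from losers to winners, in the spirit of PageRank, and uses the stationary distribution as the score; a damping value is one example of the configuration choice such methods expose. The sketch below is a generic rendering of that idea, not the specific variant the study examined.

```python
# A generic PageRank-style Markov-chain ranking over pairwise results.
# The damping value is an example of the kind of configuration choice the
# study found rankings to be sensitive to; not the paper's exact setup.
def markov_rank(wins, damping=0.85, n_iter=100):
    models = list(wins)
    n = len(models)
    score = {m: 1.0 / n for m in models}
    for _ in range(n_iter):
        new = {m: (1.0 - damping) / n for m in models}
        for loser in models:
            # Mass held by `loser` flows to the models that beat it.
            losses = {w: wins[w].get(loser, 0) for w in models if w != loser}
            total = sum(losses.values())
            for m in models:
                share = (losses.get(m, 0) / total) if total else 1.0 / n
                new[m] += damping * score[loser] * share
        score = new
    return score  # higher stationary probability = higher rank
```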
“There’s no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward,” Tang said.
The research was supported by funding from the National Science Foundation.