What’s the most powerful artificial intelligence model at any given moment? Check the leaderboards.

Community-built rankings of AI models posted publicly online have surged in popularity in recent months, offering a real-time look at the ongoing battle among major tech companies for AI supremacy.

The number of leaderboards has grown as well. Each tracks which AI models are the most advanced based on their ability to complete certain tasks. An AI model, at its root, is a set of mathematical equations wrapped in code and designed to accomplish a particular goal.

Some newer entrants, such as Google’s Gemini (formerly Bard) and Mistral-Medium from the Paris-based startup Mistral AI, have stirred excitement in the AI community and jockeyed for spots near the top of the rankings.

OpenAI’s GPT-4, however, continues to dominate.

“People care about the state of the art,” said Ying Sheng, a co-creator of one such leaderboard, Chatbot Arena, and a doctoral student in computer science at Stanford University. “I think people actually would more like to see that the leaderboards are changing. That means the game is still there and there are still more improvements to be made.”

The rankings are based on tests that determine what AI models are generally capable of, as well as which model might be most competent for a specific use, like speech recognition. The tests, sometimes called benchmarks, measure AI performance on such metrics as how humanlike AI-generated audio sounds or how humanlike an AI chatbot's response appears.

As AI continues to advance, the tests themselves have to evolve as well.

“The benchmarks aren’t perfect, but as of right now, that’s kind of the only way we have to evaluate the system,” said Vanessa Parli, the director of research at Stanford’s Institute of Human-Centered Artificial Intelligence.

The institute produces Stanford’s AI Index, an annual report that tracks the technical performance of AI models across various metrics over time. Last year’s report examined 50 benchmarks but ultimately included only 20, Parli said, and this year’s will again drop some older benchmarks to highlight newer, more comprehensive ones.

The leaderboards also offer a glimpse at just how many models are in development. The Open LLM (large language model) Leaderboard built by Hugging Face, an open-source machine learning platform, had evaluated and ranked more than 4,200 models as of early February, all submitted by its community members.

The models are tracked on seven key benchmarks that aim to assess a variety of capabilities, such as reading comprehension and mathematical problem-solving. The evaluations include quizzing the models on grade-school math and science questions, testing their commonsense reasoning and measuring their propensity to repeat misinformation. Some tests offer multiple-choice answers, while others ask models to generate their own answers based on prompts.
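
As a rough illustration of how a leaderboard of this kind might turn per-benchmark results into a single ranking, the sketch below averages each model's scores and sorts by that average. The model names, benchmark names and scores are invented for illustration, and simple averaging is an assumption here, not a description of how Hugging Face or any other leaderboard actually computes its rankings.

```python
from statistics import mean

# Hypothetical accuracy scores (0-100) per benchmark.
# All model names, benchmark names and numbers are made up for illustration.
SCORES = {
    "model-a": {"grade_school_math": 60.0, "science_qa": 80.0, "commonsense": 70.0},
    "model-b": {"grade_school_math": 90.0, "science_qa": 85.0, "commonsense": 80.0},
    "model-c": {"grade_school_math": 40.0, "science_qa": 75.0, "commonsense": 65.0},
}


def rank_models(scores):
    """Return (model, average score) pairs, highest average first."""
    averages = {model: mean(results.values()) for model, results in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


LEADERBOARD = rank_models(SCORES)
for position, (model, average) in enumerate(LEADERBOARD, start=1):
    print(f"{position}. {model}: {average:.1f}")
```

A plain average treats every benchmark as equally important. A real leaderboard could weight benchmarks differently or rank models by another method entirely.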