LM Arena: An Open Testing Ground for AI Models

If you work with AI in education, you've probably wondered: Which model actually performs best for my use case?

Marketing claims and technical benchmarks rarely answer that question. A model might score impressively on standardised tests while being mediocre at the thing you actually need: explaining fractions to a struggling student, generating quiz questions in Swedish, or maintaining a helpful tone without giving answers away.

LM Arena offers a way to find out for yourself. It's an open platform where you can test different AI models head-to-head, without setting up accounts or paying for API access.

How It Works

LM Arena offers three ways to interact with models:

Battle is the core experience. You enter a prompt and receive responses from two anonymous models. You don't know which models you're comparing until after you vote. This blind evaluation removes the bias that comes from knowing a response is from ChatGPT or Claude or Gemini. Your vote contributes to the public leaderboard.

Side by Side lets you choose which two models to compare. The responses aren't anonymous; you know exactly what you're testing. This is useful when you've narrowed down your options and want to see how specific models handle your particular use cases.

Direct Chat is straightforward one-on-one conversation with a single model. Handy for deeper exploration once you've identified a promising candidate.

After any interaction, you can continue the conversation or start fresh.

The Elo System: Why the Rankings Matter

Your votes contribute to a public leaderboard based on Elo ratings, the same system used to rank chess grandmasters. The principle is straightforward: when a model beats a higher-rated opponent, it gains more points than it would for beating a lower-rated one. Over time, this creates a reliable ranking based on actual head-to-head performance.
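
To make the mechanics concrete, here is a minimal sketch of a classic Elo update in Python. The K-factor of 32 and the 400-point scale are conventional chess defaults rather than LM Arena's actual parameters, so treat this as an illustration of the principle, not the platform's own computation.

```python
# Minimal Elo sketch. K=32 and the 400-point scale are conventional
# chess defaults, assumed here for illustration only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# An upset win against a higher-rated model moves the ratings further
# than an expected win does.
print(update(1000, 1200, a_won=True))  # underdog wins: roughly +24
print(update(1200, 1000, a_won=True))  # favourite wins: roughly +8
```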

What makes LM Arena's leaderboard credible is scale. With over six million votes collected from real users testing real prompts, the rankings reflect genuine human preference rather than performance on synthetic benchmarks that models may have been optimised for.

The platform now covers multiple categories including text, vision, and coding. This recognises that "best" depends entirely on what you're trying to do.

Why This Matters for Education

When evaluating AI tools for schools, vendor claims only tell part of the story. Technical benchmarks like MMLU measure general knowledge, but they don't tell you whether a model can scaffold a concept appropriately for a 12-year-old, or whether it handles Swedish curriculum terminology naturally rather than offering awkward translations of American concepts.

LM Arena lets you test these things directly. You can see how models differ in tone, in how they follow instructions, and in whether they guide students toward understanding or simply hand over answers.

For product teams building educational AI, this kind of testing belongs early in the process. Test before you've committed to an API contract or written integration code. A fifteen-minute session comparing models on your actual use cases can save months of rework.
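
One way to make that fifteen-minute session productive is to decide your prompts and judging criteria before you open the arena. The battery below is a hypothetical sketch in Python; every prompt and criterion is an invented illustration drawn from the use cases mentioned above, not a rubric the platform provides.

```python
# A hypothetical prompt battery for a short comparison session.
# All prompts and criteria are illustrative examples.

TEST_PROMPTS = [
    {
        "prompt": "Explain why 3/4 is larger than 2/3 to a 12-year-old "
                  "who is struggling, without just giving the answer.",
        "criteria": ["scaffolds rather than tells", "age-appropriate tone"],
    },
    {
        # Swedish: "Write five quiz questions about photosynthesis for year 7."
        "prompt": "Skriv fem quizfrågor om fotosyntes för årskurs 7.",
        "criteria": ["natural Swedish", "curriculum-appropriate difficulty"],
    },
    {
        "prompt": "A student says 'I'm too stupid for maths.' Respond "
                  "as a supportive tutor.",
        "criteria": ["encouraging without giving answers away",
                     "offers a concrete next step"],
    },
]

# Print the battery as a checklist to work through in Battle or
# Side by Side mode.
for i, case in enumerate(TEST_PROMPTS, start=1):
    print(f"Prompt {i}: {case['prompt']}")
    print(f"  Judge on: {', '.join(case['criteria'])}\n")
```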

A Note on Privacy

LM Arena is an open research project run by LMSYS, a group of researchers primarily from UC Berkeley. Everything you type is logged and may be made public to support AI research and model training.

This makes it excellent for testing capabilities but entirely unsuitable for processing real data. Never paste student information, personal details, or proprietary content into the arena. Use synthetic examples that represent your use cases without exposing anything sensitive.
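
One low-tech discipline that helps is a quick pre-flight check before anything gets pasted in. The sketch below is a hypothetical illustration in Python; the patterns and the looks_safe helper are inventions, catch only the most obvious identifiers, and are no substitute for writing synthetic prompts in the first place.

```python
import re

# Hypothetical pre-flight check before pasting a prompt into the arena.
# These patterns are illustrative and catch only obvious identifiers.

OBVIOUS_IDENTIFIERS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),            # email addresses
    re.compile(r"\b\d{3}[-\s]?\d{2}[-\s]?\d{2,4}\b"),  # phone-like numbers
]

def looks_safe(prompt: str) -> bool:
    """Return False if the prompt contains an obvious identifier."""
    return not any(p.search(prompt) for p in OBVIOUS_IDENTIFIERS)

# A synthetic scenario exercises the same capability without exposing anyone.
synthetic = ("A fictional year-7 student scored 12/30 on a fractions test. "
             "Draft an encouraging note home suggesting two things to practise.")
print(looks_safe(synthetic))                                    # True
print(looks_safe("Email the results to anna.larsson@example.se"))  # False
```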

The Bigger Picture

We're moving away from the idea of a single "best" model toward a landscape of specialised tools. The right choice depends on your specific context: the language, the age group, the pedagogical approach, the cost constraints.

LM Arena won't make that choice for you, but it gives you a way to explore the options with your own eyes, on your own terms, at no cost.

Worth bookmarking: lmarena.ai