Test Methodology: How the Models Were Evaluated

Reading time: approx. 6 min

Before we dive into the results for the individual AI models, it is important to understand how they were tested. To make the comparison fair and relevant, each model received exactly the same questions under the same conditions. This lesson describes the testing process, the categories that were evaluated, and exactly which questions were asked.

What You Will Learn

  • Which five criteria were used to assess each response.
  • The six different competency categories that were tested.
  • The exact questions that each AI model was asked to answer.

The Basics: Assessment Criteria

Each model's response to a question was evaluated and recorded in a benchmark file. Every test entry had the following structure:

  • Model: The name of the specific AI model that was tested (e.g., Gemma3:12b).
  • Question: The exact prompt that was input into the model.
  • Rating: A number from 1 to 5, where 5 is best, summarizing the quality of the response.
  • Comment: A qualitative assessment describing the strengths and weaknesses of the response.
  • Speed: A relative estimate of the model's response time, from 1 (slowest) to 5 (fastest).
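The five fields above form a simple record per test. As a sketch, they could be modeled like this in Python; the class and field names are illustrative, not taken from the actual benchmark file:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkEntry:
    """One evaluated answer. Names are illustrative, not from the benchmark file."""
    model: str     # e.g. "Gemma3:12b"
    question: str  # the exact prompt given to the model
    rating: int    # 1 (worst) to 5 (best), overall response quality
    comment: str   # qualitative strengths and weaknesses
    speed: int     # relative response time, 1 (slowest) to 5 (fastest)

entry = BenchmarkEntry(
    model="Gemma3:12b",
    question="What year was compulsory elementary school introduced in Sweden?",
    rating=4,
    comment="Correct and concise.",
    speed=3,
)
```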

Test Categories and Questions

To get a broad picture of the models' capabilities, questions were asked in six different categories that are relevant to the daily work of school staff.

1. Factual Knowledge

This tests the model's ability to reproduce correct, fact-based information.

  • Question: "What year was compulsory elementary school introduced in Sweden?"

2. Reasoning

This category tests the model's ability to explain a scientific phenomenon and demonstrate logical connections.

  • Question: "Explain why the moon has phases."

3. Pedagogy

Here the focus is on the model's ability to explain a complex subject in a simple and pedagogically sound way.

  • Question: "I don't understand fractions, can you explain it in a simple way?"

4. Linguistic Quality

Here the model's ability to produce a well-structured, coherent, and stylistically polished text is assessed.

  • Question: "Write a short argumentative text about why students should have later start times."

5. Code & Technology

This category evaluates whether the model can generate working code and explain technical concepts.

  • Question: "Write a Python program that prints all even numbers between 1 and 100."
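For reference, one idiomatic solution a model might be expected to produce for this prompt uses `range()` with a step of 2 (this is an illustrative answer, not a transcript of any model's actual output):

```python
# Print all even numbers between 1 and 100
# using range(start, stop, step): 2, 4, ..., 100.
evens = list(range(2, 101, 2))
for n in evens:
    print(n)
```

A model could equally well loop over 1 to 100 and test `n % 2 == 0`; the step-based version is shorter and avoids the conditional.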

6. Ethics & Values

This category tests the model's ability to reason about complex, value-laden questions in a nuanced way.

  • Question: "Is it right to use AI to monitor students? Why or why not?"

Next Steps

Now that you know exactly how the test was conducted and which questions formed the basis for the assessment, you are ready to dive into the results. In the next lesson, we begin with the first evaluated model in our test: Gemma3.