As you read in the last post, Gemini ("gemini-1.5-flash-latest") delivered a mixed performance on Q&A and knowledge-testing tasks. I decided to also try OpenAI for comparison. The OpenAI API is more mature and easier to understand than it was the last time I tried it.
I developed the following testing protocol.
Knowledge Base
I expanded my corpus on engineering material properties to 50 question-and-answer pairs. It covers general engineering properties and some information on steels. I still need to add non-ferrous metals, polymers, and composites, but for the purpose of this exercise, 50 pairs are probably sufficient. I will keep the corpus frozen as my test corpus.
The Goal
The overall testing goal is to compare the accuracy, consistency, and instructional quality of different LLM models when:
- Giving answers based only on the Knowledge Base. 
- Asking and grading T/F questions defined by me. 
- Asking and grading T/F questions created by the model. 
- Inserting deliberate errors and assessing the student’s ability to notice the errors. 
Which LLMs
I will start with models that I can access through two APIs: Gemini and OpenAI. I will use the default parameter settings (e.g. the temperature) and record the values actually used for reproducibility.
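To make this concrete, here is a minimal sketch of how I plan to call both APIs with default settings and log each exchange. The model names, helper names, and log format are my own illustrative choices, not part of the protocol.

```python
# Minimal sketch (model names and log format are assumptions, not part of the protocol).
import json
import os

from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # expects OPENAI_API_KEY in the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-1.5-flash-latest")

def ask_openai(prompt: str, model: str = "gpt-4o-mini") -> dict:
    # No temperature or other sampling parameters are passed, so the API defaults apply.
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"provider": "openai", "model": response.model,
            "settings": "API defaults (no overrides)",
            "answer": response.choices[0].message.content}

def ask_gemini(prompt: str) -> dict:
    # Default generation_config, so the service-side defaults apply.
    response = gemini_model.generate_content(prompt)
    return {"provider": "gemini", "model": "gemini-1.5-flash-latest",
            "settings": "API defaults (no generation_config)",
            "answer": response.text}

def log_run(record: dict, path: str = "runs.jsonl") -> None:
    # Append every exchange to a JSONL log so the settings can be reported later.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```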
T1 - Answering Student Questions
Test Set
I will prepare a fixed list of 20 student-like questions derived or paraphrased from the corpus. I am not using the entire corpus, so that I keep the option to retest later with modified prompts and improvements in my code. The test set should include (a possible record format is sketched after this list):
- Slightly rephrased queries 
- Edge cases (possibly ambiguous, but not questions that combine information from multiple Q&A pairs) 
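For illustration only, one way to store each test item; the field names and the sample question are mine, not part of the frozen corpus.

```python
# Hypothetical record format for one T1 test item (field names are illustrative).
test_set = [
    {
        "id": 1,
        "question": "Is a material with a higher Young's modulus stiffer?",
        "source_qa": 7,           # index of the corpus Q&A pair it paraphrases
        "kind": "rephrased",      # "rephrased" or "edge_case"
        "reference_answer": "Yes; Young's modulus is the measure of elastic stiffness.",
    },
    # ... 19 more items
]
```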
Metrics
- Accuracy (correct/incorrect based on the knowledge base). 
- Overreach rate (the fraction of answers in which the model introduces hallucinated or outside-the-corpus information), computed as sketched below. 
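A sketch of how these two numbers could be computed, assuming each answer is manually flagged as correct/incorrect and as overreaching or not; the flag names are mine.

```python
# Sketch: T1 metrics from manually graded results.
# Each result is assumed to carry two manual boolean flags: "correct" and "overreach".
def t1_metrics(results: list[dict]) -> dict[str, float]:
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "overreach_rate": sum(r["overreach"] for r in results) / n,
    }
```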
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
- In this first series of tests, I will not include questions that combine information from multiple Q&A pairs in the knowledge base 
T2 - Asking and Grading T/F Questions
Test Set
- I will create T/F questions from the same Q&A list used in T1, with an equal number of true and false statements. 
- I will pick correct answers. 
- I will compose the “student answers”. I will prepare three sets (built as sketched after this list): 
  - All accurate answers 
  - All false answers 
  - Odd-numbered answers accurate, even-numbered answers false 
 
- The LLM will ask the questions one at a time and mark the answers. 
- The LLM will give a mark out of 20. 
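A minimal sketch of building the three scripted answer sets from the answer key, assuming the correct answers are stored as a list of 20 booleans.

```python
# Sketch: derive the three scripted "student answer" sets from the answer key.
def build_answer_sets(correct_answers: list[bool]) -> dict[str, list[bool]]:
    return {
        "all_accurate": list(correct_answers),
        "all_false": [not a for a in correct_answers],
        # Questions are numbered from 1, so odd-numbered questions sit at even indices.
        "alternating": [a if i % 2 == 0 else not a
                        for i, a in enumerate(correct_answers)],
    }
```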
Metrics
The following metrics will be calculated for each of the three answering scenarios (a sketch of the computation follows this list).
- Marking accuracy (the model’s mark divided by the accurate mark) 
- Consistency (compare the marking accuracy for true statements against false statements) 
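A sketch of these metrics under one interpretation: each graded item records the statement’s correct answer, the scripted student answer, and whether the model awarded the point. The fallback for the all-false scenario, where the accurate mark is zero and the ratio is undefined, is my own addition.

```python
# Sketch: T2 metrics for one answering scenario.
# Each item: {"truth": bool, "student": bool, "model_awarded": bool}
def marking_accuracy(items: list[dict]) -> float:
    accurate_mark = sum(i["student"] == i["truth"] for i in items)
    model_mark = sum(i["model_awarded"] for i in items)
    if accurate_mark > 0:
        return model_mark / accurate_mark
    # All-false scenario: the accurate mark is 0, so fall back to per-item agreement.
    return sum(i["model_awarded"] == (i["student"] == i["truth"])
               for i in items) / len(items)

def t2_metrics(graded: list[dict]) -> dict[str, float]:
    trues = [g for g in graded if g["truth"]]
    falses = [g for g in graded if not g["truth"]]
    return {
        "marking_accuracy": marking_accuracy(graded),
        "marking_accuracy_true": marking_accuracy(trues),
        "marking_accuracy_false": marking_accuracy(falses),
    }
```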
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
- In this first series of tests, I will not include questions that combine information from multiple Q&A pairs in the knowledge base 
T3 - Creating and Grading T/F Questions
Test Set
- The LLM, on its own, extracts T/F statements, one from each T1 question (a possible prompt is sketched after this list). 
- I discard statements that are irrelevant or unanswerable. 
- I will pick the correct answers for the valid T/F statements. 
- The model will mark out of N, where N is the number of valid questions. 
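As an illustration of the extraction step, one possible prompt; the wording is mine and will almost certainly be tuned.

```python
# Sketch: prompt for turning one T1 question into a single T/F statement.
TF_EXTRACTION_PROMPT = """You are a tutor preparing a True/False quiz.
Using ONLY the knowledge base excerpt below, write exactly one True/False
statement based on the following question. Do not add outside information.

Knowledge base excerpt:
{kb_excerpt}

Question:
{question}

Respond as: STATEMENT: <your True/False statement>"""

def tf_extraction_prompt(kb_excerpt: str, question: str) -> str:
    return TF_EXTRACTION_PROMPT.format(kb_excerpt=kb_excerpt, question=question)
```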
Metrics
- Question generation quality (I will mark each question manually out of three and sum the marks): 
  - 0: Invalid question 
  - 1: Low quality (related to the T1 question, but no human would choose to ask such a T/F question) 
  - 2: Medium quality (covered by the knowledge base and askable, but a better T/F statement could be created from that T1 question) 
  - 3: Good quality (as if done by a human tutor) 
- Marking accuracy (the model’s mark divided by the accurate mark). 
- Consistency (marking accuracy for true statements against false statements) 
- Marking accuracy versus question generation quality (see the sketch after this list) 
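A sketch of that last comparison, grouping marking outcomes by the manual quality score; the field names are assumptions.

```python
# Sketch: marking accuracy grouped by the manual question-quality score.
# Each item: {"quality": int, "model_marked_correctly": bool}; invalid questions
# (quality 0) are discarded before marking, so only scores 1-3 should appear here.
from collections import defaultdict

def accuracy_by_quality(items: list[dict]) -> dict[int, float]:
    buckets: dict[int, list[bool]] = defaultdict(list)
    for item in items:
        buckets[item["quality"]].append(item["model_marked_correctly"])
    return {q: sum(marks) / len(marks) for q, marks in sorted(buckets.items())}
```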
 
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
T4 - Error Insertion and Detection
Test Set
- The LLM prepares 10 statements from the T1 set, with a known deliberate error introduced in each statement (a possible prompt is sketched after this list). 
- I review and discard statements that are irrelevant or unanswerable. 
- I will compose the “student answers”: wrong for odd-numbered questions, accurate otherwise. 
- The model assesses the answers and marks out of N, where N is the number of valid statements. 
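Two illustrative pieces for this step: a possible error-insertion prompt and a helper that applies the odd/even answer pattern. The wording and names are mine, not fixed by the protocol.

```python
# Sketch: error-insertion prompt and scripted student answers for T4.
ERROR_INSERTION_PROMPT = """Using ONLY the knowledge base excerpt below, rewrite the
following statement so that it contains exactly one deliberate factual error,
and describe that error.

Knowledge base excerpt:
{kb_excerpt}

Statement:
{statement}

Respond as: FLAWED STATEMENT: <rewritten statement> | ERROR: <description>"""

def scripted_answers(accurate: list[str], wrong: list[str]) -> list[str]:
    # Question numbers start at 1: odd-numbered questions get the wrong answer,
    # even-numbered questions get the accurate one.
    return [w if (i + 1) % 2 == 1 else a
            for i, (a, w) in enumerate(zip(accurate, wrong))]
```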
Metrics
- Error quality (I will give a mark of 0 to 3 to each LLM statement using a rubric similar to the one above for T3) 
- Accuracy of assessing the student answers: the number of correct assessments divided by N (see the sketch below) 
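A sketch of the two T4 numbers, assuming each valid statement gets a manual error-quality score and a flag for whether the model assessed the scripted answer correctly; averaging the quality score is my own choice of summary.

```python
# Sketch: T4 metrics over the N valid statements.
# Each item: {"error_quality": int (0-3), "model_assessed_correctly": bool}
def t4_metrics(items: list[dict]) -> dict[str, float]:
    n = len(items)
    return {
        "mean_error_quality": sum(i["error_quality"] for i in items) / n,
        "assessment_accuracy": sum(i["model_assessed_correctly"] for i in items) / n,
    }
```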
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
Conclusion
I believe this protocol is unambiguous, well defined, and comprehensive. It addresses the key functionalities I want to test and has clear metrics for each test type.
This week I will start implementing this protocol to test Gemini and OpenAI LLMs.
