As you read in the last post, Gemini ("gemini-1.5-flash-latest") delivered a mixed performance on Q&A and knowledge-testing tasks. I decided to also try OpenAI for comparison. The OpenAI API is more mature and easier to understand than it was the last time I tried it.
I developed the following testing protocol.
Knowledge Base
I expanded my corpus on engineering material properties to 50 question-and-answer pairs. It covers general engineering properties and some information on steels. I still need to add non-ferrous metals, polymers, and composites, but for the purpose of this exercise, 50 pairs are probably sufficient. I will keep the corpus frozen as my test corpus.
The Goal
The overall testing goal is to compare the accuracy, consistency, and instructional quality of different LLM models when:
- Giving answers based only on the Knowledge Base. 
- Asking and grading T/F questions defined by me. 
- Asking and grading T/F questions created by the model. 
- Inserting deliberate errors and assessing the student’s ability to notice the errors. 
Which LLMs
I will start with models that I can access through two APIs: Gemini and OpenAI. I will use the default parameter settings (e.g. the temperature) and record the values actually used for reproducibility.
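To make this concrete, here is a minimal sketch of how I plan to call both APIs with default settings and log each exchange. The model names, helper names, and log format are my own illustrative choices, not part of the protocol.

```python
# Minimal sketch (model names and log format are assumptions, not part of the protocol).
import json
import os

from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # expects OPENAI_API_KEY in the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-1.5-flash-latest")

def ask_openai(prompt: str, model: str = "gpt-4o-mini") -> dict:
    # No temperature or other sampling parameters are passed, so the API defaults apply.
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"provider": "openai", "model": response.model,
            "settings": "API defaults (no overrides)",
            "answer": response.choices[0].message.content}

def ask_gemini(prompt: str) -> dict:
    # Default generation_config, so the service-side defaults apply.
    response = gemini_model.generate_content(prompt)
    return {"provider": "gemini", "model": "gemini-1.5-flash-latest",
            "settings": "API defaults (no generation_config)",
            "answer": response.text}

def log_run(record: dict, path: str = "runs.jsonl") -> None:
    # Append every exchange to a JSONL log so the settings can be reported later.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```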
T1 - Answering Student Questions
Test Set
I will prepare a fixed list of 20 student-like questions derived or paraphrased from the corpus. I am not using the entire corpus, so that I keep the option to retest later with modified prompts and improvements in my code. The test set should include (a possible record format is sketched after this list):
- Slightly rephrased queries 
- Edge cases (possibly ambiguous, but not questions that combine information from multiple Q&A pairs) 
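For illustration only, one way to store each test item; the field names and the sample question are mine, not part of the frozen corpus.

```python
# Hypothetical record format for one T1 test item (field names are illustrative).
test_set = [
    {
        "id": 1,
        "question": "Is a material with a higher Young's modulus stiffer?",
        "source_qa": 7,           # index of the corpus Q&A pair it paraphrases
        "kind": "rephrased",      # "rephrased" or "edge_case"
        "reference_answer": "Yes; Young's modulus is the measure of elastic stiffness.",
    },
    # ... 19 more items
]
```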
Metrics
- Accuracy (correct/incorrect based on the knowledge base). 
- Overreach rate (the fraction of answers in which the model introduces hallucinated or outside-the-corpus information), computed as sketched below. 
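A sketch of how these two numbers could be computed, assuming each answer is manually flagged as correct/incorrect and as overreaching or not; the flag names are mine.

```python
# Sketch: T1 metrics from manually graded results.
# Each result is assumed to carry two manual boolean flags: "correct" and "overreach".
def t1_metrics(results: list[dict]) -> dict[str, float]:
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "overreach_rate": sum(r["overreach"] for r in results) / n,
    }
```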
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
- In this first series of tests, I will not include questions that combine information from multiple Q&A pairs in the knowledge base 
T2 - Asking and Grading T/F Questions
Test Set
- I will create T/F questions from the same Q&A list used in T1, with an equal number of true and false statements. 
- I will pick correct answers. 
- I will compose the “student answers”. I will prepare three sets (built as sketched after this list): 
  - All accurate answers 
  - All false answers 
  - Odd-numbered answers accurate, even-numbered answers false 
 
- The LLM will ask the questions one at a time and mark the answers. 
- The LLM will give a mark out of 20. 
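A minimal sketch of building the three scripted answer sets from the answer key, assuming the correct answers are stored as a list of 20 booleans.

```python
# Sketch: derive the three scripted "student answer" sets from the answer key.
def build_answer_sets(correct_answers: list[bool]) -> dict[str, list[bool]]:
    return {
        "all_accurate": list(correct_answers),
        "all_false": [not a for a in correct_answers],
        # Questions are numbered from 1, so odd-numbered questions sit at even indices.
        "alternating": [a if i % 2 == 0 else not a
                        for i, a in enumerate(correct_answers)],
    }
```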
Metrics
The following metrics will be calculated for each of the three answering scenarios (a sketch of the computation follows this list).
- Marking accuracy (the model’s mark divided by the accurate mark) 
- Consistency (compare the marking accuracy for true statements against false statements) 
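A sketch of these metrics under one interpretation: each graded item records the statement’s correct answer, the scripted student answer, and whether the model awarded the point. The fallback for the all-false scenario, where the accurate mark is zero and the ratio is undefined, is my own addition.

```python
# Sketch: T2 metrics for one answering scenario.
# Each item: {"truth": bool, "student": bool, "model_awarded": bool}
def marking_accuracy(items: list[dict]) -> float:
    accurate_mark = sum(i["student"] == i["truth"] for i in items)
    model_mark = sum(i["model_awarded"] for i in items)
    if accurate_mark > 0:
        return model_mark / accurate_mark
    # All-false scenario: the accurate mark is 0, so fall back to per-item agreement.
    return sum(i["model_awarded"] == (i["student"] == i["truth"])
               for i in items) / len(items)

def t2_metrics(graded: list[dict]) -> dict[str, float]:
    trues = [g for g in graded if g["truth"]]
    falses = [g for g in graded if not g["truth"]]
    return {
        "marking_accuracy": marking_accuracy(graded),
        "marking_accuracy_true": marking_accuracy(trues),
        "marking_accuracy_false": marking_accuracy(falses),
    }
```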
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
- In this first series of tests, I will not include questions that combine information from multiple Q&A pairs in the knowledge base 
T3 - Creating and Grading T/F Questions
Test Set
- The LLM, on its own, extracts T/F statements, one from each T1 question (a possible prompt is sketched after this list). 
- I discard statements that are irrelevant or unanswerable. 
- I will pick the correct answers for the valid T/F statements. 
- The model will mark out of N, where N is the number of valid questions. 
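As an illustration of the extraction step, one possible prompt; the wording is mine and will almost certainly be tuned.

```python
# Sketch: prompt for turning one T1 question into a single T/F statement.
TF_EXTRACTION_PROMPT = """You are a tutor preparing a True/False quiz.
Using ONLY the knowledge base excerpt below, write exactly one True/False
statement based on the following question. Do not add outside information.

Knowledge base excerpt:
{kb_excerpt}

Question:
{question}

Respond as: STATEMENT: <your True/False statement>"""

def tf_extraction_prompt(kb_excerpt: str, question: str) -> str:
    return TF_EXTRACTION_PROMPT.format(kb_excerpt=kb_excerpt, question=question)
```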
Metrics
- Question generation quality (I will mark each question manually out of three and sum the marks): 
  - 0: Invalid question 
  - 1: Low quality (related to the T1 question, but no human would choose to ask such a T/F question) 
  - 2: Medium quality (covered by the knowledge base and askable, but a better T/F statement could be created from that T1 question) 
  - 3: Good quality (as if done by a human tutor) 
- Marking accuracy (the model’s mark divided by the accurate mark). 
- Consistency (marking accuracy for true statements against false statements) 
- Marking accuracy versus question generation quality (see the sketch after this list) 
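A sketch of that last comparison, grouping marking outcomes by the manual quality score; the field names are assumptions.

```python
# Sketch: marking accuracy grouped by the manual question-quality score.
# Each item: {"quality": int, "model_marked_correctly": bool}; invalid questions
# (quality 0) are discarded before marking, so only scores 1-3 should appear here.
from collections import defaultdict

def accuracy_by_quality(items: list[dict]) -> dict[int, float]:
    buckets: dict[int, list[bool]] = defaultdict(list)
    for item in items:
        buckets[item["quality"]].append(item["model_marked_correctly"])
    return {q: sum(marks) / len(marks) for q, marks in sorted(buckets.items())}
```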
 
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
T4 - Error Insertion and Detection
Test Set
- The LLM prepares 10 statements from the T1 set, with a known deliberate error introduced in each statement (a possible prompt is sketched after this list). 
- I review and discard statements that are irrelevant or unanswerable. 
- I will compose the “student answers”: wrong for odd-numbered questions, accurate otherwise. 
- The model assesses the answers and marks out of N, where N is the number of valid statements. 
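Two illustrative pieces for this step: a possible error-insertion prompt and a helper that applies the odd/even answer pattern. The wording and names are mine, not fixed by the protocol.

```python
# Sketch: error-insertion prompt and scripted student answers for T4.
ERROR_INSERTION_PROMPT = """Using ONLY the knowledge base excerpt below, rewrite the
following statement so that it contains exactly one deliberate factual error,
and describe that error.

Knowledge base excerpt:
{kb_excerpt}

Statement:
{statement}

Respond as: FLAWED STATEMENT: <rewritten statement> | ERROR: <description>"""

def scripted_answers(accurate: list[str], wrong: list[str]) -> list[str]:
    # Question numbers start at 1: odd-numbered questions get the wrong answer,
    # even-numbered questions get the accurate one.
    return [w if (i + 1) % 2 == 1 else a
            for i, (a, w) in enumerate(zip(accurate, wrong))]
```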
Metrics
- Error quality (I will give a mark of 0 to 3 to each LLM statement using a rubric similar to the one above for T3) 
- Accuracy of assessing the student answers: the number of correct assessments divided by N (see the sketch below) 
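A sketch of the two T4 numbers, assuming each valid statement gets a manual error-quality score and a flag for whether the model assessed the scripted answer correctly; averaging the quality score is my own choice of summary.

```python
# Sketch: T4 metrics over the N valid statements.
# Each item: {"error_quality": int (0-3), "model_assessed_correctly": bool}
def t4_metrics(items: list[dict]) -> dict[str, float]:
    n = len(items)
    return {
        "mean_error_quality": sum(i["error_quality"] for i in items) / n,
        "assessment_accuracy": sum(i["model_assessed_correctly"] for i in items) / n,
    }
```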
Not Yet
- When I later repeat these tests, I will add a metric for "helpfulness" or "pedagogical value" of responses 
Conclusion
I believe this protocol is unambiguous, well defined, and comprehensive. It addresses the key functionalities I want to test and has clear metrics for each test type.
This week I will start implementing this protocol to test Gemini and OpenAI LLMs.
