As you read in the last post, Gemini ("gemini-1.5-flash-latest") delivered a mixed performance in Q&A and knowledge-testing tasks. I decided to also test OpenAI for comparison. The OpenAI API is more mature and easier to understand than it was the last time I tried it.
I developed the following testing protocol.
Knowledge Base
I expanded my corpus on engineering material properties to 50 questions and answers. It covers general engineering properties and some information on steels. I still need to add information on non-ferrous materials, polymers, and composites, but for the purpose of this exercise 50 is probably sufficient. I will keep it frozen as my test corpus.
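For reference, here is a minimal sketch of how I plan to store and load the frozen corpus. The file name and field names are my own choices for this project, not anything imposed by the APIs.

```python
import json

def load_corpus(path="materials_corpus.json"):
    # The corpus is a JSON list of 50 objects with hypothetical field
    # names "topic", "question", and "answer".
    with open(path, encoding="utf-8") as f:
        corpus = json.load(f)
    assert len(corpus) == 50, "the test corpus is frozen at 50 Q&A pairs"
    return corpus

# Example entry (illustrative, not an actual corpus item):
# {"topic": "steels",
#  "question": "What effect does quenching have on a plain carbon steel?",
#  "answer": "It increases hardness by forming martensite."}
```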
The Goal
The overall testing goal is to compare the accuracy, consistency, and instructional quality of different LLM models when:
Giving answers based only on the Knowledge Base.
Asking and grading T/F questions defined by me.
Asking and grading T/F questions created by the model.
Inserting deliberate errors and assessing the student’s ability to notice the errors.
Which LLMs
I will start with models that I can access through two APIs: Gemini and OpenAI. I will use each API's default parameter settings (e.g., temperature) and record their values for reproducibility.
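To keep the two back ends interchangeable in my test code, I intend to hide each API behind a small wrapper function. Below is a minimal sketch, assuming the google-generativeai and openai Python packages with API keys in the usual environment variables; the OpenAI model name is a placeholder, and I leave the generation parameters unset so each API applies its own defaults.

```python
import os
import google.generativeai as genai
from openai import OpenAI

# The OpenAI client reads OPENAI_API_KEY from the environment on its own.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def ask_gemini(prompt, model_name="gemini-1.5-flash-latest"):
    # No generation_config is passed, so the API's default settings apply.
    model = genai.GenerativeModel(model_name)
    return model.generate_content(prompt).text

def ask_openai(prompt, model_name="gpt-4o-mini"):  # placeholder model name
    client = OpenAI()
    # Temperature is not set here either, so the default is used.
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```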
T1 - Answering Student Questions
Test Set
I will prepare a fixed list of 20 student-like questions derived or paraphrased from the corpus. I am not using the entire corpus, so that I can retest later with modified prompts and improved code; the prompt I plan to use is sketched after this list. The test set should include:
Slightly rephrased queries
Edge cases (possibly ambiguous questions, but not ones that combine information from multiple Q&A pairs)
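The central constraint in T1 is that answers must come only from the knowledge base. The sketch below shows one way I might enforce that: paste the whole frozen corpus into the prompt and instruct the model to refuse anything outside it. The instruction wording is a placeholder I expect to tune.

```python
def build_t1_prompt(corpus, student_question):
    # Fifty short Q&A pairs fit comfortably in both models' context windows,
    # so the whole knowledge base goes into every prompt.
    kb_text = "\n".join(f"Q: {item['question']}\nA: {item['answer']}"
                        for item in corpus)
    return (
        "You are a tutor. Answer the student's question using ONLY the "
        "knowledge base below. If the answer is not in the knowledge base, "
        "say that you do not know.\n\n"
        f"KNOWLEDGE BASE:\n{kb_text}\n\n"
        f"STUDENT QUESTION: {student_question}"
    )
```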
Metrics
Accuracy (correct/incorrect based on the knowledge base).
Overreach rate (the proportion of answers in which the model introduces hallucinated or outside information). Both metrics are computed as sketched below.
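Both T1 metrics reduce to simple counts once I have judged each answer by hand. A sketch, assuming I record two booleans per answer:

```python
def t1_metrics(judgements):
    # judgements: one dict per test question, filled in by my manual review,
    # e.g. {"correct": True, "overreach": False}
    n = len(judgements)
    accuracy = sum(j["correct"] for j in judgements) / n
    overreach_rate = sum(j["overreach"] for j in judgements) / n
    return {"accuracy": accuracy, "overreach_rate": overreach_rate}
```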
Not Yet
When I later repeat these tests, I will add a metric for the "helpfulness" or "pedagogical value" of responses.
In this first series of tests, I will not include questions that combine information from multiple Q&A pairs in the knowledge base.
T2 - Asking and Grading T/F Questions
Test Set
I will create T/F questions from the same Q&A list used in T1, with an equal number of true and false statements.
I will record the correct answer for each statement.
I will compose the "student answers". I will prepare three sets (derived as sketched below):
All accurate answers
All false answers
Odd-numbered answers accurate, even-numbered answers false
The LLM will ask the questions one at a time and will mark the answers.
The LLM will give a mark out of 20.
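The three simulated answer sheets can be derived mechanically from the answer key. A sketch, assuming the key is a list of booleans in question order (question 1 first):

```python
def build_answer_sheets(answer_key):
    # answer_key: the correct True/False value for each of the 20 questions.
    all_accurate = list(answer_key)
    all_false = [not a for a in answer_key]
    # Odd-numbered questions (1, 3, 5, ...) answered accurately,
    # even-numbered questions answered falsely.
    alternating = [a if i % 2 == 0 else not a for i, a in enumerate(answer_key)]
    return {"all_accurate": all_accurate,
            "all_false": all_false,
            "alternating": alternating}
```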
Metrics
The following metrics will be calculated for each of the three answering scenarios (a calculation sketch follows this list).
Marking accuracy (the model's mark divided by the accurate mark).
Consistency: compare the marking accuracy for true statements against false statements.
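One way to compute the T2 numbers, per answer sheet: derive the accurate mark from the answer key, take the model's mark as a ratio of it, and repeat the ratio separately for true and false statements. The per-question record layout below is my own assumption; note that the all-false sheet has an accurate mark of zero, so the ratio is only meaningful for the other two sheets.

```python
def t2_metrics(answer_key, student_answers, model_marks):
    # answer_key:      correct True/False value per question
    # student_answers: the simulated student's answer per question
    # model_marks:     1 if the model marked that answer as correct, else 0
    accurate = [int(s == k) for s, k in zip(student_answers, answer_key)]

    def ratio(indices):
        want = sum(accurate[i] for i in indices)
        got = sum(model_marks[i] for i in indices)
        return got / want if want else None  # undefined if the accurate mark is 0

    every = range(len(answer_key))
    trues = [i for i in every if answer_key[i]]
    falses = [i for i in every if not answer_key[i]]
    return {"marking_accuracy": ratio(every),
            "accuracy_on_true_statements": ratio(trues),
            "accuracy_on_false_statements": ratio(falses)}
```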
Not Yet
When I later repeat these tests, I will add a metric for the "helpfulness" or "pedagogical value" of responses.
In this first series of tests, I will not include questions that combine information from multiple Q&A pairs in the knowledge base.
T3 - Creating and Grading T/F Questions
Test Set
The LLM extracts T/F statements on its own, one from each T1 question (a prompt sketch follows this list).
I discard statements that are irrelevant or unanswerable.
I will pick correct answers for the valid T/F statements.
The model will mark out of N, where N is the number of valid questions.
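The T3 generation step also goes through the model. Here is a sketch of the kind of prompt I have in mind; the wording is a placeholder, and every returned statement still passes through my manual review before it is used.

```python
def generate_tf_statement(ask_model, t1_question, reference_answer):
    # ask_model is one of the wrapper functions sketched earlier
    # (ask_gemini or ask_openai).
    prompt = (
        "From the question and answer below, write ONE true-or-false "
        "statement that a tutor could ask a student. Reply with the "
        "statement only.\n\n"
        f"Question: {t1_question}\nAnswer: {reference_answer}"
    )
    return ask_model(prompt).strip()
```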
Metrics
Question generation quality (I will mark each question manually out of three and sum the scores):
0: Invalid question
1: Low quality (related to the T1 question, but no human would choose to ask such a T/F question)
2: Medium quality (covered by the knowledge base and askable, but a better T/F statement could be crafted from that T1 question)
3: Good quality (as if done by a human tutor)
Marking accuracy (the model's mark divided by the accurate mark).
Consistency:
Marking accuracy for true statements against false statements
Marking accuracy versus the question generation quality (a calculation sketch follows this list)
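The second consistency check in T3 relates marking accuracy to my manual quality scores. A sketch, assuming I keep one record per valid statement with the quality score, the model's mark, and the accurate mark:

```python
from collections import defaultdict

def marking_accuracy_by_quality(records):
    # records: e.g. {"quality": 2, "model_mark": 1, "accurate_mark": 1}
    totals = defaultdict(lambda: {"model": 0, "accurate": 0})
    for r in records:
        totals[r["quality"]]["model"] += r["model_mark"]
        totals[r["quality"]]["accurate"] += r["accurate_mark"]
    # Marking accuracy per quality level (None where the accurate mark is 0).
    return {q: (t["model"] / t["accurate"] if t["accurate"] else None)
            for q, t in sorted(totals.items())}
```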
Not Yet
When I later repeat these tests, I will add a metric for the "helpfulness" or "pedagogical value" of responses.
T4 - Error Insertion and Detection
Test Set
The LLM prepares 10 statements from the T1 set, with a known deliberate error introduced in each statement (a prompt sketch follows this list).
I review and discard statements that are irrelevant or unanswerable.
I will compose the "student answers": wrong for odd-numbered statements, accurate otherwise.
The model assesses the answers and marks out of N, where N is the number of valid statements.
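The error-insertion step can reuse the same model wrappers. A sketch with placeholder prompt wording; asking the model to name the error it introduced gives me the "known" error to check against during review.

```python
def generate_flawed_statement(ask_model, t1_question, reference_answer):
    prompt = (
        "Rewrite the answer below as a single factual statement, but "
        "introduce exactly ONE deliberate error. Reply with the flawed "
        "statement on the first line and a short note naming the error "
        "on the second line.\n\n"
        f"Question: {t1_question}\nAnswer: {reference_answer}"
    )
    return ask_model(prompt)
```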
Metrics
Error quality (I will give a mark of 0 to 3 to each LLM statement using a rubric similar to the one above for T3).
Accuracy of assessing the student answers: the number of correct assessments divided by N (computed as sketched below).
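The assessment-accuracy metric is a straight agreement count over the N valid statements. A sketch, assuming I record the model's verdict on each student answer alongside the verdict it should have given (wrong for odd-numbered statements, right for even-numbered ones, by construction):

```python
def t4_assessment_accuracy(model_verdicts, expected_verdicts):
    # Both lists hold one boolean per valid statement: did the model judge
    # the student's answer correct, and should it have?
    agreements = [m == e for m, e in zip(model_verdicts, expected_verdicts)]
    return sum(agreements) / len(agreements)
```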
Not Yet
When I later repeat these tests, I will add a metric for the "helpfulness" or "pedagogical value" of responses.
Conclusion
I believe this protocol is unambiguous, well defined, and comprehensive. It addresses the key functionalities I want to test and has clear metrics for each test type.
This week I will start implementing this protocol to test Gemini and OpenAI LLMs.