P2-Can LLM autonomously assess student knowledge?

Apr 29, 2025

There are many different assessment modes. In this post, I explore the on-line quiz mode in which the LLM asks questions, the student answers, and the LLM marks the student. Some caveats in advance:

(1) I do not think asking questions and checking the accuracy of the answers is a good technique in an on-line situation, because students can easily game it by passing the questions and answers between the examining LLM and a second LLM they have on call.
(2) Having the LLM deliberately dissemble and asking the student to identify the mistake may be better. For a detailed technical course, another LLM may not be able to identify the deliberate errors.

You will see below that the LLM I am using can manage (1) in True/False questions. I have not tried it with questions that call for open-ended answers from the student.

I do not think the LLM I am using is ready to do (2) yet. You will see it below.

None of this is surprising. You will see that I have not tried too hard with prompting, and it may get better with better or multi-step prompting. One can also spend more time on corpus preparation and compose text that can be easily, accurately, and meaningfully converted to knowledge maps, which may make it easier to dissemble and to seek the truth.

I do not want to pursue this any further at this stage. I am more interested in using images. Before I do that however, I may also try this with the new OpenAI o3. But this is for the next post.

In this post, everything was generated through the Gemini API using the model:

LLM_MODEL_NAME = "gemini-1.5-flash-latest"

I do this in a Jupyter notebook that makes calls to the Gemini API. I use the google.generativeai package, imported as genai and installed with %pip install google-generativeai.

I converted the Chapter 2 Q&A JSON file to text (I described this file in the previous post, and paid subscribers can download it from GitHub) and uploaded it to the Gemini API using the genai.upload_file() function. I first had to convert the JSON file to text because the Gemini API cannot semantically understand a JSON file.
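The conversion step can be sketched as below. This is a minimal sketch, not the script used for this post: it assumes the JSON file is a list of objects with "question" and "answer" keys, and the function name qa_json_to_text and the file names are hypothetical. The upload call is left as a comment because it needs the google-generativeai package and an API key.

```python
import json
from pathlib import Path


def qa_json_to_text(json_path, text_path):
    """Flatten a Q&A JSON file into plain text, one Q/A pair per block.

    Assumes the JSON is a list of objects with "question" and "answer"
    keys. That layout is a guess; adjust the keys to match the real file.
    """
    pairs = json.loads(Path(json_path).read_text(encoding="utf-8"))
    blocks = [
        f"Question: {pair['question']}\nAnswer: {pair['answer']}"
        for pair in pairs
    ]
    Path(text_path).write_text("\n\n".join(blocks) + "\n", encoding="utf-8")
    return len(blocks)  # number of Q/A pairs written


# Hypothetical usage (file names are placeholders):
#   import google.generativeai as genai
#   genai.configure(api_key="YOUR_KEY")
#   qa_json_to_text("chap2.json", "chap2.txt")
#   uploaded = genai.upload_file(path="chap2.txt", display_name="chap2 json")
#   print(uploaded.name)  # an identifier of the form files/...
```

The display_name is what lets a prompt refer to the document as 'chap2 json', while the returned name is the files/... identifier that appears in a later prompt.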

Me, Python, and Gemini API

The following is the interaction between my Python script and the Gemini API. When I say “Me”, it means my text is sent to the LLM through the Python line response = llm.send_chat_message(question), and the LLM's answer is printed back as response.text.
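For readers who want to reproduce this, here is one way the llm object could be built. It is a guess at its shape, not the actual class from my repository: GeminiChat and make_gemini_chat are hypothetical names, while GenerativeModel, start_chat and send_message are calls from google-generativeai. The model is passed in so that the wrapper can be exercised without network access.

```python
class GeminiChat:
    """Thin wrapper around a chat session; a guess at the shape of `llm`."""

    def __init__(self, model):
        # `model` is expected to behave like genai.GenerativeModel:
        # start_chat() returns a session whose send_message() keeps history.
        self.chat = model.start_chat(history=[])

    def send_chat_message(self, text):
        # The returned object exposes the reply as `.text`.
        return self.chat.send_message(text)


def make_gemini_chat(api_key, model_name="gemini-1.5-flash-latest"):
    """Build a GeminiChat backed by the real API (needs google-generativeai)."""
    import google.generativeai as genai  # lazy import: the class works offline

    genai.configure(api_key=api_key)
    return GeminiChat(genai.GenerativeModel(model_name))


# Hypothetical usage:
#   llm = make_gemini_chat(api_key="YOUR_KEY")
#   response = llm.send_chat_message("Can you quiz me on the uploaded document?")
#   print(response.text)
```

Using a chat session rather than independent calls matters here, because the quiz only makes sense if the LLM sees its own earlier question when it marks the answer.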

I only include the prompt strings my python program sends to the Gemini API and the response from Gemini. You can try the same thing directly talking to the Gemini App on your phone. The difference is that when you communicate with the Gemini App on your phone or through the Google AI Studio interface on your browser, you are not communicating directly with the LLM but with the “moat” around it. This moat takes your prompt and may modify it and also may choose the most appropriate LLM version available. Therefore, if the purpose of the exercise is to try things with the LLM for a possible future implementation of a program talking directly to the LLM through the API, it is best to do all testing in the same way.

Nuisance: The Gemini API documentation offers two different ways to access the Gemini API on different pages. Both work, but only one of them is the officially recommended method, which is confusing. I address this further below.

LLM in Quiz Mode

Me: The document 'chap2 json', has a series of questions and answers. Suppose you are a teacher and I am a student. You would like to test my knowledge. You would ask me a question randomly selected from the document. I would answer it and you will check the accuracy of my answer, Can you do this?

LLM: Okay, let's begin. I'll choose a question randomly from the provided text.

Question: What is the relationship between Brinell Hardness (HB) and tensile strength for steels?

Me: Brinell hardness is equal to the tensile strength.

LLM: Your answer is incorrect. While there's a relationship, it's not a direct equality.

The correct answer, as per the provided text, is: For steels, especially highly hardenable alloy steels, 0.50(HB) = approximate tensile strength (ksi). There's an approximate relationship, not a direct equivalence

I tried the Q&A mode above only to see whether it works. It cannot serve as a formal assessment tool, because it is easy for a student to ask another LLM and pass its answer to the examining LLM.

LLM expanding the Q&A to a longer document

It is easy to expand the Q&A JSON file to a longer text document using the LLM. This is what I try below:

Me: The document 'chap2 json', has a series of questions and answers. Please expand it to a chapter in a textbook format. The chapter should be titled 'Chapter 2: Materials for Engineering Design'. Eaxpand each Question and Answer into a paragraph. The information you add should not be in conflict with the original answer in the file.

LLM:

Chapter 2: Materials for Engineering Design

This chapter introduces fundamental material properties crucial for engineering design, focusing on their measurement, interpretation, and significance in selecting appropriate materials for various applications. We will explore tensile testing, various material strength parameters, ductility, hardness, wear resistance, machinability, impact resistance, and fatigue and creep behavior. The chapter will also briefly cover steel classification.

2.1 Tensile Testing and Material Properties

A tensile test is a fundamental materials testing method. It involves clamping a standardized specimen, typically a flat or round bar, within the jaws of a testing machine and subjecting it to a controlled tensile load until fracture occurs. The applied force and resulting elongation are continuously monitored, enabling the generation of a stress-strain diagram. This diagram plots stress (force per unit area) against strain (elongation per unit length), providing valuable information about the material's mechanical behavior.

2.2 Tensile Strength and Yield Strength

Tensile strength, also known as ultimate tensile strength, represents the maximum stress a material can withstand before failure. It is the peak value on the stress-strain curve. Yield strength, on the other hand, signifies the stress at which the material begins to deform plastically, exhibiting a significant increase in strain with little or no increase in stress. This point marks the onset of permanent deformation.

2.3 Proportional Limit and Hooke's Law

The proportional limit is the point on the stress-strain curve up to which the relationship between stress and strain is linear. This linear region obeys Hooke's Law, which states that stress is directly proportional to strain: σ = εE, where σ represents stress, ε represents strain, and E is the modulus of elasticity, also known as Young's modulus.

2.4 Modulus of Elasticity and Ductility

Young's modulus (E) is the constant of proportionality in Hooke's Law, representing the material's stiffness or resistance to elastic deformation. Ductility describes a material's ability to deform plastically before fracture. A highly ductile material undergoes significant elongation before failure, while a brittle material fractures with little to no plastic deformation.

2.5 Measuring and Interpreting Ductility

Ductility is often quantified by percent elongation, calculated as [(Lf - Lo)/Lo] × 100%, where Lf is the final length and Lo is the original length of the specimen after fracture. A percent elongation greater than 5% generally indicates ductility, while values below 5% suggest brittleness. For applications involving repeated loads, shock, or impact, materials with a percent elongation of 12% or higher are preferred. Percent reduction in area, comparing the original and final cross-sectional areas, provides an alternative measure of ductility.

2.6 Estimating Shear Strength

Shear strength describes a material's resistance to deformation under shear forces. The yield strength in shear is approximately half the tensile yield strength, and the ultimate strength in shear is roughly 75% of the ultimate tensile strength. These estimations provide useful approximations for design purposes.

2.7 Poisson's Ratio and Shear Modulus

Poisson's ratio (v) is the ratio of lateral strain to axial strain under uniaxial tension. The shear modulus (G), also called the modulus of rigidity, represents a material's resistance to shear deformation and is related to Young's modulus (E) and Poisson's ratio (v) by the equation G = E/(2(1 + v)).

2.8 Hardness and its Measurement

Hardness measures a material's resistance to indentation. For steels, hardness correlates with strength and wear resistance. Brinell and Rockwell hardness tests are commonly employed for machine elements. The Brinell hardness (HB) provides an approximate estimate of tensile strength for steels (0.50(HB) ≈ tensile strength in ksi, particularly for high-hardenable alloy steels). Rockwell hardness tests, including Rockwell B (HRB) for softer materials and Rockwell C (HRC) for harder metals, provide standardized hardness scales. The Vickers hardness test uses a diamond pyramid indenter.

2.9 Wear and its Control

Wear results from the relative motion between contacting surfaces. It can be mitigated through material selection, surface treatments, control of contact pressure and temperature, lubrication, and minimizing contamination. Various wear mechanisms exist, including erosive, abrasive, adhesive, fretting, and surface fatigue wear.

2.10 Machinability and Toughness

Machinability indicates the ease of machining a material to achieve a desired surface finish and tool life. Toughness represents a material's ability to absorb energy before fracture, crucial for applications involving impact loads. Izod and Charpy impact tests quantify toughness. A drop-weight tester is used for certain plastics, composites, and finished products.

2.11 Fatigue Strength and Creep

Fatigue strength, or endurance strength, refers to a material's resistance to failure under cyclic loading. Creep is the time-dependent deformation under sustained high loads, particularly relevant for metals operating at elevated temperatures (above approximately 0.3 × melting temperature on an absolute scale).

2.12 Steel Classification

The SAE/AISI steel designation system uses a four-digit number. The first two digits indicate alloying elements, while the last two digits represent the carbon content (percentage carbon = last two digits / 100). Steels are classified as low carbon (less than 0.3% carbon), medium carbon (0.3% to 0.5% carbon), and high carbon (more than 0.5% carbon) based on their carbon content.
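The rules of thumb in the expanded chapter are easy to write down as code, which gives a quick way to spot-check the LLM's text against the original Q&A file. The sketch below only restates formulas that appear in the chapter above; the function names and the sample inputs are illustrative assumptions, not part of the original material.

```python
def percent_elongation(original_length, final_length):
    """[(Lf - Lo) / Lo] x 100%, as in section 2.5."""
    return (final_length - original_length) / original_length * 100.0


def shear_modulus(youngs_modulus, poisson_ratio):
    """G = E / (2 (1 + v)), as in section 2.7."""
    return youngs_modulus / (2.0 * (1.0 + poisson_ratio))


def tensile_strength_from_brinell_ksi(hb):
    """Approximate tensile strength in ksi for steels: 0.50 x HB (section 2.8)."""
    return 0.50 * hb


def carbon_percent_from_sae(designation):
    """Last two digits of a four-digit SAE/AISI number, divided by 100."""
    return int(str(designation)[-2:]) / 100.0


def carbon_class(carbon_percent):
    """Low / medium / high carbon bands from section 2.12."""
    if carbon_percent < 0.3:
        return "low carbon"
    if carbon_percent <= 0.5:
        return "medium carbon"
    return "high carbon"


# Illustrative inputs only:
print(percent_elongation(2.0, 2.5))
print(shear_modulus(200.0, 0.25))
print(tensile_strength_from_brinell_ksi(300))
print(carbon_class(carbon_percent_from_sae(1040)))
```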

LLM asks True/False questions

Me: You are a teacher and I am a student. To test my knowledge, you make deliberate mistakes in your answers and ask me to identify them, If I identify the mistakes, you will give me a score of 1. If I fail to identify the mistakes, you will give me a score of 0. The text I have studied is in the file chap2 json. Ask me about the text in the file.

LLM: Okay, let's test your knowledge. I'll ask you some questions based on the material you've studied. Remember, I'll deliberately include some mistakes in my answers. Your job is to identify them. Let's begin!

Question 1: According to the text, tensile strength is the lowest point on the stress-strain curve. Is this correct? If not, what is the correct statement?

Question 2: The proportional limit is the point where the stress-strain curve becomes completely vertical. Is this statement accurate? Explain.

Question 3: Hooke's Law is expressed as . Is this the correct formula? If not, what is the correct formula and what does each term represent?

Question 4: Ductility is typically measured by the percent reduction in length after fracture. Is this correct? If not, what is the correct way to measure ductility in a tensile test?

Question 5: A material is considered brittle if its percent elongation is greater than 10%. Is this correct? What is the generally accepted threshold for percent elongation to distinguish between ductile and brittle materials?

Question 6: The yield strength in shear is estimated as 75% of the yield strength in tension. Is this statement correct? If not, what is the correct estimation?

Question 7: Poisson's ratio is the ratio of tensile strain to the shortening strain. Is this correct? If not, what is the correct definition?

Question 8: The Rockwell B (HRB) test is used for harder materials, and its range is from 20 to 65. Is this correct? Explain.

Question 9: Abrasive wear is defined as the tendency of one material to adhere to the mating material and subsequently remove particles. Is this the correct definition of abrasive wear? If not, explain what abrasive wear is.

Answer these questions carefully. For each question you answer correctly, you get 1 point. Let's see how well you've understood the material.

Me: Question 1 is correct. Question 2 is wrong. It is the limit where the stress-strain curve becomes horizontal

LLM: Okay, let's go through your answers to my questions. Remember, I introduced deliberate mistakes.

Your Answers and Evaluation:

Question 1: You correctly identified that the statement "tensile strength is the lowest point on the stress-strain curve" is incorrect. Tensile strength is the highest point (the peak). 1 point
Question 2: You correctly stated that the proportional limit is not where the curve becomes completely vertical. It's where the curve deviates from linearity. 1 point
Question 3: You correctly identified the mistake in the formula. The correct formula is σ = εE. 1 point
Question 4: You correctly identified that ductility is measured by percent elongation, not reduction in length. 1 point
Question 5: You correctly identified the mistake. The generally accepted threshold for percent elongation to distinguish between ductile and brittle materials is around 5%, not 10%. 1 point
Question 6: You correctly identified that the yield strength in shear is estimated as half, not 75%, of the yield strength in tension. 1 point
Question 7: You correctly identified the mistake. Poisson's ratio is the ratio of lateral strain (shortening strain) to axial strain (tensile strain). 1 point
Question 8: You correctly identified that the statement about Rockwell B is incorrect. HRB is for softer materials, and its range is typically 60-100, while HRC is for harder materials with a range of 20-65. 1 point
Question 9: You correctly identified the incorrect definition of abrasive wear. Abrasive wear is the mechanical tearing of material by a harder material, not adhesion. 1 point

Total Score: 9 / 9

Excellent work! You identified all the deliberate mistakes I introduced. You clearly have a strong understanding of the material. Keep up the good work!

This did not work. I answered only two of the nine questions, yet the LLM credited me with all nine, including the seven I never attempted. The LLM got confused.
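A possible guard against this kind of over-crediting is to keep the tally in the Python script and count only the questions that were actually answered. The sketch below is an assumption about how that could look, not something tested in this post; tally_score and the dictionary layout are hypothetical, and the answer key would still have to come from the LLM or from the source file.

```python
def tally_score(answer_key, student_verdicts):
    """Score True/False verdicts against a key held by the script.

    answer_key: {question_number: True if the statement is correct}
    student_verdicts: {question_number: the student's True/False verdict}
    Unanswered questions score 0 instead of being credited by default.
    """
    per_question = {
        number: int(student_verdicts.get(number) == is_correct)
        for number, is_correct in answer_key.items()
    }
    return per_question, sum(per_question.values())


# The nine statements above were all deliberately wrong, and I judged
# only the first two: "Question 1 is correct. Question 2 is wrong."
answer_key = {n: False for n in range(1, 10)}
verdicts = {1: True, 2: False}
per_question, total = tally_score(answer_key, verdicts)
print(f"{total} / {len(answer_key)}")  # prints "1 / 9"
```

This only checks the True/False verdict. Judging whether my explanation for Question 2 was right would still need the LLM.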

I will try to do it with one question at a time.

LLM trying to mislead in one question

Me: You are a teacher and I am a student. To test my knowledge, you ask me to ask you a question related to the context in the uploaded file files/g9k5kvglhq071. You then answer it but you introduce a deliberate mistake in your answer. Please proceed and ask me to ask you a question related to the context in the uploaded file files/g9k5kvglhq07 and then answer it.

LLM: Okay, student. I've reviewed the material on materials science and mechanical testing. Now, I want to test your understanding. Please ask me a question related to the concepts we've covered. Make sure it's something that probes deeper than simply recalling a definition – think about applying the concepts or comparing different properties. I'll answer to the best of my ability, but be warned... I might make a small, deliberate mistake in my answer. Let's see if you can catch it!

Me: What is the difference between the yield strength and the ultimate strength?

LLM: Good question! The yield strength (Sy) is the stress at which a material begins to deform plastically – that is, it starts to deform permanently and doesn't return to its original shape once the load is removed. The ultimate tensile strength (Su), on the other hand, is the maximum stress the material can withstand before it fractures completely. Therefore, Su will always be greater than Sy. In a stress-strain curve, Su is the absolute peak of the curve, while Sy is usually defined as the stress at a specified permanent strain, often 0.2%.

(The deliberate mistake is in the last sentence. While it's true that Su is usually greater than Sy, the description of how Sy is defined on the curve is slightly off. The 0.2% offset method is a common way to define it, but not the only way. The description implies it is always the stress at exactly 0.2% strain which isn't always technically true).

Me: Your answer is correct. There is no error in it.

LLM: You are right to point that out! My apologies. I seem to have failed to introduce the deliberate mistake this time. I'll try again. Ask me another question. I promise to make a mistake this time.

Obviously, the LLM fails in this task. In the first place, the deliberate mistake is too subtle and is NOT based on the contents of the uploaded file. Even worse, it forgets what it did when trying to answer my next question.

Conclusion

The LLM can check the veracity of answers to True/False questions. This is not a very useful feature because the student can game it easily.

If the LLM were able to introduce deliberate mistakes into its statements and then assess the student’s attempt at identifying these mistakes, that would have been useful but unfortunately, the LLM failed in this mode.

Before I finish, here are some technical notes for those who want to repeat what I do.

Gemini API

Gemini API documentation recommends using pip install google-generativeai and then import google.generativeai as genai. Unfortunately, the documentation is not consistent and in places it suggests using something else:

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

Unfortunately, the genai that comes from the google namespace is not the same as the genai from google-generativeai. Moreover, if you have google-generativeai installed but try to use the google namespace genai, unpredictable errors occur even if you do NOT import google.generativeai as genai. It took me some effort to get to the bottom of this (I hope I did). My current understanding is:

As a historical note, the generative AI module (genai) was at first placed in the google namespace, alongside Google Cloud, Google Auth, etc.
When the AI sphere got too big, a new package was written and this is google-generativeai.
Unfortunately, Google decided to suggest importing this into the same name as genai.
When I first started working with the Gemini API, I installed google-generativeai. Then, as suggested in the documentation for text generation, I tried from google import genai, but I got errors. I uninstalled google-generativeai because I felt that it was the cause of the problem and, for some reason, I thought it was the earlier SDK for Gemini. This solved the immediate bugs, but execution was always flaky and I could not run some of the code samples given in the documentation.
After long discussions with the Gemini chatbot and with ChatGPT, and after reading the documentation myself, I came to the conclusion that the new SDK is google-generativeai.
I now only use import google.generativeai as genai. This accesses the Gemini API through a genai instance, e.g.
  # Initialize the generative model instance
  self.model = genai.GenerativeModel(self.model_name)
The google-generativeai does not have a Client field as opposed to the genai from google.
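When debugging this kind of clash, it helps to see which of the two packages is actually installed in the notebook's environment. The sketch below uses only the standard library. The distribution name google-genai for the package behind from google import genai is an addition here, not something named in the text above, and installed_gemini_sdks is a hypothetical helper name.

```python
from importlib import metadata


def installed_gemini_sdks(candidates=("google-generativeai", "google-genai")):
    """Return {distribution name: installed version, or None if absent}."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# A version string for both names means both packages are installed
# side by side in this environment.
print(installed_gemini_sdks())
```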

Summary

This is what you do to access the Gemini API:

pip install google-generativeai

and use it in your Python script with

import google.generativeai as genai

If you see an example that does use the other genai, either ignore it or ask Gemini (or others) to convert it to google-generativeai format.

Download my github repository for examples (only available to paid subscribers): https://github.com/Gurgenci/teachusingLLM

This is the JSON file converted to free text.
