• It depends on the temperature. That’s a setting you can adjust that adds some randomness to the responses (at temperature 0 an LLM always picks its most likely next token, so in principle the output is fully deterministic). Sometimes the F1 or F2 score is used to judge correctness across many questions, but I don’t have a great understanding of how that metric works or what ChatGPT’s score is.
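To make the temperature point concrete, here is a minimal sketch of temperature-scaled sampling over a handful of next-token scores. The `sample_token` function and the example `logits` are made up for illustration; this is not how any particular model or API is implemented, just the general idea of dividing the scores by the temperature before sampling, with temperature 0 treated as "always take the top choice".

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick an index from `logits` using temperature-scaled softmax sampling.

    Hypothetical helper for illustration only.
    """
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

# Temperature 0: the same token every time.
greedy = [sample_token(logits, 0) for _ in range(5)]

# Temperature 1.0: the choice varies from call to call.
rng = random.Random(0)  # seeded only so the sketch is repeatable
sampled = [sample_token(logits, 1.0, rng) for _ in range(1000)]
```

With these scores, `greedy` contains only index 0, while `sampled` mixes all three indices, mostly the first. That is why the same prompt can give different answers once the temperature is above 0.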