It depends on the temperature. There’s a variable you can play with that adds since randomness to the responses (LLMs are fully deterministic when temperature is 0). Sometimes the F1 or F2 score is used to determine correctness of many questions, but I don’t have a great understanding of how that metric works and what ChatGPTs is.
I wanna see the results if you ask ChatGPT the same question a million times. What percentage of responses would actually get the correct number?
I think that heavily depends on whether it gets the initial answer right, since it will use that as context
When you’re calling it through an API then you can simply choose not to pass it any context
Fair enough
It depends on the temperature. There’s a variable you can play with that adds since randomness to the responses (LLMs are fully deterministic when temperature is 0). Sometimes the F1 or F2 score is used to determine correctness of many questions, but I don’t have a great understanding of how that metric works and what ChatGPTs is.