There is a reason there is sometimes a notable decrease in quality of the same AI model a while after it’s released.
The companies hosting the models (like OpenAI or Microsoft) may have switched to a quantized version of the model. Quantization is a common practice to improve power efficiency and make the model easier to run, by essentially rounding its weights to a lower precision. This reduces VRAM and storage usage significantly, at the cost of some quality, and the more aggressive the quantization, the worse the quality gets.
For example, the base model will likely be in FP16, i.e. 16-bit floating point, the precision most models are released at. They may switch to a Q8 version, which roughly halves the size of the model, with about a 3-7% decrease in quality.
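To make the rounding idea concrete, here's a toy sketch of symmetric 8-bit weight quantization (plain NumPy, made-up names; real schemes like GPTQ or llama.cpp's Q8_0 work per block with stored scales and are considerably smarter):

```python
import numpy as np

def quantize_q8(weights: np.ndarray):
    """Round FP weights to int8 plus one FP scale for the whole tensor."""
    scale = np.abs(weights).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_q8(q: np.ndarray, scale) -> np.ndarray:
    """Recover approximate FP weights at inference time."""
    return q.astype(np.float16) * scale

w = np.random.randn(4096).astype(np.float16)       # a fake weight row
q, s = quantize_q8(w)
w_hat = dequantize_q8(q, s)

print("bytes before:", w.nbytes, "after:", q.nbytes)       # roughly half the storage
print("mean abs rounding error:", np.abs(w - w_hat).mean())
```

The rounding error in the last line is the "bit of quality" you pay for the smaller footprint.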
Expertly explained. Thank you! It’s pretty rad what you can get out of a quantized model on home hardware, but I still can’t understand why people are trying to use it for anything resembling productivity.
But if that’s how you’re going to run it, why not also train it in that mode?
That is a thing, and it's called quantization-aware training. Some open-weight models like Gemma do it.
The problem is that you need to re-train the whole model for that, and if you also want a full-quality version, that means a lot more training.
The result is still less precise, so it will still be somewhat worse than full precision, but it does reduce the quality loss.
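For what it's worth, the core trick in quantization-aware training is "fake quantization": the forward pass simulates the rounding so the weights learn to live with it, while gradients still update the full-precision copy. A minimal PyTorch sketch, assuming a single per-tensor scale (real setups use per-channel scales, observers, etc.):

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    def forward(self, x):
        scale = self.weight.detach().abs().max() / 127.0
        q = torch.clamp(torch.round(self.weight / scale), -127, 127) * scale
        # straight-through estimator: forward uses the rounded weights,
        # but gradients flow to the full-precision self.weight
        w_q = self.weight + (q - self.weight).detach()
        return nn.functional.linear(x, w_q, self.bias)

layer = FakeQuantLinear(16, 16)
out = layer(torch.randn(2, 16))
out.sum().backward()               # gradients reach the FP weights as usual
print(layer.weight.grad.shape)     # torch.Size([16, 16])
```

So the FP weights keep training normally, they just get nudged toward values that survive the rounding, which is why the quantized export loses less than a post-hoc quantization would.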
Your response reeks of AI slop
4/10 bait
Is it, or is it not, AI slop? Why are you using markdown formatting so heavily? That is a telltale sign of an LLM being involved
I am not using an llm but holy bait
Hop off the reddit voice
…You do know what platform you’re on? It’s a REDDIT alternative
It sounds like the typical tech industry:
“Look how amazing this is!” (Full power)
“Uh…uh oh, that’s unsustainable. Let’s quietly drop it.” (Way reduced power)
“People are saying it’s not as good, we can offer them LLM Plus for better accuracy!” (3/4 power with subscription)