8-bit quantization
Any plans for an 8-bit quantized model? I see that you don't make such models; why is that? I think this would be the best option for GPU usage.
GGML models are CPU-only, so the GPU isn't involved.
I've never bothered with q8_0 because q5_1 is already so incredibly close to fp16 that there didn't seem to be any point.
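For context on why q8_0 sits so close to fp16: each block of 32 weights is stored as a single fp16 scale plus 32 int8 values, so the per-weight rounding error is tiny. Here's a rough Python sketch of that round trip (the block size and scale formula follow ggml's q8_0 layout; the helper name and toy data are just for illustration):

```python
import numpy as np

QK8_0 = 32  # ggml's q8_0 block size

def q8_0_roundtrip(x: np.ndarray) -> np.ndarray:
    """Quantize to q8_0-style blocks (one fp16 scale + 32 int8 values), then dequantize."""
    out = np.empty_like(x)
    for i in range(0, len(x), QK8_0):
        block = x[i:i + QK8_0]
        amax = float(np.abs(block).max())
        d = np.float16(amax / 127.0)  # per-block scale, stored as fp16
        if d == 0:
            out[i:i + QK8_0] = 0.0
            continue
        q = np.clip(np.round(block / np.float32(d)), -127, 127).astype(np.int8)
        out[i:i + QK8_0] = q.astype(np.float32) * np.float32(d)
    return out

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # toy weight row
print(f"max round-trip error: {np.abs(q8_0_roundtrip(w) - w).max():.2e}")
```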
Here's the relevant part of the quantisation table from the README of llama.cpp (13B perplexity; lower is better):

| Model | Measure    | F16    | Q5_1   | Q8_0   |
|-------|------------|--------|--------|--------|
| 13B   | perplexity | 5.2455 | 5.2582 | 5.2458 |
On a 13B model like this, fp16 scores 5.2455 and q5_1 scores 5.2582. That's a difference of 0.24%. q8_0 scores 5.2458, which is only 0.006% 'worse' than fp16. So it is better than q5_1, but is anyone ever really going to be able to spot the difference between 0.24% higher perplexity and 0.006% higher?
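For anyone who wants to check the maths, here's a quick Python check of those relative differences, using the 13B perplexities quoted above:

```python
# 13B perplexities from the llama.cpp README table above.
fp16, q5_1, q8_0 = 5.2455, 5.2582, 5.2458

for name, ppl in [("q5_1", q5_1), ("q8_0", q8_0)]:
    print(f"{name}: +{100 * (ppl - fp16) / fp16:.3f}% perplexity vs fp16")
# prints: q5_1: +0.242%  /  q8_0: +0.006%
```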
So that's why I never bothered. Then again, I guess I could upload them just for completeness! It's not like they use any disk space for me once I've uploaded them - that's on HF :)
OK, next time I do a model I'll do q8_0 as well, and maybe I'll add some q8_0 files for the last couple of models I did too.
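If anyone wants to make one themselves in the meantime, here's a minimal sketch of how I'd script it, assuming a built llama.cpp checkout with its quantize tool, and with placeholder file names:

```python
import subprocess

# Placeholder paths: an fp16 GGML file and the desired q8_0 output.
src = "ggml-model-f16.bin"
dst = "ggml-model-q8_0.bin"

# llama.cpp's quantize tool takes: input file, output file, target quant type.
subprocess.run(["./quantize", src, dst, "q8_0"], check=True)
```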