Alternate quantizations

#2
by ZeroWw - opened

These are my own quantizations (updated almost daily).

The difference from normal quantizations is that I keep the output and embed tensors at f16
and quantize the other tensors to q5_k, q6_k or q8_0.
This creates models that are degraded little or not at all and have a smaller size. They run at about 3-6 tokens/sec on CPU only using llama.cpp,
and obviously faster on computers with powerful GPUs.
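
To make the size trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the post above. The bits-per-weight figures are the nominal block sizes of the ggml types, the parameter split in the example is hypothetical, and the estimate ignores metadata, small tensors kept at higher precision, and any per-tensor type mixing the quantizer may apply.

```python
# Nominal bits per weight for a few ggml tensor types
# (block payload plus scales, averaged per weight).
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.5625, "q5_k": 5.5}


def estimate_size_bytes(n_embed_output, n_other, body_type, keep_type="f16"):
    """Rough file size when the embed/output tensors stay at keep_type
    and all remaining weights are stored as body_type."""
    keep_bits = n_embed_output * BITS_PER_WEIGHT[keep_type]
    body_bits = n_other * BITS_PER_WEIGHT[body_type]
    return (keep_bits + body_bits) / 8


if __name__ == "__main__":
    # Hypothetical parameter split, not measured from any real model.
    n_embed_output, n_other = 500_000_000, 5_500_000_000
    baseline = estimate_size_bytes(n_embed_output, n_other, "f16")
    for body in ("q5_k", "q6_k", "q8_0"):
        size = estimate_size_bytes(n_embed_output, n_other, body)
        print(f"f16 + {body}: {size / 1e9:.2f} GB ({size / baseline:.0%} of f16)")
```

The comparison baseline here is a full f16 file, which is an assumption made for illustration only.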

https://huggingface.co/ZeroWw/Yi-1.5-6B-Chat-GGUF

01-ai org

Thank you ZeroWw, that's a fresh perspective for me! Did you write your own tool for this purpose?
Also, by "little or no degradation", did you run a quantitative eval or just a subjective evaluation?

Subjective evaluation. I don't have the resources to do much more (but you are welcome to offer them to me).

I use the normal quantizing tool. I just set the output and embed tensors to f16 and quantize everything else.
That's because the output and embed tensors are the ones responsible for the main "understanding and expressing".
Most quantizations quantize everything in the same way and that's a mistake in my opinion.
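
For readers who want to try this, below is a minimal Python sketch that builds such a command line. It assumes "the normal quantizing tool" means llama.cpp's `llama-quantize` binary (named `quantize` in older builds) and that your build supports the `--output-tensor-type` and `--token-embedding-type` options. The file names and the helper `build_quantize_cmd` are hypothetical, and the exact command used for the linked models is not stated in this thread.

```python
import shlex

# Assumption: llama.cpp's quantize binary is on PATH under this name.
QUANTIZE_BIN = "llama-quantize"


def build_quantize_cmd(src_gguf, dst_gguf, body_type, keep_type="f16"):
    """Return an argv list that keeps the output and token-embedding
    tensors at keep_type and quantizes the rest to body_type."""
    return [
        QUANTIZE_BIN,
        "--output-tensor-type", keep_type,
        "--token-embedding-type", keep_type,
        src_gguf,
        dst_gguf,
        body_type,
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    for body in ("q5_k", "q6_k", "q8_0"):
        cmd = build_quantize_cmd("model.f16.gguf", f"model.f16.{body}.gguf", body)
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run without llama.cpp installed. Pass the list to `subprocess.run` once the paths point at real files.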
