4 bit version?

#3
by mpasila - opened

I tried doing it myself but ran into problems when using this: https://github.com/0cc4m/GPTQ-for-LLaMa (it adds support for MPT models)

I was looking into this as well. I tried to use the main GPTQ-for-LLaMa to quantize it (this model just sounds a million times more promising than the original), but I'm getting errors because it is not a LLaMA model. I saw that about a week ago Occam released a quantized version, so it is doable (https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g). I just don't know how. I also looked through Occam's GitHub with his version of KoboldAI and originally just didn't see his GPTQ implementation.
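
For anyone hitting the same "not a llama model" error, here is a minimal sketch of the kind of check involved. It assumes the `model_type` field in the model's `config.json` is what tells the architectures apart. The helper name, the supported set, and the example config fragments are hypothetical and for illustration only; this is not code from either GPTQ repo.

```python
import json

# Hypothetical: the architectures a llama-only quantization script would accept.
SUPPORTED_BY_LLAMA_SCRIPT = {"llama"}


def check_model_type(config_text, supported=SUPPORTED_BY_LLAMA_SCRIPT):
    """Return (model_type, ok) given the text of a model's config.json."""
    config = json.loads(config_text)
    model_type = config.get("model_type", "unknown")
    return model_type, model_type in supported


# Illustrative config fragments, not copied from a real repo.
mpt_config = '{"model_type": "mpt"}'
llama_config = '{"model_type": "llama"}'

print(check_model_type(mpt_config))
print(check_model_type(llama_config))
```

If the check comes back false for your model, a llama-only script is likely to fail somewhere, and a fork with explicit support for that architecture is the thing to look for.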

Anyway, now that I see mpasila's link I'm going to try that route. I have data right now too, so if it works I would be happy to upload a working model. Maybe TheBloke will beat me to it, hah.

Edit: I tried every which way to make the GPTQ fork linked above work. Does anyone have the sauce? I even tried the gptneox version, which at least failed a different way (CUDA memory overrun). When I run the llama version, it fails every time, complaining that the tokenizer is not compatible with the NeoX-style tokenizer.

I also tried both install methods: the old way with the conda env, and the new way of making a new conda env and then running the pip install git command listed on the repo. I couldn't get the pip install way to work at all.

I will have a look tomorrow if I have the time.

So if I had to guess, we need that layer mapping...

Looking forward to it! @TheBloke Thanks! :D
