latest update broke use
Since the latest update, the example fails:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse
model_name_or_path = "TheBloke/Nous-Hermes-13B-GPTQ"
model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
says:
(h2ogpt) jon@pseudotensor:~/h2ogpt$ python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer, pipeline, logging
>>> from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
>>> import argparse
>>>
>>> model_name_or_path = "TheBloke/Nous-Hermes-13B-GPTQ"
>>> model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
>>> use_triton = False
>>>
>>> tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
>>>
>>> model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
... model_basename=model_basename,
... use_safetensors=True,
... trust_remote_code=True,
... device="cuda:0",
... use_triton=use_triton,
... quantize_config=None)
Traceback (most recent call last):
  <stdin>:1 in <module>

  /home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/auto_gptq/modeling/auto.py:94 in from_quantized

     91             for key in signature(quant_func).parameters
     92             if key in kwargs
     93         }
  >  94         return quant_func(
     95             model_name_or_path=model_name_or_path,
     96             save_dir=save_dir,
     97             device_map=device_map,

  /home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:714 in from_quantized

    711                     break
    712
    713         if resolved_archive_file is None: # Could not find a model file to use
  > 714             raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
    715
    716         model_save_name = resolved_archive_file
    717
FileNotFoundError: Could not find model in TheBloke/Nous-Hermes-13B-GPTQ
>>>
However this now works:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse
model_name_or_path = "TheBloke/Nous-Hermes-13B-GPTQ"
model_basename = "model"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
Correct, I have renamed all models to model.safetensors to prepare for native Transformers GPTQ support, which is coming in the next couple of days.
I've updated all my code examples to show that model_basename = "model" should be used, but I've not yet put out more detailed documentation. That will be coming to all GPTQ repos as soon as the new Transformers version goes live, hopefully tomorrow.
In fact you can now leave out model_basename entirely. I also updated quantize_config.json to indicate that model_basename=model, so there's no need to manually specify model_basename in .from_quantized() any more. When I update the docs properly I will remove that. Actually I'll remove all AutoGPTQ code, and show loading it directly from Transformers.
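For anyone who keeps a local copy of a repo and wants to see which basename would be picked up, here is a minimal sketch of that lookup. It is not AutoGPTQ's own code: the helper name resolve_model_basename is made up for illustration, and the JSON key "model_file_base_name" is an assumption about how quantize_config.json records the basename, so check your own file.

```python
import json
from pathlib import Path


def resolve_model_basename(model_dir, default="model"):
    """Return the weights basename recorded in quantize_config.json.

    Hypothetical helper for illustration only. Falls back to `default`
    when the config file or the key is missing.
    """
    config_path = Path(model_dir) / "quantize_config.json"
    if not config_path.is_file():
        return default
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Assumed key name; adjust it if your quantize_config.json differs.
    return config.get("model_file_base_name") or default
```

With the renamed repos this should return "model", which matches the model.safetensors file name, so passing model_basename by hand becomes redundant.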
So will you be moving away from AutoGPTQ as the main inspiration for GPTQ? I know you tracked his project and were pushing him to work on it more :) Once it's in Transformers, is there no need for AutoGPTQ or GPTQ-for-LLaMa?
The Transformers implementation uses AutoGPTQ as its backend, so AutoGPTQ will still be required. To use GPTQ in Transformers, the user will need three packages: transformers, optimum, and auto-gptq.
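A rough sketch of what that could look like once the new Transformers version is out. The helper names missing_modules and load_gptq_model are made up for illustration. The sketch assumes the released version loads the GPTQ checkpoint through the usual AutoModelForCausalLM.from_pretrained call, that auto-gptq installs under the import name auto_gptq, and that accelerate is also present for device_map="auto".

```python
import importlib.util

# Import names of the three packages; auto-gptq is assumed to import as auto_gptq.
REQUIRED_MODULES = ("transformers", "optimum", "auto_gptq")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that are not installed."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def load_gptq_model(repo_id="TheBloke/Nous-Hermes-13B-GPTQ"):
    """Load tokenizer and model via Transformers (hypothetical helper)."""
    missing = missing_modules()
    if missing:
        raise ImportError("Missing packages: " + ", ".join(missing))
    # Imported lazily so the check above can run without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
    # Assumption: no model_basename is needed here; device_map="auto" needs accelerate.
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return tokenizer, model
```

Calling missing_modules() first gives a clearer error than a failed import deep inside the loader if one of the three packages is absent.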
So AutoGPTQ will still be vital.
But yeah GPTQ-for-LLaMa is dead as far as I'm concerned!
Ok cool.