Is it possible to run Mistral-7B-v0.1 in the free-tier Google Colab notebook?
Hello,
I tried to run https://huggingface.co/mistralai/Mistral-7B-v0.1 in a free Google Colab notebook (for a fine-tuning experiment). However, unfortunately, I am getting a CPU-side out-of-memory error (kernel crashes) while trying to even download the model checkpoint from HuggingFace to the notebook (in other words, the process does not even get to the GPU stage). Am I doing something wrong? Or is this model just too large for the free-tier Google Colab account? I counted that the two binary files add up to 15GB, but am not sure if this is the correct calculation. Or perhaps there is a special technique that has to be employed in order to be able to download this model? It would be nice to get the official guidance; or maybe somebody knowledgeable can advise.
Thank you very much in advance for your help.
Thank you very much; this is helpful, although it does not answer the question about the amount of memory needed. Still, thank you for the very useful notebook.
I want to run a quantized version. This model runs out of memory in the free Google Colab.
I am still getting an out-of-memory error, even when using this notebook: https://colab.research.google.com/drive/1F2PeWl5FOHv4sjd7XTEu40JjqbFhC3LB?usp=sharing
Yes; I believe that downloading the model requires 15GB, which is more RAM than the free Colab account has. The memory gets freed up after the download, but it is needed for the download to succeed. Thank you.
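A rough way to sanity-check the 15GB figure is to multiply the parameter count by the bytes per parameter. The sketch below assumes roughly 7.24 billion parameters for Mistral-7B; that number is an approximation that does not come from this thread, and the result ignores any file-format or loading overhead.

```python
# Rough checkpoint-size arithmetic: parameters x bytes per parameter.
# ASSUMPTION: ~7.24e9 parameters for Mistral-7B (approximate, not from this thread).
APPROX_PARAM_COUNT = 7.24e9

# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weights_size_gb(param_count: float, bytes_per_param: float) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return param_count * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    size = weights_size_gb(APPROX_PARAM_COUNT, nbytes)
    print(f"{dtype:>17}: {size:5.1f} GB")
```

Under that assumption the 16-bit weights come to about 14.5 GB, which is consistent with the two files adding up to roughly 15GB.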
Thank me later.
You can load it using this on free colab.
!pip -q install git+https://github.com/huggingface/transformers # need to install from github
!pip -q install bitsandbytes accelerate xformers einops
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
)
text = "[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen! [INST] Do you have mayonnaise recipes? [/INST]"
encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
# move the inputs to the same device as the model before generating
model_inputs = encodeds.to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=200, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
@girrajjangid, it's using system RAM in Colab. Why? It runs out of memory with your code.
I’ve managed to load it onto the GPU in a free instance by using a copy of the model that is sharded into 2GB files: https://huggingface.co/someone13574/Mistral-7B-v0.1-sharded
Thank you everybody for your very helpful responses. I will be learning from them in the coming days. Much appreciated!
@someone13574 Would you be so kind as to point to the tools and/or code for how you carried out the sharding? I feel that it would be useful to a lot of people. Thank you very much in advance.
@alexsherstinsky Just a simple script using Hugging Face transformers. You do need enough RAM to fit the model, though.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("model/", low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map="cpu")
model.save_pretrained("sharded", max_shard_size="2GB")
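To illustrate what a shard-size limit does, here is a simplified sketch of greedy sharding: tensors are added to the current shard until the next one would exceed the limit, and then a new shard is started. This is only an illustration of the general idea and is not the transformers implementation; the function name and the tensor sizes are hypothetical.

```python
from typing import Dict, List


def greedy_shards(tensor_sizes: Dict[str, int], max_shard_bytes: int) -> List[Dict[str, int]]:
    """Split tensors into shards, in order, without exceeding max_shard_bytes.

    A single tensor larger than the limit still gets a shard of its own.
    """
    shards: List[Dict[str, int]] = [{}]
    current_bytes = 0
    for name, size in tensor_sizes.items():
        # Start a new shard if adding this tensor would exceed the limit.
        if shards[-1] and current_bytes + size > max_shard_bytes:
            shards.append({})
            current_bytes = 0
        shards[-1][name] = size
        current_bytes += size
    return shards


# Hypothetical example: ten "layers" of 600 bytes each, 2000-byte limit.
sizes = {f"layer_{i}.weight": 600 for i in range(10)}
shards = greedy_shards(sizes, max_shard_bytes=2000)
for index, shard in enumerate(shards, start=1):
    print(f"shard {index}: {len(shard)} tensors, {sum(shard.values())} bytes")
```

Because whole tensors are never split, shards usually end up somewhat below the limit, so the number of files depends on the tensor sizes as well as on the total checkpoint size.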
@someone13574 Thank you so much for this code and the explanation. Yes, this really helps -- because in terms of RAM, I can use my development machine, which is a Mac M1 Max with 64GB RAM, and then upload the results to HuggingFace -- and then go from there (I verified that your sharded model fits very comfortably in the free Google Colab tier instance with the T4 GPU). So thank you very much again for this. Cheers!
@someone13574 I have a follow up question for you, if I may, please. In the line model.save_pretrained() -- I do not see the path specified. Is it possible to save the sharded model to a HuggingFace location in my account? Thank you. Also, once I get this to work on my end, I would like to use it in a pull request to an open source library. What is the best way to acknowledge you for this idea? Thank you very much again.
@someone13574 Oh, I think we just say model.save_pretrained(path_on_hugging_face, max_shard_size="2GB") -- is this correct?
@alexsherstinsky model.save_pretrained() saves the model to the local path specified, which in the case of the code example I put above was a directory called sharded. Then you can just upload it manually from the website. I believe that you can push directly to huggingface using push_to_hub(), but I'm not sure if anything other than safetensors are supported, or if sharded models are supported. I just did it manually.
(Also, I don't need to be acknowledged.)
@someone13574 Got it -- really appreciate your explanation. Thanks a lot!
@someone13574 Sorry to bother you again: one more question, if I may, please. I tried your procedure, but ended up with only 7 (instead of 8) shard files. In addition, I did not get README in there. Even more importantly, I did not get the added tokens JSON file. Could you please share the code for how you added the special tokens
{
"</s>": 2,
"<s>": 1,
"<unk>": 0
}
to your tokenizer?
Just in case it helps, here is the code I used:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizerFast, MistralForCausalLM
# source repository and the target repository for the sharded copy
mistral_7b_original_base_model_name: str = "mistralai/Mistral-7B-v0.1"
mistral_7b_sharded_base_model_name: str = "alexsherstinsky/Mistral-7B-v0.1-sharded"
# load the original tokenizer and model
original_base_model_tokenizer: LlamaTokenizerFast = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=mistral_7b_original_base_model_name, trust_remote_code=True, padding_side='left')
original_base_model: MistralForCausalLM = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=mistral_7b_original_base_model_name, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto", low_cpu_mem_usage=True)
# save both with 2GB shards and push them to the Hub
original_base_model.save_pretrained(save_directory=f"/content/models/{mistral_7b_sharded_base_model_name}", max_shard_size="2GB", push_to_hub=True)
original_base_model_tokenizer.save_pretrained(save_directory=f"/content/models/{mistral_7b_sharded_base_model_name}", legacy_format=False, push_to_hub=True)
Thanks a lot again!
@alexsherstinsky I just copied over from the original model the files that are not saved by save_pretrained, such as the tokenizer and everything else other than the model weights and the weight index, as they should be the same whether or not the model is sharded.
@someone13574 Got it -- this was very helpful. Do you by chance know how we could have ended up with the different number of shard files? Or did you copy the original base model's one as well? Thanks a lot!
@alexsherstinsky Not sure what would have caused that.
@someone13574 No matter what I have tried, I am unable to shard the way you have -- I always get 7 instead of 8 files. I even tried to download your model and upload it sharded, and still get 7 files. And loading my sharded result fails with this exception:
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics)
315 module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
316 elif isinstance(value, torch.Tensor):
--> 317 new_value = value.to(device)
318 else:
319 new_value = torch.tensor(value, device=device)
NotImplementedError: Cannot copy out of meta tensor; no data!
Could the problem be the version of PyTorch and HuggingFace Transformers libraries that I am using and which may be wrong? Could you please tell me the versions of these packages that you are using?
I think that my PyTorch version is 2.0.1+cu118 and Transformers version is 4.35.0.dev0 -- could this be my issue?
Thanks a lot for your help.
@someone13574 I believe that I found the issue. It seems that the needed version of transformers should be 4.34.0. When I used that and did the sharding from my local machine (instead of Google Colab), everything worked properly. Sorry to have disturbed you with the messages, and thank you again for your help.
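Since the fix came down to the installed transformers version, a small check before sharding can save a failed run. The sketch below compares the numeric release part of a version string against 4.34.0; the helper names are hypothetical, and the strict equality check is only an assumption about what "supported" means here.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Leading numeric release part, e.g. '4.35.0.dev0' -> (4, 35, 0)."""
    match = re.match(r"\d+(\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def matches_required(installed: str, required: str) -> bool:
    """True if both version strings have the same numeric release part."""
    return release_tuple(installed) == release_tuple(required)


def installed_version(package: str) -> Optional[str]:
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
else:
    print(f"transformers {found}; matches 4.34.0: {matches_required(found, '4.34.0')}")
```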
Glad that you could make it work. The supported version at the moment is indeed 4.34.0. We've clarified the README in that regard!
@lerela Thank you for confirming. Perhaps Mistral AI would consider making the official model sharded in order to make it easier for new users to use it in low-resource hardware environments. Thanks again.
Can you please share the full colab example?
@Manu9000k I am preparing it and hope to share early next week; I will paste it here -- thank you for asking.
@Manu9000k Thank you very much! It is really my pleasure -- I am glad it helps -- enjoy!
Okay, now that the loading issue is out of the way, has anyone succeeded in fine-tuning Mistral through Google Colab?
I think you should be able to do that if you use the peft library. cc @ybelkada, we have tutorials for Llama, I think, no?
I've tried Llama and it worked with a few dirty fixes, but this one has a different tokenizer and attention mechanism (if I'm not mistaken), so I'm not sure if peft/bitsandbytes/SFTTrainer are good to go for that. (These are the components I'm using.)
@NPap Have you tried the procedure outlined in https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/13#65211649706c7551487c999a (an earlier message above)? Thank you.
Hi everyone,
You can indeed fine-tune Mistral-7B on a free tier google colab instance. I made a colab notebook for it here: https://colab.research.google.com/drive/1DNenc5BpdqaS10prtklYyIe9qW_7gUnb?usp=sharing
Make sure to use the sharded version of the model that I have pushed under my namespace here. The largest shard of mistralai/Mistral-7B-v0.1 is currently 10GB, which leads to CPU OOM if you try to load that checkpoint directly.
I recommend training your model with packing to avoid the issues presented in this GH thread: https://github.com/huggingface/transformers/issues/26498
Let us know here how the training goes. Fine-tuning the model on the entire guanaco dataset seems to take ~4 hours. This can be reduced further once torch.scaled_dot_product_attention is integrated in transformers core: https://github.com/huggingface/transformers/pull/26572
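For readers unfamiliar with packing, the idea is to concatenate tokenized examples, separated by an end-of-sequence token, into one stream and cut it into fixed-length blocks so that no padding is needed. The sketch below is a generic pure-Python illustration, not the notebook's or any library's implementation; the function name and token ids are hypothetical, except that EOS id 2 matches the `</s>` id quoted earlier in this thread.

```python
from typing import Iterable, List


def pack_sequences(tokenized: Iterable[List[int]], eos_id: int, block_size: int) -> List[List[int]]:
    """Concatenate examples (each followed by EOS) and split into full blocks.

    Leftover tokens that do not fill a complete block are dropped.
    """
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples
    n_full_blocks = len(stream) // block_size
    return [
        stream[i * block_size:(i + 1) * block_size]
        for i in range(n_full_blocks)
    ]


# Hypothetical token ids for three short examples.
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
for block in pack_sequences(examples, eos_id=2, block_size=4):
    print(block)
```

Every block has exactly `block_size` tokens, so batches contain no padding tokens at all.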
Thanks for sharing the notebook! I was wondering if you need to format the training dataset to conform to the following, as mentioned in https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe and https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1#instruction-format:
<s>[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>
Or does the tokenizer do that automatically?
Check the tokenizer's parameters; I think it's the eos, bos, and special tokens arguments. (That's for the <s> and </s>; I think you will need to put [INST] and [/INST] yourself.)
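If the [INST] and [/INST] markers do have to be added by hand, a small helper can build the text for each training example. This is a minimal sketch assuming the instruction format used in the example earlier in this thread; the function name is hypothetical, and the exact whitespace and whether to write the `<s>` string yourself (instead of letting the tokenizer add the BOS token) are assumptions to verify against the model card.

```python
from typing import List, Tuple

BOS, EOS = "<s>", "</s>"


def build_instruct_text(
    turns: List[Tuple[str, str]],
    next_user_message: str = "",
    add_bos: bool = True,
) -> str:
    """Format (user, assistant) turns with [INST] ... [/INST] markers.

    Each completed assistant reply is closed with EOS. An optional final
    user message is left open so the model can generate the reply.
    """
    text = BOS if add_bos else ""
    for user, assistant in turns:
        text += f"[INST] {user} [/INST]{assistant}{EOS}"
    if next_user_message:
        text += f"[INST] {next_user_message} [/INST]"
    return text


print(build_instruct_text(
    [("What is your favourite condiment?", "Fresh lemon juice.")],
    next_user_message="Do you have mayonnaise recipes?",
))
```

Set `add_bos=False` if the tokenizer is already configured to prepend the BOS token, to avoid ending up with it twice.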
In addition to this,
Try training it on Kaggle; it worked for me with a training batch size of 2.
In Colab, what worked for me was decreasing the max_sequence_length to 256 and turning on double quantization with a batch size of 1 (GPU RAM 10.5 / 15.0 GB).
Please check the table at the end here to make an informed assumption: https://huggingface.co/blog/4bit-transformers-bitsandbytes
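To get a feel for what double quantization saves, the weight storage cost can be estimated per parameter. The sketch below uses the figures I recall from the QLoRA paper (block size 64 with 32-bit constants, and a second-level block size of 256 with 8-bit constants); treat these and the ~7.24 billion parameter count as assumptions. It also ignores activations, layers kept in higher precision, adapters, and optimizer state, so actual GPU usage during training is much higher.

```python
def nf4_bits_per_param(
    double_quant: bool,
    block_size: int = 64,          # ASSUMPTION: weights per quantization block
    second_block_size: int = 256,  # ASSUMPTION: constants per second-level block
) -> float:
    """Approximate storage cost per weight for 4-bit quantization, in bits."""
    if double_quant:
        # 8-bit quantized constants plus 32-bit second-level constants.
        overhead = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One 32-bit constant per block of weights.
        overhead = 32 / block_size
    return 4 + overhead


def quantized_weights_gb(param_count: float, double_quant: bool) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return param_count * nf4_bits_per_param(double_quant) / 8 / 1e9


APPROX_PARAM_COUNT = 7.24e9  # ASSUMPTION: approximate size of Mistral-7B

for flag in (False, True):
    bits = nf4_bits_per_param(flag)
    size = quantized_weights_gb(APPROX_PARAM_COUNT, flag)
    print(f"double_quant={flag}: {bits:.3f} bits/param, about {size:.2f} GB")
```

Under these assumptions double quantization saves roughly a third of a gigabyte on the weights, which can matter when the GPU is close to its limit.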