How to fully fine-tune Mixtral 8x7B without using any adapters?

#52
by cuongk14 - opened

Hi,
I have tried different approaches (DeepSpeed and FSDP) on a cluster with 10 A100 80GB GPUs, but they always end in out-of-memory errors. Has anyone here successfully fine-tuned the model without using a popular adapter method such as QLoRA? The following is my DeepSpeed config:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu"
    },
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  }
}
```

Below is a simple script, train.py, which uses a batch size of 1 and a very small context length for testing:

```python
import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
from transformers import TrainerCallback
import torch
import transformers
from trl import SFTTrainer

def main():
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-step-50K-105b')
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    # Load dataset from the Hugging Face datasets library
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
    # Tokenize the texts
    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    # Load the data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # TinyLlama uses a causal (not masked) language model, similar to GPT-2
    )
    # Load the model
    model = AutoModelForCausalLM.from_pretrained('mistralai/Mixtral-8x7B-v0.1',
                                                 torch_dtype=torch.bfloat16
                                                 )
    model.resize_token_embeddings(len(tokenizer))
  
    # Define the training arguments
    training_args = TrainingArguments(
        optim="adamw_torch",
        save_strategy="epoch",
        output_dir="./result",
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=1,
        save_steps=10_000,
        save_total_limit=2,
        fp16=True, 
        deepspeed="zero3.json",  # Path to DeepSpeed config file        
        gradient_checkpointing=True,
        report_to='wandb',
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        dataset_text_field="text",
        train_dataset= dataset["train"],
        eval_dataset= dataset["validation"]
    )

    trainer.train()


if __name__ == "__main__":
    main()
```


Here is the command I use to run train.py:
`deepspeed --include localhost:0,1,2,3,4,5,6,7,8,9 train.py`
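
For a sense of scale, here is a rough back-of-the-envelope estimate of the model-state memory for this setup. It is only a sketch: the ~46.7B total parameter count for Mixtral 8x7B and the 16 bytes per parameter accounting for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights, momentum, and variance) are assumptions not stated in the post, and activations, communication buffers, and fragmentation are ignored.

```python
# Rough model-state memory estimate for full fine-tuning with mixed-precision Adam.
# Assumptions (not from the post): ~46.7e9 total parameters for Mixtral 8x7B and
# the common 16 bytes/parameter accounting listed below.

N_PARAMS = 46.7e9   # assumed total parameter count
N_GPUS = 10         # from the post: 10 A100 GPUs
GPU_MEM_GB = 80     # from the post: 80 GB per GPU

BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_grads": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def model_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Return weights + gradients + optimizer state in GB (1 GB = 1e9 bytes)."""
    return n_params * sum(bytes_per_param.values()) / 1e9


total_gb = model_state_gb(N_PARAMS)
per_gpu_gb = total_gb / N_GPUS          # ZeRO-3 style even sharding, no offload
headroom_gb = GPU_MEM_GB - per_gpu_gb   # what is left for activations and buffers

print(f"total model states: {total_gb:.1f} GB")
print(f"per GPU when sharded over {N_GPUS} GPUs: {per_gpu_gb:.1f} GB")
print(f"headroom per GPU: {headroom_gb:.1f} GB")
```

Under these assumptions the model states alone come to about 747 GB, or roughly 75 GB per GPU when sharded evenly across 10 GPUs without offloading, which leaves only about 5 GB per A100 for everything else. With CPU offloading, the offloaded states have to fit in host RAM instead.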

Hi @cuongk14 !
You might be interested in the GaLore algorithm (https://huggingface.co/docs/transformers/v4.40.0/en/trainer#galore), which enables memory-efficient full-parameter pre-training and fine-tuning. Note that it does not support DeepSpeed yet.
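
As a minimal sketch of what the linked docs describe, GaLore is selected through the `optim` and `optim_target_modules` arguments of `TrainingArguments`. The other values are copied from the script above. The `"block_sparse_moe"` pattern is an assumption about how the Mixtral expert layers are named and should be checked against `model.named_modules()`. The helper `build_training_args` is hypothetical, and GaLore also needs the `galore-torch` package installed.

```python
# Keyword arguments for transformers.TrainingArguments that switch the optimizer
# to GaLore. There is deliberately no "deepspeed" entry, since GaLore does not
# support DeepSpeed yet.
galore_kwargs = {
    "output_dir": "./result",
    "per_device_train_batch_size": 1,
    "gradient_checkpointing": True,
    "optim": "galore_adamw",
    # Name patterns of the modules GaLore should be applied to.
    # "attn" follows the example in the linked docs; "block_sparse_moe" is an
    # assumption about Mixtral's module naming.
    "optim_target_modules": ["attn", "block_sparse_moe"],
}


def build_training_args(kwargs=None):
    """Hypothetical helper: build TrainingArguments from the settings above."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**(kwargs or galore_kwargs))
```

The result of `build_training_args()` would be passed as `args=` to the trainer in place of the `training_args` in the original script.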
