Selective fine-tuning of Language Models with Spectrum

Community Article Published September 3, 2024

Spectrum is a new technique that identifies the most informative layers in a Language Model. Based on this analysis, you can selectively fine-tune only a fraction of the model, optimizing training efficiency.

In this article, we'll introduce Spectrum and demonstrate how to apply it by fine-tuning Phi-3.5-mini-instruct to enhance its performance in Italian, using Hugging Face TRL. The resulting model is 💬🇮🇹 Phi-3.5-mini-ITA.

This article provides a complete walkthrough; for just the code, refer to the training notebook.

🎯 Spectrum

Intuition

When we mention "layers" in this article, we're not talking about the higher-level Transformer layers (model.layers.0, model.layers.1, ...). Instead, we're referring to the lower-level layers (model.layers.0.mlp.down_proj, model.layers.0.self_attn.o_proj, ...), each associated with a specific weight matrix.

Recently, several techniques have emerged to fine-tune Language Models efficiently, saving computational resources and time.

A very popular method is QLoRA, which quantizes the original model and trains low-rank adapters on top of it. This approach gives impressive results (only slightly worse than full fine-tuning) while using only a fraction of the GPU resources.

However, QLoRA applies Low-Rank Adaptation uniformly across the entire model.

What if we could identify the most informative layers and only fine-tune those?

This is exactly what Spectrum does!

  • Spectrum analyzes the weight matrices of all layers in a Language Model and calculates a Signal-to-Noise Ratio (SNR) for each one.
  • It uses Random Matrix Theory and the Marchenko-Pastur distribution to distinguish signal from noise.
  • Based on a chosen percentage (say, 25%), Spectrum selects the most informative layers of each type (e.g., mlp.down_proj, self_attn.o_proj, etc.).
  • You can then freeze the entire model except for these selected layers and focus your fine-tuning on them.
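The selection step described in the last two bullets can be sketched in plain Python. This is a simplified illustration and not Spectrum's actual code: the SNR values are made up, `select_top_layers` is a hypothetical helper, and the rounding rule (round down, keep at least one layer per type) is an assumption.

```python
import math
import re
from collections import defaultdict

# Hypothetical SNR scores, invented for illustration only.
snr_by_layer = {
    "model.layers.0.mlp.down_proj": 0.8,
    "model.layers.1.mlp.down_proj": 2.1,
    "model.layers.2.mlp.down_proj": 1.4,
    "model.layers.3.mlp.down_proj": 0.3,
    "model.layers.0.self_attn.o_proj": 3.0,
    "model.layers.1.self_attn.o_proj": 0.9,
    "model.layers.2.self_attn.o_proj": 1.7,
    "model.layers.3.self_attn.o_proj": 2.2,
}

def layer_type(name):
    """Strip the 'model.layers.<index>.' prefix to get the layer type."""
    return re.sub(r"^model\.layers\.\d+\.", "", name)

def select_top_layers(snr_scores, top_percent):
    """Keep the top `top_percent` % of layers within each layer type."""
    groups = defaultdict(list)
    for name, snr in snr_scores.items():
        groups[layer_type(name)].append((snr, name))

    selected = {}
    for ltype, scored in groups.items():
        # Assumption: round down, but always keep at least one layer per type.
        keep = max(1, math.floor(len(scored) * top_percent / 100))
        ranked = sorted(scored, reverse=True)  # highest SNR first
        selected[ltype] = [name for _, name in ranked[:keep]]
    return selected

selected = select_top_layers(snr_by_layer, top_percent=50)
for ltype, names in selected.items():
    print(ltype, names)
```

Ranking within each layer type, rather than across the whole model, ensures that every kind of weight matrix gets some trainable layers.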

Evaluations and results

In the paper, the authors fine-tuned Llama-3-8B and Mistral-7B-v0.1 on the airoboros-3.1 dataset using Spectrum-50 and Spectrum-25, and compared the results with full fine-tuning and QLoRA.

Spectrum is competitive with full fine-tuning and beats QLoRA on benchmark performance.

On a single GPU, QLoRA is more memory-efficient, while Spectrum shines in distributed training setups (DeepSpeed ZeRO-3 and FSDP).

Several impressive Language Models were trained using this technique: various Dolphin models, Llama 3.1 Storm, numerous models by VAGO Solutions...

🇮🇹 Fine-tune Phi 3.5 mini with Spectrum and TRL

Use case

Let's apply Spectrum to a specific use case: improving the Italian performance of Phi-3.5-mini-instruct. This is a good small Language Model (3.82 B parameters) and it already performs decently in Italian.

To evaluate its Italian language capabilities, we refer to the Open ITA LLM Leaderboard, a community-driven project maintained by Samuele Colombo and Alessandro Ercolani. This leaderboard uses the lm-evaluation-harness framework to assess models on three benchmarks: MMLU_IT, ARC_IT, and HELLASWAG_IT.

We will use Spectrum to select the most informative layers and then train them using the Hugging Face TRL library. Spectrum is compatible out of the box with Axolotl, but manually applying the layer selection with TRL is a good learning experience. Plus, TRL is a great project.

For this experiment, I'll be using a single NVIDIA A6000 GPU (48 GB VRAM), but you can adapt this to smaller GPUs by playing around with gradient accumulation.

Setup

First, let's install the necessary libraries.

pip install datasets transformers trl accelerate scipy

To speed up training, we'll also install FlashAttention, which requires a recent NVIDIA GPU (Ampere architecture or newer).

pip install ninja packaging
MAX_JOBS=6 pip install flash-attn --no-build-isolation --upgrade

Data preparation

For improving models on non-English languages, incorporating both English and the target language in the training data can be beneficial. This has been demonstrated by models from VAGO Solutions and LLaMAntino-3.

We will use a mix of good English and Italian instruct/chat data: mlabonne/FineTome-100k + efederici/capybara-claude-15k-ita.

Steps:

  • Adapt the datasets to a common format.
  • Apply the Phi 3.5 mini chat template.
  • Create a unified dataset and reserve a small fraction for evaluation.

from datasets import load_dataset, Dataset, concatenate_datasets
from transformers import AutoTokenizer
import multiprocessing

# Load and process FineTome dataset
finetome_ds = load_dataset("mlabonne/FineTome-100k")["train"]
# Map ShareGPT-style roles to the role names expected by the chat template
role_mapping = {"human": "user", "gpt": "assistant"}

def process_conversation(row):
    # Rename the keys and remap only the role, leaving the message text untouched
    new_conv = [
        {"role": role_mapping.get(msg["from"], msg["from"]), "content": msg["value"]}
        for msg in row["conversations"]
    ]
    return {"conversations": new_conv}

finetome_ds = Dataset.from_list([process_conversation(row) for row in finetome_ds])

# Load tokenizer and define template function
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct", trust_remote_code=True)

def apply_template(examples):
    text = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False) for msg in examples["conversations"]]
    return {"text": text}

finetome_ds = finetome_ds.map(apply_template, batched=True).remove_columns("conversations").shuffle(seed=42)
finetome_ds = finetome_ds.add_column("origin", ["finetome"] * len(finetome_ds))

# Load and process Capybara Claude dataset
capyclaude_ds = load_dataset("efederici/capybara-claude-15k-ita", split="train")
capyclaude_ds = capyclaude_ds.map(apply_template, batched=True).remove_columns(["conversations", "hash"]).shuffle(seed=42)
capyclaude_ds = capyclaude_ds.add_column("origin", ["capyclaude"] * len(capyclaude_ds))

# Concatenate and split datasets
mixed_ds = concatenate_datasets([finetome_ds, capyclaude_ds]).shuffle(seed=42)
mixed_ds = mixed_ds.class_encode_column("origin").train_test_split(test_size=0.005, stratify_by_column="origin")
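To make the first step (adapting the datasets to a common format) concrete, here is a small sketch that runs the conversion on a made-up conversation, without downloading anything. The helper name `to_chat_format` and the sample row are hypothetical. This variant remaps only the `from` field, so the message text is never altered.

```python
# ShareGPT-style role names mapped to the names used by chat templates.
role_mapping = {"human": "user", "gpt": "assistant"}

def to_chat_format(conversation):
    """Convert [{'from': ..., 'value': ...}] into [{'role': ..., 'content': ...}]."""
    return [
        {"role": role_mapping.get(msg["from"], msg["from"]), "content": msg["value"]}
        for msg in conversation
    ]

# Made-up conversation, for illustration only.
sample = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Ciao!"},
    {"from": "gpt", "value": "Ciao! Come posso aiutarti?"},
]

converted = to_chat_format(sample)
for message in converted:
    print(message)
```

Roles that are not in the mapping (such as `system`) pass through unchanged.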

We can then check an example to see how it looks:

# mixed_ds["train"][587]

{'text': '<|system|>\nYou are a helpful assistant, with no access to external functions.<|end|>\n<|user|>\nEdit the following sentence to make the tense of the verb consistent.\nHe had gone to the store yesterday evening.<|end|>\n<|assistant|>\nHe went to the store yesterday evening.<|end|>...|endoftext|>',
 'origin': 1}

max_seq_length

Later, we'll need to set a max_seq_length value, which indicates the maximum sequence length to be considered during training. Longer examples will be truncated.

It is important to choose this value wisely, so that we don't cut off too much relevant information, but also don't waste GPU resources.

Let's see what happens if we set max_seq_length to 2048.

from scipy.stats import percentileofscore
import multiprocessing

def calculate_lengths(batch):
    return {"conv_lengths": [len(tokenizer(text)["input_ids"]) for text in batch["text"]]}

conv_lengths = mixed_ds["train"].map(
    calculate_lengths,
    batched=True,
    batch_size=1000,
    num_proc=multiprocessing.cpu_count()
)["conv_lengths"]

chosen_length=2048

percentile = percentileofscore(conv_lengths, chosen_length)
print(percentile)
# 91.91453560724239

By choosing a maximum length of 2048, we'll only truncate about 8% of our examples. Fine!
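To compare several candidate values at once, the same check can be written in plain Python. The sketch below uses made-up token counts (in the article they come from `conv_lengths`), and `truncated_fraction` is a hypothetical helper.

```python
from bisect import bisect_right

def truncated_fraction(lengths, max_len):
    """Fraction of examples longer than max_len (these would be truncated)."""
    ordered = sorted(lengths)
    kept = bisect_right(ordered, max_len)  # examples with length <= max_len
    return 1 - kept / len(ordered)

# Made-up token counts, for illustration only.
toy_lengths = [120, 340, 512, 800, 1024, 1500, 2048, 2300, 4096, 9000]

for candidate in (1024, 2048, 4096):
    print(candidate, f"{truncated_fraction(toy_lengths, candidate):.0%}")
```

A larger value truncates fewer examples but increases the memory needed per sequence, so the goal is the smallest length that keeps the truncated share acceptable.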

Load the original model

Next, let's load the original model we'll be training.

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    use_cache=False,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

This code is adapted from the official Phi-3.5-mini-instruct fine-tuning example.

  • use_cache is set to False: cache is helpful at inference time, but wastes memory during training (resources: #1, #2, #3).

  • trust_remote_code is set to True: with transformers==4.44.2, this is needed to incorporate a minor bug fix in Phi3ForCausalLM. Read this discussion for more details.

  • During training, pad_token is set to unk instead of eos token to prevent endless generation. This change must be reverted after training.

  • At training time, tokenizer.padding_side is set to right (required by TRL SFTTrainer). This change must be reverted after training: for generation, tokenizer.padding_side must be set to left.
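The difference between right and left padding can be illustrated with plain lists of token ids. This is a toy sketch, not what the tokenizer does internally: `pad_batch` is a hypothetical helper and the pad id `0` is made up.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-id lists to the same length on the 'left' or 'right'."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        padded.append(seq + padding if side == "right" else padding + seq)
    return padded

PAD = 0  # hypothetical pad token id
batch = [[11, 12, 13], [21, 22]]

right = pad_batch(batch, PAD, side="right")
left = pad_batch(batch, PAD, side="left")
print(right)  # [[11, 12, 13], [21, 22, 0]]
print(left)   # [[11, 12, 13], [0, 21, 22]]
```

With left padding, the last position of every row holds a real token, so generation continues from the prompt rather than from padding. This is why the setting is switched back to left after training.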

Identify layers to train with Spectrum

Now, let's figure out which layers we want to train using Spectrum.

Since the official Spectrum script doesn't work in notebook environments, you'll need to run it in a shell.

First, we install Spectrum:

git clone https://github.com/cognitivecomputations/spectrum.git
cd spectrum
pip install -r requirements.txt

Then we launch the script:

python spectrum.py --model-name <insert local or HF repo here> --top-percent <top % of snr ratios to target>

If someone has already scanned the model and uploaded the results to the Spectrum repo, you're in luck: you can immediately get a YAML file with the parameters to train.

Otherwise, like in our experiment, we need to scan the model ourselves. For our experiment, we're targeting the top 30% of model layers.

python spectrum.py --model-name microsoft/Phi-3.5-mini-instruct --top-percent 30

We will be asked for a batch size for the scan (default is 1).

Then we will be asked which layer types to scan. The authors recommend at least selecting the MLP and Attention layers, which we'll do here.

The computation takes less than 2 minutes for our model (3.82 B parameters) on an A6000 GPU with a batch size of 1.

We end up with a YAML file listing the top 30% of the most informative layers.

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.2.mlp.down_proj
- model.layers.3.mlp.down_proj
...
# mlp.gate_up_proj layers
- model.layers.31.mlp.gate_up_proj
- model.layers.4.mlp.gate_up_proj
...
# self_attn.o_proj layers
- model.layers.0.self_attn.o_proj
- model.layers.1.self_attn.o_proj
...
# self_attn.qkv_proj layers
- model.layers.23.self_attn.qkv_proj
- model.layers.24.self_attn.qkv_proj
...

This YAML file can be used directly in Axolotl.

With TRL, we need to take a few more manual steps.

We load the YAML file, define a simple freeze_and_unfreeze_parameters utility function and apply it to our model.

We are freezing all the model parameters and unfreezing those selected by Spectrum.

import re

with open("snr_results_microsoft-Phi-3.5-mini-instruct_unfrozenparameters_30percent.yaml", "r") as fin:
    yaml_parameters = fin.read()

unfrozen_parameters = []
for line in yaml_parameters.splitlines():
    # keep only list entries ("- <pattern>"), skipping the header and comment lines
    if line.startswith("- "):
        unfrozen_parameters.append(line[2:].strip())

def freeze_and_unfreeze_parameters(model, unfrozen_parameters):
    # freeze all parameters
    for param in model.parameters():
        param.requires_grad = False
    # unfreeze Spectrum parameters
    for name, param in model.named_parameters():
        if any(re.match(unfrozen_param, name) for unfrozen_param in unfrozen_parameters):
            param.requires_grad = True

freeze_and_unfreeze_parameters(model, unfrozen_parameters)

# let's do a quick sanity check
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.requires_grad)

# model.embed_tokens.weight True
# model.layers.0.self_attn.o_proj.weight True
# model.layers.1.self_attn.o_proj.weight True
# model.layers.1.mlp.down_proj.weight True
# ...

Everything looks good, and we're almost ready to start training our model.
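To see how the patterns from the YAML file match parameter names, here is a standalone sketch that needs no model. The YAML excerpt and the parameter names are shortened, made-up samples in the same format, and `is_trainable` is a hypothetical helper.

```python
import re

# Shortened, made-up excerpt in the same format as the Spectrum output.
yaml_text = """unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.2.mlp.down_proj
# self_attn.o_proj layers
- model.layers.0.self_attn.o_proj
"""

patterns = [line[2:].strip() for line in yaml_text.splitlines() if line.startswith("- ")]

def is_trainable(param_name):
    # re.match anchors only at the start of the name, so a pattern without
    # a trailing "$" also covers the ".weight" suffix of the parameter.
    return any(re.match(pattern, param_name) for pattern in patterns)

names = [
    "lm_head.weight",
    "model.layers.2.mlp.down_proj.weight",
    "model.layers.20.mlp.down_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.self_attn.qkv_proj.weight",
]
for name in names:
    print(f"{name}: {is_trainable(name)}")
```

Note that the pattern for layer 2 does not accidentally match layer 20: after the `2`, the pattern's `.` consumes the `0`, and the following `mlp` then fails to match `.mlp`.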

Configure TRL SFTTrainer and train!

To perform Supervised Fine Tuning, TRL offers the SFTTrainer. Let's configure it.

from trl import SFTConfig, SFTTrainer

new_model_id="anakin87/Phi-3.5-mini-ITA"

cfg = SFTConfig(
    output_dir="./mymodel",
    overwrite_output_dir=True,
    hub_model_id=new_model_id,
    hub_strategy="every_save",
    save_strategy="steps",
    save_steps=500,
    save_total_limit=1,
    push_to_hub=True,
    logging_steps=20,
    max_seq_length=2048,
    dataset_text_field="text",
    remove_unused_columns=True,
    packing=True,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.2,
    bf16=True,
    tf32=True,
    learning_rate=5.0e-06,
    per_device_train_batch_size=8,
)

sft_trainer = SFTTrainer(
    model=model,
    args=cfg,
    train_dataset=mixed_ds["train"],
    tokenizer=tokenizer
)

Here's a quick overview of the key configurations:

  • max_seq_length=2048: Explained earlier.
  • dataset_text_field="text": The name of the text field in our prepared dataset.
  • packing=True: This enables example packing, where multiple short examples are packed into the same input sequence to increase training efficiency.
  • learning_rate=5.0e-06: This is lower than the usual learning rate for instruction fine-tuning. The value is taken from the official Phi-3.5-mini-instruct fine-tuning example. It may be related to the fact that this model is already fine-tuned. I've personally found that higher learning rates (like 2e-5) can lead to performance degradation with this model.
  • per_device_train_batch_size=8: This is set to fully utilize the 48 GB of VRAM on our A6000 GPU. If you're using a smaller GPU, use gradient accumulation to reduce memory usage while keeping the same effective batch size. For example, per_device_train_batch_size=2 with gradient_accumulation_steps=4 still gives an effective batch size of 8, at the cost of more forward/backward passes per optimizer step.
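Two of the settings above, the batch size and packing, can be illustrated with plain Python. This is a simplified picture and not TRL's implementation: `effective_batch_size` and `pack_examples` are hypothetical helpers, and the token ids and separator id are made up.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# Configuration used in the article: one GPU, no accumulation.
print(effective_batch_size(8, 1))  # 8
# Lower-memory alternative: smaller micro-batches accumulated over 4 steps.
print(effective_batch_size(2, 4))  # 8

def pack_examples(tokenized_examples, seq_length, eos_id):
    """Concatenate examples (separated by eos_id) and cut fixed-size chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens + [eos_id])
    # Drop the incomplete tail so every chunk has exactly seq_length tokens.
    usable = len(stream) - len(stream) % seq_length
    return [stream[i:i + seq_length] for i in range(0, usable, seq_length)]

EOS = 0  # hypothetical separator token id
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_examples(examples, seq_length=4, eos_id=EOS)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Because packed sequences contain no padding, every position in a batch contributes to the loss, which is where the efficiency gain comes from.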

Now, let's launch the training process:

sft_trainer.train()

As we mentioned earlier, some tokenizer configurations need to be reverted after training:

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
tokenizer.padding_side = 'left'

tokenizer.push_to_hub(new_model_id)

Results

The loss curve of the model looks good.

For a vibe check, you can try the model here: https://huggingface.co/spaces/anakin87/Phi-3.5-mini-ITA. While our fine-tuning was focused on improving Italian performance, the model is multilingual and can handle English as well.

Official benchmark results can be found on the Open ITA LLM Leaderboard.

| Model | Parameters | Average | MMLU_IT | ARC_IT | HELLASWAG_IT |
|---|---|---|---|---|---|
| anakin87/Phi-3.5-mini-ITA | 3.82 B | 57.67 | 59.93 | 51.5 | 61.57 |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 8.03 B | 56.97 | 58.43 | 48.42 | 64.07 |
| microsoft/Phi-3.5-mini-instruct | 3.82 B | 56.82 | 60.03 | 49.19 | 61.25 |

In short, our model's average score in Italian improved over the base model (57.67 vs. 56.82), so we can consider this experiment a success! 🎉

Training took about 14 hours on a single A6000 GPU.

Based on other experiments I've done, I found similar results with just one epoch of training (versus two) and when selecting the top 25% of layers with Spectrum (versus 30%).

Conclusion

This article provided an overview of Spectrum, a technique for selecting the most informative layers of a Language Model. The parameters identified by Spectrum can be used for selective fine-tuning, leading to more efficient training that requires less time and fewer resources compared to full fine-tuning.

We then demonstrated a practical use case by fine-tuning Phi-3.5-mini-instruct using Spectrum and TRL on a mix of English and Italian data. The resulting model, Phi-3.5-mini-ITA, shows improved performance in Italian.

If you enjoyed this article, feel free to follow me on Hugging Face and LinkedIn. If you notice any errors or inaccuracies, don't hesitate to reach out.

Main References