Model Card for mamba-2.8b-slimpj-OpenOrca_1ep

This is a finetune of mamba-2.8b-slimpj for instruction following using the OpenOrca dataset.

Model Details

Model Description

This model is a fine-tune of the Mamba reference model mamba-2.8b-slimpj from the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (https://arxiv.org/abs/2312.00752).

It was fine-tuned for instruction following on the OpenOrca dataset for 1 epoch.

Uses

This model is intended to evaluate fine-tuning results on mamba models.

Usage

Prompt structure

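The prompt structure used in fine-tuning is an Alpaca-style format, where %question% and %response% are placeholders for the user question and the model answer: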

"### Human:\n%question%\n\n### AI response:\n%response%"

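The template above can be filled in programmatically. The following is a minimal sketch, not the code used during training: the helper names `build_prompt` and `build_training_example` are hypothetical, and it assumes that at inference time the prompt should end right after the response header so the model continues from there.

```python
# Template copied from the prompt structure above; the placeholders
# %question% and %response% are expressed as Python format fields.
PROMPT_HEADER = "### Human:\n{question}\n\n### AI response:\n"


def build_prompt(question: str) -> str:
    """Return an inference prompt that ends right after the response header."""
    return PROMPT_HEADER.format(question=question)


def build_training_example(question: str, response: str) -> str:
    """Return a full question/response pair in the fine-tuning format."""
    return build_prompt(question) + response


if __name__ == "__main__":
    print(build_prompt("What is a state space model?"))
```

Keeping the header text byte-for-byte identical to the fine-tuning format is likely to matter, since the model only saw this exact structure during training.
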
Training Details

Training Data

https://huggingface.co/datasets/Open-Orca/OpenOrca

Training Procedure

Trained using text-generation-webui with code from the mamba_ssm pull request.

Training Hyperparameters

  • Training regime: Trained in bfloat16 with the following parameters:
{
  "trained_model_name": "mamba-2.8b-slimpj-OpenOrc_1ep",
  "save_steps": 500000.0,
  "micro_batch_size": 4,
  "batch_size": 128,
  "epochs": 1.0,
  "learning_rate": "3e-4",
  "lr_scheduler_type": "linear",
  "cutoff_len": 256,
  "dataset": "OpenOrca",
  "eval_dataset": "None",
  "format": "openorca-format",
  "warmup_steps": 100.0,
  "optimizer": "paged_adamw_8bit",
  "hard_cut_string": "\\n\\n\\n",
  "add_eos_token": false,
  "min_chars": 0.0
}
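
To make the relationship between some of these parameters concrete, here is a small sketch. It assumes that the effective batch of 128 is reached by accumulating gradients over micro-batches of 4, and that "linear" means linear warmup followed by linear decay to zero, as in common trainer implementations. The function `linear_lr` and its `total_steps` argument are hypothetical; the total number of optimizer steps is not reported in this card.

```python
# Values copied from the hyperparameter listing above.
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
LEARNING_RATE = 3e-4
WARMUP_STEPS = 100

# Assumption: the effective batch is built by gradient accumulation.
GRAD_ACCUM_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE


def linear_lr(step: int, total_steps: int,
              peak_lr: float = LEARNING_RATE,
              warmup: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup)


if __name__ == "__main__":
    print("gradient accumulation steps:", GRAD_ACCUM_STEPS)
    # total_steps=1000 is an illustrative value, not the real step count.
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_lr(step, total_steps=1000))
```
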

Reported train_loss was 0.6762700151924311

Results

lm-evaluation-harness results for final model

mamba_ssm (pretrained=mamba-2.8b-slimpj-OpenOrca), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (32)

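| Tasks          | Version | Filter | n-shot | Metric     | Value   | Stderr   |
|----------------|---------|--------|--------|------------|---------|----------|
| arc_challenge  | 1       | none   | 0      | acc        | 0.2594  | ± 0.0128 |
|                |         | none   | 0      | acc_norm   | 0.2935  | ± 0.0133 |
| arc_easy       | 1       | none   | 0      | acc        | 0.4390  | ± 0.0102 |
|                |         | none   | 0      | acc_norm   | 0.4032  | ± 0.0101 |
| boolq          | 2       | none   | 0      | acc        | 0.5801  | ± 0.0086 |
| lambada_openai | 1       | none   | 0      | perplexity | 27.8582 | ± 1.1183 |
|                |         | none   | 0      | acc        | 0.3683  | ± 0.0067 |
| openbookqa     | 1       | none   | 0      | acc        | 0.2500  | ± 0.0194 |
|                |         | none   | 0      | acc_norm   | 0.3700  | ± 0.0216 |
| piqa           | 1       | none   | 0      | acc        | 0.6817  | ± 0.0109 |
|                |         | none   | 0      | acc_norm   | 0.6839  | ± 0.0108 |
| winogrande     | 1       | none   | 0      | acc        | 0.5770  | ± 0.0139 |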

lm-evaluation-harness results after half an epoch

mamba_ssm (pretrained=mamba-2.8b-slimpj-OpenOrca_1ep-checkpoints/checkpoint-500000), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (32)

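| Tasks          | Version | Filter | n-shot | Metric     | Value   | Stderr   |
|----------------|---------|--------|--------|------------|---------|----------|
| arc_challenge  | 1       | none   | 0      | acc        | 0.2602  | ± 0.0128 |
|                |         | none   | 0      | acc_norm   | 0.2833  | ± 0.0132 |
| arc_easy       | 1       | none   | 0      | acc        | 0.4533  | ± 0.0102 |
|                |         | none   | 0      | acc_norm   | 0.4125  | ± 0.0101 |
| boolq          | 2       | none   | 0      | acc        | 0.4095  | ± 0.0086 |
| lambada_openai | 1       | none   | 0      | perplexity | 30.4832 | ± 1.2403 |
|                |         | none   | 0      | acc        | 0.3551  | ± 0.0067 |
| openbookqa     | 1       | none   | 0      | acc        | 0.2420  | ± 0.0192 |
|                |         | none   | 0      | acc_norm   | 0.3640  | ± 0.0215 |
| piqa           | 1       | none   | 0      | acc        | 0.6812  | ± 0.0109 |
|                |         | none   | 0      | acc_norm   | 0.6730  | ± 0.0109 |
| winogrande     | 1       | none   | 0      | acc        | 0.5588  | ± 0.0140 |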

Reference lm-evaluation-harness results for the base model mamba-2.8b-slimpj without fine-tuning

mamba_ssm (pretrained=mamba-2.8b-slimpj), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (32)

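| Tasks          | Version | Filter | n-shot | Metric     | Value   | Stderr   |
|----------------|---------|--------|--------|------------|---------|----------|
| arc_challenge  | 1       | none   | 0      | acc        | 0.3882  | ± 0.0142 |
|                |         | none   | 0      | acc_norm   | 0.4155  | ± 0.0144 |
| arc_easy       | 1       | none   | 0      | acc        | 0.7264  | ± 0.0091 |
|                |         | none   | 0      | acc_norm   | 0.6814  | ± 0.0096 |
| boolq          | 2       | none   | 0      | acc        | 0.7107  | ± 0.0079 |
| lambada_openai | 1       | none   | 0      | perplexity | 5.8770  | ± 0.1881 |
|                |         | none   | 0      | acc        | 0.6427  | ± 0.0067 |
| openbookqa     | 1       | none   | 0      | acc        | 0.2860  | ± 0.0202 |
|                |         | none   | 0      | acc_norm   | 0.3980  | ± 0.0219 |
| piqa           | 1       | none   | 0      | acc        | 0.7709  | ± 0.0098 |
|                |         | none   | 0      | acc_norm   | 0.7813  | ± 0.0096 |
| winogrande     | 1       | none   | 0      | acc        | 0.6614  | ± 0.0133 |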

Summary

The model's measured perplexity and accuracy got worse compared to the base model on every task, which is a known possible effect of fine-tuning. Perplexity and most accuracy scores improved in the second half of training (for example, lambada_openai perplexity fell from 30.4832 to 27.8582 and boolq accuracy rose from 0.4095 to 0.5801), so it is likely that the initial worsening was caused by forcing a prompt structure onto the base model, which was trained only on unstructured text.

The answer quality as perceived by users is yet to be evaluated.
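
The comparison behind this summary can be reproduced from the reported numbers. The sketch below only re-reads the `acc` values from the three result tables (base model, half-epoch checkpoint, final model) and lists which tasks improved between the half-epoch checkpoint and the final model. The variable names are hypothetical and no new measurements are involved.

```python
# (base, half-epoch checkpoint, final) "acc" values copied from the tables above.
ACC = {
    "arc_challenge": (0.3882, 0.2602, 0.2594),
    "arc_easy": (0.7264, 0.4533, 0.4390),
    "boolq": (0.7107, 0.4095, 0.5801),
    "lambada_openai": (0.6427, 0.3551, 0.3683),
    "openbookqa": (0.2860, 0.2420, 0.2500),
    "piqa": (0.7709, 0.6812, 0.6817),
    "winogrande": (0.6614, 0.5588, 0.5770),
}
# lambada_openai perplexity (lower is better), same column order.
PERPLEXITY = (5.8770, 30.4832, 27.8582)

# Tasks whose accuracy rose between the half-epoch checkpoint and the final model.
improved_in_second_half = [
    task for task, (_, half, final) in ACC.items() if final > half
]
# Tasks where the final model is still below the base model.
below_base = [task for task, (base, _, final) in ACC.items() if final < base]

if __name__ == "__main__":
    for task, (base, half, final) in ACC.items():
        print(f"{task:15s} base={base:.4f} half={half:.4f} "
              f"final={final:.4f} second-half change={final - half:+.4f}")
    print("improved in second half:", improved_in_second_half)
    print("still below base:", below_base)
```

Note that the differences for arc_challenge and piqa are well inside the reported standard errors, so they should not be read as real changes in either direction.
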

Environmental Impact

  • Hardware Type: RTX 3090
  • Hours used: 118
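
For a rough sense of scale, the reported hours can be converted into an energy estimate. This is only a back-of-the-envelope sketch: the actual power draw was not measured or reported, so the 350 W figure (the RTX 3090 reference board power) is an assumption that gives an approximate upper bound for the GPU alone. The function names are hypothetical, and the grid carbon intensity is left as a parameter because it depends on location.

```python
HOURS_USED = 118  # from the list above

# Assumption: the GPU ran at its 350 W reference board power the whole time.
ASSUMED_POWER_WATTS = 350


def energy_kwh(hours: float, watts: float) -> float:
    """Energy in kilowatt-hours for a constant power draw."""
    return hours * watts / 1000


def emissions_kg(kwh: float, grid_kg_co2_per_kwh: float) -> float:
    """CO2 estimate for a given grid carbon intensity (caller-supplied)."""
    return kwh * grid_kg_co2_per_kwh


if __name__ == "__main__":
    print(f"{energy_kwh(HOURS_USED, ASSUMED_POWER_WATTS):.1f} kWh (GPU only, assumed power)")
```
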