Llama-3.1-8B-Fusion-8020

Overview

Llama-3.1-8B-Fusion-8020 is a mixed model that combines two Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in an 8:2 ratio, with 80% of the weights from SuperNova-Lite and 20% from the abliterated Meta-Llama-3.1-8B-Instruct model. Although it is a simple mix, the model is usable and no gibberish has appeared in its output. This is an experiment: I tested the 9:1, 8:2, 7:3, 6:4 and 5:5 ratios separately to see how much the ratio affects the model. Evaluation reports for all of these models will be provided subsequently.

Model Details

Key Features

  • SuperNova-Lite Contributions (80%): Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
  • Meta-Llama-3.1-8B-Instruct-abliterated Contributions (20%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.
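
The 80/20 blend described above can be pictured as a per-parameter linear interpolation. The sketch below is only an illustration under that assumption; the exact merging script is not shown in this card, and `blend_weights` and the toy parameter names are hypothetical. It works on any values that support scalar multiplication and addition, so the same function could be applied to the tensors of two model state dicts.

```python
def blend_weights(primary, secondary, primary_share=0.8):
    """Linearly interpolate two state-dict-like mappings, key by key.

    Hypothetical helper: assumes the blend is a plain weighted average,
    which this model card does not confirm in detail.
    """
    if not 0.0 <= primary_share <= 1.0:
        raise ValueError("primary_share must be between 0 and 1")
    if primary.keys() != secondary.keys():
        raise ValueError("both models must expose the same parameter names")
    secondary_share = 1.0 - primary_share
    return {
        name: primary_share * primary[name] + secondary_share * secondary[name]
        for name in primary
    }


# Toy example with scalar "weights" standing in for tensors.
supernova_lite = {"layer.weight": 1.0, "layer.bias": 0.0}
abliterated = {"layer.weight": 0.0, "layer.bias": 1.0}

fused = blend_weights(supernova_lite, abliterated, primary_share=0.8)
print(fused)
```

Changing `primary_share` to 0.9, 0.7, 0.6 or 0.5 would correspond to the other ratios tested in this experiment.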

Usage

You can use this mixed model in your applications by loading it with Hugging Face's transformers library:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-8020"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: build the chat messages for the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data; return_dict=True also returns the attention mask
    inputs = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True,
        return_tensors="pt", return_dict=True
    ).to(device)
    input_ids = inputs["input_ids"]

    # Use TextStreamer for streaming output; skip_prompt avoids echoing the input
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream it to stdout as it is decoded
    outputs = mixed_model.generate(
        **inputs,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")

Evaluations

All models were re-evaluated, and each score below is the average for that benchmark.

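| Benchmark  | SuperNova-Lite | Meta-Llama-3.1-8B-Instruct-abliterated | Llama-3.1-8B-Fusion-9010 | Llama-3.1-8B-Fusion-8020 | Llama-3.1-8B-Fusion-7030 | Llama-3.1-8B-Fusion-6040 | Llama-3.1-8B-Fusion-5050 |
|------------|----------------|----------------------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|
| IF_Eval    | 82.09          | 76.29                                  | 82.44                    | 82.93                    | 83.10                    | 82.94                    | 82.03                    |
| MMLU Pro   | 35.87          | 33.10                                  | 35.65                    | 35.32                    | 34.91                    | 34.50                    | 33.96                    |
| TruthfulQA | 64.35          | 53.25                                  | 62.67                    | 61.04                    | 59.09                    | 57.80                    | 56.75                    |
| BBH        | 49.48          | 44.87                                  | 48.86                    | 48.47                    | 48.30                    | 48.19                    | 47.93                    |
| GPQA       | 31.98          | 29.50                                  | 32.25                    | 32.38                    | 32.61                    | 31.14                    | 30.60                    |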

The script used for evaluation can be found in this repository at /eval.sh.
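
To make the evaluation table easier to compare, the following sketch copies its scores into Python and reports, for each benchmark, the top-scoring model and how Llama-3.1-8B-Fusion-8020 differs from SuperNova-Lite. The helper names `best_model` and `delta_vs_base` are hypothetical and exist only for this illustration; the numbers are taken directly from the table.

```python
# Column order matches the evaluation table.
MODELS = [
    "SuperNova-Lite",
    "Meta-Llama-3.1-8B-Instruct-abliterated",
    "Llama-3.1-8B-Fusion-9010",
    "Llama-3.1-8B-Fusion-8020",
    "Llama-3.1-8B-Fusion-7030",
    "Llama-3.1-8B-Fusion-6040",
    "Llama-3.1-8B-Fusion-5050",
]

# Scores copied from the evaluation table (higher is better).
SCORES = {
    "IF_Eval": [82.09, 76.29, 82.44, 82.93, 83.10, 82.94, 82.03],
    "MMLU Pro": [35.87, 33.10, 35.65, 35.32, 34.91, 34.50, 33.96],
    "TruthfulQA": [64.35, 53.25, 62.67, 61.04, 59.09, 57.80, 56.75],
    "BBH": [49.48, 44.87, 48.86, 48.47, 48.30, 48.19, 47.93],
    "GPQA": [31.98, 29.50, 32.25, 32.38, 32.61, 31.14, 30.60],
}


def best_model(benchmark):
    """Return the name of the highest-scoring model on a benchmark."""
    row = SCORES[benchmark]
    return MODELS[row.index(max(row))]


def delta_vs_base(benchmark, model="Llama-3.1-8B-Fusion-8020", base="SuperNova-Lite"):
    """Return the score difference (model minus base), rounded to 2 decimals."""
    row = SCORES[benchmark]
    return round(row[MODELS.index(model)] - row[MODELS.index(base)], 2)


for name in SCORES:
    print(f"{name}: best = {best_model(name)}, 8020 vs SuperNova-Lite = {delta_vs_base(name):+.2f}")
```

On these numbers, the 8:2 mix is slightly ahead of SuperNova-Lite on IF_Eval and GPQA and behind it on MMLU Pro, TruthfulQA and BBH.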
