---
language: en
license: apache-2.0
---

# Shears Model Card: shears-llama-13b-50-math-heuristic-adapter

This model is the heuristic adapter (subnetwork) discovered from the [super-adapter](https://huggingface.co/IntelLabs/shears-llama-13b-50-math-super-adapter), which was fine-tuned with Shears on sparsified LLaMA-13B using math reasoning datasets.

## Model Details

### Information

- **Model name:** shears-llama-13b-50-math-heuristic-adapter
- **Base model:** Sparsified [LLaMA-13B](https://huggingface.co/yahma/llama-13b-hf)
- **Sparsity:** 50%
- **Domain:** Math
- **Subnetwork version:** Heuristic
- **NNCF Configuration:** [nncf_shears_llama.json](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears/nncf_config/nncf_shears_llama.json)

### Sparsified Base Model

Shears uses [Wanda](https://arxiv.org/abs/2306.11695), a simple but effective pruning approach, to sparsify the language model. The sparsified model then serves as the base model.
Clone the [Wanda](https://github.com/locuslab/wanda) repo:

```bash
git clone https://github.com/locuslab/wanda.git && cd wanda && git checkout 8e8fc87 && cd ..
```

The following command prunes LLaMA-13B with Wanda to 50% unstructured sparsity:

```bash
python wanda/main.py \
    --model yahma/llama-13b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save wanda_out \
    --save_model shears-llama-13b-50-base
```
- `--model`: The identifier for the model on the Hugging Face model hub or local path.
- `--sparsity_ratio`: Specifies the fraction of weights to be pruned (`0.5` corresponds to 50%).
- `--save_model`: Specifies the directory where the pruned language model will be stored.

Refer to our [repo](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears#setup) for the environment information to run this command.
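
To make the pruning step more concrete, here is a minimal pure-Python sketch of the idea behind Wanda: each weight is scored by its magnitude times the L2 norm of the corresponding input activation, and the lowest-scoring weights within each output row are set to zero. This is an illustration only, not the code in the Wanda repo; the function names `wanda_prune_row` and `wanda_prune` and the toy values are assumptions made for this example.

```python
def wanda_prune_row(weights, act_norms, sparsity_ratio):
    """Zero the lowest-scoring weights of one output row (illustrative helper)."""
    # Wanda-style score: |weight| * L2 norm of the matching input feature.
    scores = [abs(w) * n for w, n in zip(weights, act_norms)]
    num_prune = int(len(weights) * sparsity_ratio)
    # Indices of the `num_prune` smallest scores in this row.
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    prune_idx = set(order[:num_prune])
    return [0.0 if i in prune_idx else w for i, w in enumerate(weights)]


def wanda_prune(weight_matrix, act_norms, sparsity_ratio=0.5):
    """Prune every output row independently (unstructured sparsity)."""
    return [wanda_prune_row(row, act_norms, sparsity_ratio) for row in weight_matrix]


# Toy example: 2 output rows, 4 input features (values are made up).
W = [[0.9, -0.1, 0.4, -0.6],
     [0.2, 0.8, -0.3, 0.05]]
# Hypothetical per-feature activation norms; feature 1 has large activations.
norms = [1.0, 10.0, 1.0, 1.0]

pruned = wanda_prune(W, norms, sparsity_ratio=0.5)
for row in pruned:
    print(row)
```

In the first row the small weight `-0.1` survives because its input feature has a large activation norm, which is the difference from plain magnitude pruning.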

### Adapter Configuration

- **LoRA rank:** 32 (24 in the heuristic subnetwork)
- **LoRA alpha:** 64
- **LoRA target modules:** q_proj, k_proj, v_proj, up_proj, down_proj
- **LoRA rank search space:** [32, 24, 16] (for each LoRA module)
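
As a rough illustration of what the rank choice means in terms of adapter size, the sketch below counts LoRA parameters for the target modules listed above at each rank in the search space. It assumes the usual LLaMA-13B dimensions (hidden size 5120, MLP intermediate size 13824, 40 layers) and a standard LoRA layout with `rank * (in_features + out_features)` parameters per module. These dimensions and the helper `lora_params` are assumptions for this example, not values taken from the Shears configuration.

```python
# Assumed LLaMA-13B dimensions (not stated in this model card).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 40

# (in_features, out_features) for each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank):
    """Total LoRA parameters if every target module uses the same rank."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values())
    return NUM_LAYERS * per_layer


search_space = [32, 24, 16]
for rank in search_space:
    print(f"rank {rank}: {lora_params(rank):,} LoRA parameters")

# The middle value of the search space matches the rank of 24 stated above
# for the heuristic subnetwork.
middle_rank = sorted(search_space)[len(search_space) // 2]
print(f"middle of search space: {middle_rank}")
```

Under these assumptions, a rank-24 configuration carries three quarters of the adapter parameters of the maximal rank-32 configuration.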

### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 3e-4
- **Epochs:** 3

### Training Data

Unified math reasoning dataset: [math_10k.json](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/ft-training_set/math_10k.json) (built from the training sets of GSM8K, MAWPS, and AQuA).

### Evaluation Data

[GSM8K](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/gsm8k/test.json), [AQuA](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/AQuA/test.json), [MAWPS](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/mawps/test.json), [SVAMP](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/SVAMP/test.json)


## How to use

Use our modified PEFT library (apply [patch](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears/patches/peft-modifications-for-shears-inference-usage.patch)):
```bash
git clone https://github.com/huggingface/peft.git
cd peft && git checkout v0.5.0 && git apply --ignore-space-change --ignore-whitespace peft-modifications-for-shears-inference-usage.patch && pip install -e . && cd ..
```

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

def generate_prompt(instruction):
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request. 

                    ### Instruction:
                    {instruction}

                    ### Response:
                    """

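# Load the sparsified base model produced by the Wanda step (local directory).
base_model = AutoModelForCausalLM.from_pretrained("shears-llama-13b-50-base")
# Attach the heuristic adapter to the sparsified base model.
model = PeftModel.from_pretrained(base_model, "IntelLabs/shears-llama-13b-50-math-heuristic-adapter")
model.eval()

# Count the non-zero parameters of the base model plus adapter.
non_zero_params = sum((param.data != 0).sum().item() for _, param in model.named_parameters())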
print(f"Number of all non-zero parameters: {non_zero_params}")

tokenizer = AutoTokenizer.from_pretrained("shears-llama-13b-50-base")

instruction = "Edgar eats 18 pretzels a day. If his brother eats 1/2 as many, how many does his brother eat in a week?"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
        use_cache=True,
        num_beams=4,
    )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
print(output)

```
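
The decoded output of the snippet above contains the full prompt followed by the model's answer. A small post-processing helper, sketched below, can keep only the text after the `### Response:` marker. The helper name `extract_response` and the sample string are hypothetical; the sample is hand-written for illustration and is not actual model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded):
    """Return the text after the last response marker, without the EOS token."""
    _, marker, tail = decoded.rpartition(RESPONSE_MARKER)
    if not marker:
        # Marker not found: fall back to the whole string.
        tail = decoded
    # `tokenizer.decode` without `skip_special_tokens=True` keeps "</s>".
    return tail.replace("</s>", "").strip()


# Hand-written stand-in for a decoded generation (not real model output).
sample = (
    "<s> Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. \n\n"
    "                    ### Instruction:\n"
    "                    How many pretzels does his brother eat in a week?\n\n"
    "                    ### Response:\n"
    "                    9 * 7 = 63</s>"
)

print(extract_response(sample))
```

Alternatively, passing `skip_special_tokens=True` to `tokenizer.decode` drops the special tokens before any string handling.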

## Evaluation Results

| Model                 | Sparsity    | GSM8K | AQuA  | MAWPS | SVAMP | Average |
|-----------------------|-------------|-------|-------|-------|-------|---------|
| LLaMA-7B-LoRA         | -           | 37.5  | 18.9  | 79.0  | 52.1  | 46.9    |
| [**LLaMA-7B-Shears**](https://huggingface.co/IntelLabs/shears-llama-7b-50-math-heuristic-adapter)   | **50%**     | 36.1  | 22.0  | 78.6  | 44.5  | 45.3    |
| LLaMA-13B-LoRA        | -           | 47.5  | 18.5  | 83.6  | 54.6  | 51.1    |
| [**LLaMA-13B-Shears**](https://huggingface.co/IntelLabs/shears-llama-13b-50-math-heuristic-adapter)  | **50%**     | 45.1  | 22.0  | 83.2  | 53.3  | 50.9    |
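
The Average column is the arithmetic mean of the four benchmark scores. The short sketch below recomputes it from the table and prints the gap between the dense LoRA baseline and the 50%-sparse Shears model for each size; the dictionary layout and the helper `average` are illustrative choices.

```python
# Scores copied from the table: GSM8K, AQuA, MAWPS, SVAMP.
results = {
    "LLaMA-7B-LoRA": [37.5, 18.9, 79.0, 52.1],
    "LLaMA-7B-Shears": [36.1, 22.0, 78.6, 44.5],
    "LLaMA-13B-LoRA": [47.5, 18.5, 83.6, 54.6],
    "LLaMA-13B-Shears": [45.1, 22.0, 83.2, 53.3],
}


def average(scores):
    """Arithmetic mean over the benchmark scores."""
    return sum(scores) / len(scores)


for name, scores in results.items():
    print(f"{name}: average {average(scores):.2f}")

# Gap between the dense LoRA baseline and the sparse Shears model.
gap_7b = average(results["LLaMA-7B-LoRA"]) - average(results["LLaMA-7B-Shears"])
gap_13b = average(results["LLaMA-13B-LoRA"]) - average(results["LLaMA-13B-Shears"])
print(f"7B gap: {gap_7b:.2f} points, 13B gap: {gap_13b:.2f} points")
```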

## Model Sources

- **Repository:** [https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears)
- **Paper:** [Shears: Unstructured Sparsity with Neural Low-rank Adapter Search](https://arxiv.org/abs/2404.10934)

## Citation

```bibtex
@article{munoz2024shears,
  title = {Shears: Unstructured Sparsity with Neural Low-rank Adapter Search},
  author={J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
  journal={The 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2024)},
  year={2024}
}
```

## License

Apache-2.0