---
language: en
license: apache-2.0
library_name: peft
---
# Shears Adapter Card: shears-llama-13b-50-math-heuristic-adapter
This is the heuristic adapter (subnetwork) discovered from the [super-adapter](https://huggingface.co/IntelLabs/shears-llama-13b-50-math-super-adapter), which was fine-tuned with Shears on a sparsified LLaMA-13B using several math reasoning datasets.
## Paper Abstract
Recently, several approaches successfully demonstrated that weight-sharing Neural Architecture Search (NAS) can effectively explore a search space of elastic low-rank adapters (LoRA), allowing the parameter-efficient fine-tuning (PEFT) and compression of large language models. In this paper, we introduce a novel approach called Shears, demonstrating how the integration of cost-effective sparsity and a proposed Neural Low-rank adapter Search (NLS) algorithm can further improve the efficiency of PEFT approaches. Results demonstrate the benefits of Shears compared to other methods, reaching high sparsity levels while improving or with little drop in accuracy, utilizing a single GPU for a pair of hours.
## Model Details
### Note
Please note that we provide only the model adapter, not a copy of the base [yahma/llama-13b-hf](https://huggingface.co/yahma/llama-13b-hf) model or its sparsified version. To use this adapter, download the base model separately and follow [these instructions](#sparsified-base-model) to sparsify it.
### Information
- **Adapter name:** shears-llama-13b-50-math-heuristic-adapter
- **Base model:** Sparsified [LLaMA-13B](https://huggingface.co/yahma/llama-13b-hf)
- **Sparsity:** 50%
- **Domain:** Math
- **Subnetwork version:** Heuristic
- **NNCF Configuration:** [nncf_shears_llama.json](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears/nncf_config/nncf_shears_llama.json)
### Sparsified Base Model
Shears uses [Wanda](https://arxiv.org/abs/2306.11695), a simple but effective pruning approach, to sparsify the language model. The sparsified model then serves as the base model for the adapter.
Clone the [Wanda](https://github.com/locuslab/wanda) repo:
```bash
git clone https://github.com/locuslab/wanda.git && cd wanda && git checkout 8e8fc87 && cd ..
```
Use Wanda to prune LLaMA-13B to 50% unstructured sparsity:
```bash
python wanda/main.py \
--model yahma/llama-13b-hf \
--prune_method wanda \
--sparsity_ratio 0.5 \
--sparsity_type unstructured \
--save wanda_out \
--save_model shears-llama-13b-50-base
```
- `--model`: The identifier for the model on the Hugging Face model hub or local path.
- `--sparsity_ratio`: Specifies the fraction of weights to be pruned (`0.5` prunes 50% of the weights).
- `--save_model`: Specifies the directory where the pruned language model will be stored.
Refer to our [repo](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears#setup) for the environment information to run this command.
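To make the effect of `--prune_method wanda` with `--sparsity_ratio 0.5` more concrete, here is a toy sketch of the Wanda scoring rule in plain Python. It is not the code from the Wanda repository: the function names, the nested-list weight layout, and the tiny example matrices are all assumptions made for illustration. Each weight is scored by its magnitude times the L2 norm of its input feature over some calibration samples, and the lowest-scoring half of each output row is set to zero.

```python
import math


def wanda_prune(weight, activations, sparsity_ratio=0.5):
    """Toy Wanda-style unstructured pruning on nested Python lists.

    weight: rows are output neurons, columns are input features.
    activations: calibration samples, one list of input features per sample.
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration samples.
    feat_norm = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity_ratio)
    pruned = []
    for row in weight:
        # Importance score: |weight| * norm of the matching input feature.
        scores = [abs(w) * feat_norm[j] for j, w in enumerate(row)]
        # Indices of the lowest-scoring weights within this output row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


def sparsity(weight):
    """Fraction of entries that are exactly zero."""
    flat = [w for row in weight for w in row]
    return sum(1 for w in flat if w == 0) / len(flat)


# Hypothetical 2x4 weight matrix and two calibration samples.
W = [[0.9, -0.1, 0.4, 0.1],
     [-0.2, 0.8, 0.3, -0.6]]
X = [[1.0, 2.0, 0.1, 4.0],
     [1.0, -2.0, 0.1, 4.0]]

W_pruned = wanda_prune(W, X)
print(W_pruned)
print(sparsity(W_pruned))
```

In the first row the weight `0.4` is removed while the smaller `0.1` in the last column is kept, because the last input feature has much larger activations. This is the difference between this scoring rule and plain magnitude pruning.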
### Adapter Configuration
- **LoRA rank:** 32 (24 in the heuristic subnetwork)
- **LoRA alpha:** 64
- **LoRA target modules:** q_proj, k_proj, v_proj, up_proj, down_proj
- **LoRA rank search space:** [32, 24, 16] (for each LoRA module)
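As a rough sense of scale, the sketch below counts the LoRA parameters implied by this configuration. The LLaMA-13B dimensions (hidden size 5120, intermediate size 13824, 40 layers) are not stated in this card and are assumed here, as is the simplification that the heuristic subnetwork uses rank 24 in every module. Each adapted linear layer contributes `rank * (in_features + out_features)` parameters.

```python
# Assumed LLaMA-13B dimensions (not stated in this card).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 40

# (in_features, out_features) for each LoRA target module listed above.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
RANK_CHOICES = [32, 24, 16]


def lora_params(rank):
    """LoRA parameters (A and B matrices) summed over all target modules and layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * NUM_LAYERS


num_modules = len(TARGETS) * NUM_LAYERS
print(f"rank 32 everywhere: {lora_params(32):,} parameters")
print(f"rank 24 everywhere: {lora_params(24):,} parameters")
print(f"{num_modules} LoRA modules, {len(RANK_CHOICES)}^{num_modules} rank configurations")
```

Under these assumptions, a uniform rank of 24 needs three quarters of the adapter parameters of the maximal rank-32 configuration, and the per-module rank choice gives a search space far too large to enumerate.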
### Training Hyperparameters
- **Batch size:** 16
- **Learning rate:** 3e-4
- **Epochs:** 3
### Training Data
Unified math reasoning dataset: [math_10k.json](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/ft-training_set/math_10k.json) (collected with the training sets of GSM8K, MAWPS, and AQuA).
### Evaluation Data
[GSM8K](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/gsm8k/test.json), [AQuA](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/AQuA/test.json), [MAWPS](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/mawps/test.json), [SVAMP](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/SVAMP/test.json)
## How to use
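Install our modified PEFT library, which is PEFT v0.5.0 with this [patch](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears/patches/peft-modifications-for-shears-inference-usage.patch) applied. The command below expects the patch file to be present in the `peft` directory: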
```bash
git clone https://github.com/huggingface/peft.git
cd peft && git checkout v0.5.0 && git apply --ignore-space-change --ignore-whitespace peft-modifications-for-shears-inference-usage.patch && pip install -e . && cd ..
```
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer


def generate_prompt(instruction):
    # Instruction-style prompt template used for the math reasoning tasks.
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
"""


# Load the sparsified base model saved by the Wanda step, then attach the adapter.
base_model = AutoModelForCausalLM.from_pretrained("shears-llama-13b-50-base")
model = PeftModel.from_pretrained(base_model, "IntelLabs/shears-llama-13b-50-math-heuristic-adapter")
model.eval()

# Count the non-zero parameters of the sparsified model plus adapter.
non_zero_params = sum([(param.data != 0).sum().item() for _, param in model.named_parameters()])
print(f"Number of all non-zero parameters: {non_zero_params}")

tokenizer = AutoTokenizer.from_pretrained("shears-llama-13b-50-base")

instruction = "Edgar eats 18 pretzels a day. If his brother eats 1/2 as many, how many does his brother eat in a week?"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

# Generate with beam search; gradients are not needed for inference.
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
        use_cache=True,
        num_beams=4,
    )

# The decoded sequence contains the prompt followed by the generated response.
s = generation_output.sequences[0]
output = tokenizer.decode(s)
print(output)
```
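The decoded `output` above contains the full prompt as well as the generated text. A minimal post-processing sketch follows, assuming the response is whatever comes after the last `### Response:` marker and that the final number in it is the answer. The helper names and the sample string are hypothetical; the sample is hand-written, not real model output.

```python
import re


def extract_response(decoded):
    """Return the text after the last '### Response:' marker, without BOS/EOS markers."""
    response = decoded.split("### Response:")[-1]
    # Remove LLaMA-style special tokens if the tokenizer left them in the string.
    return response.replace("<s>", "").replace("</s>", "").strip()


def extract_last_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


# Hand-written stand-in for a decoded generation (not real model output).
sample = (
    "<s> Below is an instruction that describes a task.\n"
    "### Instruction:\nEdgar eats 18 pretzels a day. ...\n"
    "### Response:\n"
    "His brother eats 18 / 2 = 9 pretzels a day, so 9 * 7 = 63 pretzels a week.</s>"
)

response = extract_response(sample)
print(response)
print(extract_last_number(response))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is an alternative to stripping the special tokens by hand.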
## Evaluation Results
| Model | Sparsity | GSM8K | AQuA | MAWPS | SVAMP | Average |
|-----------------------|-------------|-------|-------|-------|-------|---------|
| LLaMA-7B-LoRA | - | 37.5 | 18.9 | 79.0 | 52.1 | 46.9 |
| [**LLaMA-7B-Shears**](https://huggingface.co/IntelLabs/shears-llama-7b-50-math-heuristic-adapter) | **50%** | 36.1 | 22.0 | 78.6 | 44.5 | 45.3 |
| LLaMA-13B-LoRA | - | 47.5 | 18.5 | 83.6 | 54.6 | 51.1 |
| [**LLaMA-13B-Shears**](https://huggingface.co/IntelLabs/shears-llama-13b-50-math-heuristic-adapter) | **50%** | 45.1 | 22.0 | 83.2 | 53.3 | 50.9 |
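The Average column can be reproduced from the four per-dataset scores. The sketch below assumes it is an unweighted mean rounded half-up to one decimal; the card does not state the rounding rule. The scores are copied from the table as strings so that `Decimal` avoids binary floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset accuracies (GSM8K, AQuA, MAWPS, SVAMP) copied from the table above.
RESULTS = {
    "LLaMA-7B-LoRA": ("37.5", "18.9", "79.0", "52.1"),
    "LLaMA-7B-Shears": ("36.1", "22.0", "78.6", "44.5"),
    "LLaMA-13B-LoRA": ("47.5", "18.5", "83.6", "54.6"),
    "LLaMA-13B-Shears": ("45.1", "22.0", "83.2", "53.3"),
}


def average(scores):
    """Unweighted mean of the scores, rounded half-up to one decimal place."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for name, scores in RESULTS.items():
    print(f"{name}: {average(scores)}")

# Gap in average accuracy between dense LoRA and the 50%-sparse Shears model.
gap_13b = average(RESULTS["LLaMA-13B-LoRA"]) - average(RESULTS["LLaMA-13B-Shears"])
gap_7b = average(RESULTS["LLaMA-7B-LoRA"]) - average(RESULTS["LLaMA-7B-Shears"])
print(f"13B gap: {gap_13b} points, 7B gap: {gap_7b} points")
```

With these numbers, the 13B Shears model trails the dense LoRA baseline by 0.2 points on average at 50% sparsity, and the 7B model by 1.6 points.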
## Model Sources
- **Repository:** [https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears)
- **Paper:** [Shears: Unstructured Sparsity with Neural Low-rank Adapter Search](https://arxiv.org/abs/2404.10934)
## Ethical Considerations
Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
| Ethical Considerations | Description |
| ----------- | ----------- |
| Data | The adapter was trained using the math_10k.json data mixture as described above. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
| Use cases | - |
## Citation
```bibtex
@inproceedings{munoz2024shears,
    title = {Shears: Unstructured Sparsity with Neural Low-rank Adapter Search},
    author = {J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
    booktitle = {The 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2024)},
    year = {2024}
}
```
## License
Apache-2.0