# Eric Hartford's WizardLM Uncensored Falcon 40B GGML

These files are GGCC format model files for [Eric Hartford's WizardLM Uncensored Falcon 40B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b).

These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.

GGCC is a new format created in a new fork of llama.cpp that introduced this new Falcon GGML-based support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).

Currently these files will also not work with code that previously supported Falcon, such as LoLLMs Web UI and ctransformers, but support should be added soon.

For GGMLv3 files compatible with those UIs, [please see the old `ggmlv3` branch](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML/tree/ggmlv3).
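If you prefer to fetch from that branch on the command line, one possible approach is shown below, assuming Hugging Face's standard `resolve` URL scheme; the filename is a placeholder, since the branch's exact contents are not listed here:

```
# <filename> is a placeholder; check the branch's file listing for real names.
wget https://huggingface.co/TheBloke/falcon-40b-instruct-GGML/resolve/ggmlv3/<filename>.bin
```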
## Repositories available
<!-- compatibility_ggml start -->
## Compatibility

To build cmp-nct's fork of llama.cpp with Falcon support plus CUDA acceleration, please try the following steps:

```
git clone https://github.com/cmp-nct/ggllm.cpp
```

Adjust `-t 8` (the number of CPU cores to use) according to what performs best on your system.

`-b 1` reduces batch size to 1. This slightly slows prompt evaluation, but frees up VRAM to load more of the model onto your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
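
A rough sketch of how the remaining build and run steps might look, assuming the fork follows the usual llama.cpp CMake workflow and ships a `falcon_main` binary (assumptions, not confirmed by this excerpt):

```
# Sketch under assumptions: the GGML_CUBLAS flag, the falcon_main binary
# and the -ngl offload flag follow llama.cpp conventions and are not
# confirmed by this excerpt.
cd ggllm.cpp
mkdir build && cd build
cmake -DGGML_CUBLAS=1 ..            # enable CUDA; omit for CPU-only builds
cmake --build . --config Release

# Example run using the -t and -b flags discussed above; -ngl offloads
# layers to the GPU, and -m points at a quantized file from the table below.
./bin/falcon_main -t 8 -b 1 -ngl 100 -m wizard-falcon40b.ggccv1.q4_K.bin -p "What is a falcon?"
```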

Please see https://github.com/cmp-nct/ggllm.cpp for further details and instructions.
<!-- compatibility_ggml end -->
## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| wizard-falcon40b.ggccv1.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q2_K for all tensors. |
| wizard-falcon40b.ggccv1.q3_K.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q3_K for all tensors. |
| wizard-falcon40b.ggccv1.q4_K.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q4_K for all tensors. |
| wizard-falcon40b.ggccv1.q5_K.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q5_K for all tensors. |
| wizard-falcon40b.ggccv1.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | Uses GGML_TYPE_Q8_K, 6-bit quantization, for all tensors. |
| wizard-falcon40b.ggccv1.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
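
For example, with the q4_K file the 26.04 GB maximum is the 23.54 GB of quantized weights plus roughly 2.5 GB of working overhead (the same margin holds for each row above); offloading half of the model's layers would shift roughly half of those 23.54 GB from RAM to VRAM.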