TheBloke committed on
Commit
b59e7d2
·
1 Parent(s): 31281d8

Update README.md

  1. README.md +15 -17
README.md CHANGED
@@ -19,14 +19,15 @@ license: other
19
 
20
  # Eric Hartford's WizardLM Uncensored Falcon 40B GGML
21
 
22
- These files are **experimental** GGML format model files for [Eric Hartford's WizardLM Uncensored Falcon 40B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b).
23
 
24
- These GGML files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.
25
 
26
- They can be used from:
27
- * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui).
28
- * The ctransformers Python library, which includes LangChain support: [ctransformers](https://github.com/marella/ctransformers).
29
- * A new fork of llama.cpp that introduced this new Falcon GGML support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
 
30
 
31
  ## Repositories available
32
 
@@ -38,11 +39,7 @@ They can be used from:
38
  <!-- compatibility_ggml start -->
39
  ## Compatibility
40
 
41
- The recommended UI for these GGMLs is [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui). Preliminary CUDA GPU acceleration is provided.
42
-
43
- For use from Python code, use [ctransformers](https://github.com/marella/ctransformers). Again, with preliminary CUDA GPU acceleration
44
-
45
- Or to build cmp-nct's fork of llama.cpp with Falcon 7B support plus preliminary CUDA acceleration, please try the following steps:
46
 
47
  ```
48
  git clone https://github.com/cmp-nct/ggllm.cpp
@@ -63,17 +60,18 @@ Adjust `-t 8` (the number of CPU cores to use) according to what performs best o
63
 
64
  `-b 1` reduces batch size to 1. This slightly slows prompt evaluation, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
65
 
 
66
  <!-- compatibility_ggml end -->
67
 
68
  ## Provided files
69
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
70
  | ---- | ---- | ---- | ---- | ---- | ----- |
71
- | wizard-falcon40b.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q2_K for all tensors. |
72
- | wizard-falcon40b.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q3_K for all tensors |
73
- | wizard-falcon40b.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q4_K for all tensors |
74
- | wizard-falcon40b.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q5_K for all tensors |
75
- | wizard-falcon40b.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | Uses GGML_TYPE_Q6_K - 6-bit quantization - for all tensors |
76
- | wizard-falcon40b.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
77
 
78
  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
79
 
 
19
 
20
  # Eric Hartford's WizardLM Uncensored Falcon 40B GGML
21
 
22
+ These files are GGCC format model files for [Eric Hartford's WizardLM Uncensored Falcon 40B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b).
23
 
24
+ These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.
25
 
26
+ GGCC is a new format created in a new fork of llama.cpp that introduced this new Falcon GGML-based support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
27
+
28
+ Currently these files will also not work with code that previously supported Falcon, such as LoLLMs Web UI and ctransformers. But support should be added soon.
29
+
30
+ For GGMLv3 files compatible with those UIs, [please see the old `ggmlv3` branch](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML/tree/ggmlv3).
31
 
32
  ## Repositories available
33
 
 
39
  <!-- compatibility_ggml start -->
40
  ## Compatibility
41
 
42
+ To build cmp-nct's fork of llama.cpp with Falcon support plus CUDA acceleration, please try the following steps:
 
 
 
 
43
 
44
  ```
45
  git clone https://github.com/cmp-nct/ggllm.cpp
 
60
 
61
  `-b 1` reduces batch size to 1. This slightly slows prompt evaluation, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
62
 
63
+ Please see https://github.com/cmp-nct/ggllm.cpp for further details and instructions.
64
  <!-- compatibility_ggml end -->
65
 
66
  ## Provided files
67
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
68
  | ---- | ---- | ---- | ---- | ---- | ----- |
69
+ | wizard-falcon40b.ggccv1.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q2_K for all tensors. |
70
+ | wizard-falcon40b.ggccv1.q3_K.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q3_K for all tensors |
71
+ | wizard-falcon40b.ggccv1.q4_K.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q4_K for all tensors |
72
+ | wizard-falcon40b.ggccv1.q5_K.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q5_K for all tensors |
73
+ | wizard-falcon40b.ggccv1.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | Uses GGML_TYPE_Q6_K - 6-bit quantization - for all tensors |
74
+ | wizard-falcon40b.ggccv1.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
75
 
76
  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
77
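
Reviewer note on the "Provided files" table: in every row, "Max RAM required" equals the file size plus 2.50 GB. The README does not state this rule, so the constant below is an inference from the table. The helper names (`max_ram_gb`, `largest_quant_that_fits`) are hypothetical and not part of ggllm.cpp or any other tool. A minimal Python sketch, assuming no GPU offloading, for picking the largest quantization that fits in a given amount of RAM:

```python
# File sizes in GB, copied from the "Provided files" table in this diff.
FILE_SIZES_GB = {
    "q2_K": 13.74,
    "q3_K_S": 17.98,
    "q4_K_S": 23.54,
    "q5_K_S": 28.77,
    "q6_K": 34.33,
    "q8_0": 44.46,
}

# Assumption: "Max RAM required" = file size + 2.50 GB. This matches every
# row of the table but is not documented in the README itself.
OVERHEAD_GB = 2.50


def max_ram_gb(quant):
    """Estimated RAM (GB) needed for a quant method, with no GPU offloading."""
    return round(FILE_SIZES_GB[quant] + OVERHEAD_GB, 2)


def largest_quant_that_fits(available_ram_gb):
    """Return the largest quant whose estimated RAM fits, or None if none do."""
    fitting = [q for q in FILE_SIZES_GB if max_ram_gb(q) <= available_ram_gb]
    return max(fitting, key=FILE_SIZES_GB.get) if fitting else None


if __name__ == "__main__":
    for quant in FILE_SIZES_GB:
        print(quant, max_ram_gb(quant))
    print(largest_quant_that_fits(32.0))
```

As the note in the README says, offloading layers to the GPU moves part of this requirement to VRAM, so these estimates are an upper bound for CPU-only use.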