flan-ul2: candle quants

Quantized GGUF weights of google/flan-ul2 for use with candle.

cargo run --example quantized-t5 --release -- \
    --model-id pszemraj/candle-flanUL2-quantized \
    --weight-file flan-ul2-q3k.gguf \
    --prompt "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?" \
    --temperature 0

On my laptop (CPU, running in WSL), this produces: 45 tokens generated (0.48 token/s).

weights

Below are the weights/file names in this repo:

Weight File Name     Quant Format   Size (GB)
flan-ul2-q2k.gguf    q2k            6.39
flan-ul2-q3k.gguf    q3k            8.36
flan-ul2-q4k.gguf    q4k            10.9
flan-ul2-q6k.gguf    q6k            16
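
As a rough sanity check on these sizes, the sketch below estimates the effective bits per weight for each file. It assumes the parameter count of roughly 19.5B that the Hub page reports for this model, and that the sizes are decimal gigabytes (1 GB = 10^9 bytes). Both are assumptions, so treat the results as approximate. The `bits_per_weight` helper is a hypothetical name used only for illustration.

```shell
# Rough estimate: bits per weight = (file size in bytes * 8) / parameter count.
# Assumptions: ~19.5e9 parameters, sizes given in decimal GB (1 GB = 1e9 bytes).
PARAMS_B=19.5   # parameter count in billions (assumed)

# bits_per_weight SIZE_GB: print the estimate with two decimals.
bits_per_weight() {
    awk -v gb="$1" -v p="$PARAMS_B" 'BEGIN { printf "%.2f\n", gb * 8 / p }'
}

# quant:size pairs copied from the table above
for entry in q2k:6.39 q3k:8.36 q4k:10.9 q6k:16; do
    quant=${entry%%:*}
    size=${entry#*:}
    printf '%s  %s GB  ~%s bits/weight\n' "$quant" "$size" "$(bits_per_weight "$size")"
done
```

Under these assumptions the estimates come out to roughly 2.6, 3.4, 4.5, and 6.6 bits per weight. This includes any tensors stored at a different precision, so the figures are an average over the whole file rather than the nominal width of each quant format.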

From initial testing:

  • q2k appears to be too low-precision and produces poor or incoherent output. q3k and higher are coherent.
  • Interestingly, there is no noticeable increase in computation time (again, on CPU) when using higher-precision quants. I get the same token/s for q3k and q6k, within ±0.02.
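
One way to repeat this comparison is to run the same prompt against each weight file and compare the token/s that candle reports. The sketch below is a dry run by default: it only prints the commands, so nothing large is downloaded by accident. The `run_quant` helper, the `DRY_RUN` variable, and the shortened prompt are illustrative assumptions; the flags are the ones used in the example command in this README.

```shell
MODEL_ID="pszemraj/candle-flanUL2-quantized"
PROMPT="The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"

# run_quant QUANT: build the candle command for one quant level, then either
# print it (DRY_RUN=1, the default here) or execute it (DRY_RUN=0).
run_quant() {
    set -- cargo run --example quantized-t5 --release -- \
        --model-id "$MODEL_ID" \
        --weight-file "flan-ul2-$1.gguf" \
        --prompt "$PROMPT" \
        --temperature 0
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# q2k is left out because its output was incoherent in initial testing.
for q in q3k q4k q6k; do
    run_quant "$q"
done
```

To actually run the comparison, execute the script from inside the candle checkout with `DRY_RUN=0`. Each weight file is several GB, so the first run for each quant will spend time downloading.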

setup

This assumes you already have Rust installed.

git clone https://github.com/huggingface/candle.git
cd candle
cargo build
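
Before cloning, it can help to confirm that the required tools are on your PATH. This is a minimal sketch, with `need` as a hypothetical helper name. It only checks that `git` and `cargo` exist and does not verify their versions.

```shell
# need CMD: return 0 if CMD is on PATH, otherwise print a hint and return 1.
need() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing: $1 (install it before building candle)" >&2
    return 1
}

missing=0
for tool in git cargo; do
    need "$tool" || missing=1
done

if [ "$missing" -eq 0 ]; then
    echo "toolchain looks ready"
fi
```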