Will you release a small version for consumer hardware, like the v2 generation?

#35
by anon-linux-mint - opened

I think it would be very useful to center this model on CHAT more than on other functions like code and math.


The ONLY way for ordinary people to run it is on a CPU, using a motherboard with 12 RAM slots (I don't see any other way; there won't be consumer GPUs with that much VRAM in the near future). Such motherboards have been made for many years, usually for Xeons. Asus and Gigabyte have always made them, and today they make very fancy server boards for the newest DDR5 RAM. But you can also get their older DDR4 boards; I got one for only $150 from China. These motherboards can hold a terabyte of RAM in total, and the GPU can work as an offload accelerator in tools like oobabooga or LM Studio, which are based on llama.cpp. Of course there's a bandwidth problem, but there's no other way. The CPU should have no fewer than 12 cores. The latest AMD CPUs with 100+ cores will, I'm sure, be very fast, on the level of a good GPU, but they cost thousands. Home cluster solutions, like a network of several PCs, are too weird and not useful today; usually they are projects built from a chain of super expensive Macs, where RAM and storage have historically been a painful topic.
About RAM: in such boards you can mostly use only LRDIMM modules, which are twice the size of ordinary ones. Used server DDR4 LRDIMMs at 64 GB were quite cheap, about $90 each; 128 GB modules are still rare and quite expensive. LRDIMMs with heat spreaders run very hot, twice as warm as ordinary RAM, and together in a 12-slot board they can reach 90 °C without additional ventilation.
Storage. These huge models are already beyond the limits of current PC hardware. There's the problem of loading such a huge model from storage into RAM. Even NVMe SSD speed is no longer enough for such sizes, and NVMe is very expensive at a terabyte or more.
I've tested loading Meta Llama 3 405B from an ordinary hard drive: loading it into RAM at q5 quality takes 30 minutes, and about 10-15 minutes from NVMe. The model needs to be loaded only once at the start; after that it works from RAM only.
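The load times above can be sanity-checked with simple arithmetic. This is only a rough sketch: the 5.5 bits per weight for a q5 file and the drive speeds are assumed round numbers, not measurements, and the function names are made up for illustration.

```python
def load_time_minutes(size_gb: float, read_mb_per_s: float) -> float:
    """Minutes needed to read size_gb (decimal GB) at a sustained read speed."""
    return size_gb * 1000 / read_mb_per_s / 60


def effective_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput implied by an observed load time."""
    return size_gb * 1000 / (minutes * 60)


# Assumption: 405B parameters at about 5.5 bits per weight for a q5 file.
Q5_FILE_GB = 405e9 * 5.5 / 8 / 1e9  # about 278 GB

# Assumed sustained sequential read speeds, not measurements.
DRIVES_MB_PER_S = {"hard drive": 150, "SATA SSD": 500, "NVMe SSD": 3000}

if __name__ == "__main__":
    for name, speed in DRIVES_MB_PER_S.items():
        print(f"{name}: {load_time_minutes(Q5_FILE_GB, speed):.1f} min")
    # Throughput implied by the 10-15 minute NVMe load reported above.
    for minutes in (10, 15):
        print(f"{minutes} min -> {effective_mb_per_s(Q5_FILE_GB, minutes):.0f} MB/s")
```

With these assumed numbers the hard drive case comes out at about 31 minutes, close to the 30 minutes reported. A 10-15 minute NVMe load would correspond to only about 310-460 MB/s on average, which suggests that something other than raw drive speed was the limit in that test.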
Running THIS model. The problem I see is that the first converted BF16 files are twice as big as the original model, and anything above a terabyte in storage is unrealistic to run locally today. The BF16 weights need to be quantized to smaller sizes, preferably under a terabyte. I want to try the model, but on my hardware q4 or even q3 quality is probably the maximum.
When I used Meta Llama 405B, it disappointed me not so much by its speed (it was very slow) as by its censorship: it can't do much except very mundane tasks like calculation or encyclopedia lookups. I don't see the value in such a tool.
The latest DeepSeek V2.5 is for now the best and most valuable tool; it can do programming logic and code at a size of only 236B. It is much faster on a 22-core CPU than Meta Llama 405B. At q8 quality it uses about 270 GB of RAM, but q6, which is also good quality, may use less and fit in 4-slot motherboards that support 256 GB of RAM in total.
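A quick way to check which quantization fits a given board is to multiply the parameter count by the bits per weight. The sketch below is only an estimate: the bits-per-weight table holds approximate assumed values (real files vary by quantization scheme), the 16 GB overhead for context and buffers is a guess, and the helper names are hypothetical.

```python
# Approximate bits per weight; assumed values that vary by quantization scheme.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "q8": 8.5,
    "q6": 6.56,
    "q5": 5.5,
    "q4": 4.5,
    "q3": 3.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights in decimal GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, ram_gb: float,
         overhead_gb: float = 16.0) -> bool:
    """True if weights plus an assumed overhead fit in ram_gb."""
    return weights_gb(params_billion, quant) + overhead_gb <= ram_gb


if __name__ == "__main__":
    # 236B is the model size mentioned above; 256 GB is the 4-slot board limit.
    for quant in ("q8", "q6", "q4"):
        size = weights_gb(236, quant)
        print(f"236B {quant}: {size:.0f} GB, fits in 256 GB: {fits(236, quant, 256)}")
```

Under these assumptions the 236B model at q8 needs about 251 GB for the weights alone, which is in line with the roughly 270 GB observed in use and too much for a 256 GB board, while q6 comes to about 194 GB and fits.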


That's a little bit crazy 😂. I think most people who need a small model can probably only afford 16 GB of VRAM or less, so a 16B model (like DeepSeek V2 Lite) as a Q4 GGUF might be a good option.
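To put a rough number on that, the largest model that fits a VRAM budget can be estimated by working backwards from the memory. This is a sketch only: about 4.5 bits per weight for a Q4 file and a 2 GB reserve for context and buffers are assumptions, and `max_params_billion` is a made-up helper name.

```python
def max_params_billion(vram_gb: float, bits_per_weight: float,
                       reserve_gb: float = 2.0) -> float:
    """Largest parameter count (in billions) whose weights fit in VRAM.

    reserve_gb is an assumed allowance for context and runtime buffers.
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return usable_gb * 8 / bits_per_weight


if __name__ == "__main__":
    # Assumption: a Q4 file averages about 4.5 bits per weight.
    for vram in (8, 12, 16, 24):
        limit = max_params_billion(vram, 4.5)
        print(f"{vram} GB VRAM: up to about {limit:.1f}B parameters at Q4")
```

With these assumptions a 16 GB card has room for a bit under 25B parameters at Q4, so a 16B model fits with space to spare, while an 8 GB card tops out below 11B.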

I would love for DeepSeek V3 to run on a single H100.

