🐺🐦‍⬛ LLM Comparison/Test: DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B, Llama 3.3 70B, Nemotron 70B in my updated MMLU-Pro CS benchmark
- 🆕 Update 2025-01-04: Additional Insights After Further Analysis
- 🆕 Update 2025-01-05: Analyzed All Results For Unsolved Questions
Introduction
New year, new benchmarks! Tested some new models (DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B) that came out after my latest report, and some "older" ones (Llama 3.3 70B Instruct, Llama 3.1 Nemotron 70B Instruct) that I had not tested yet.
All of this is an update to my original report from December 2024 where you'll find further details about all the other (25!) models I've tested and compared in this series of MMLU-Pro CS benchmarks: LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs
New Models Tested
DeepSeek-V3 is THE new open-weights star, and it's a heavyweight at 671B, with 37B active parameters in its Mixture-of-Experts architecture. I tested it through the official DeepSeek API and it was quite fast (~50 tokens/s) and super cheap (66¢ for 4 runs at 1.4M tokens total).
Surprisingly, though, it didn't become the #1 local model - at least not in my MMLU-Pro CS benchmark, where it "only" scored 78%, the same as the much smaller Qwen2.5 72B and less than the even smaller QwQ 32B Preview! But it's still a great score and beats GPT-4o, Mistral Large, Llama 3.1 405B and most other models.
Plus, there are a lot of positive reports about this model - so definitely take a closer look at it (if you can run it, locally or through the API) and test it with your own use cases. This advice generally applies to all models and benchmarks!
That said, personally, I'm still on the fence as I've experienced some repetition issues that remind me of the old days of local LLMs. There could be various explanations for this, though, so I'll keep investigating and testing it further as it certainly is a milestone for open LLMs.
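For reference, here's a minimal sketch of how an API-based benchmark run can query DeepSeek-V3. It assumes the OpenAI-compatible endpoint and the `deepseek-chat` model name from DeepSeek's public documentation; the prompt is just a placeholder, not the exact MMLU-Pro template:

```python
# Minimal sketch: query DeepSeek-V3 through its OpenAI-compatible API.
# Base URL and model name are taken from DeepSeek's public docs (assumption);
# DEEPSEEK_API_KEY must be set in the environment.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek-V3 behind the chat endpoint
    messages=[
        {"role": "system", "content": "Think step by step, then finish with 'The answer is (X)'."},
        {"role": "user", "content": "Placeholder MMLU-Pro CS question with options (A) through (J)."},
    ],
    temperature=0.0,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```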
Llama 3.3 70B Instruct, the latest iteration of Meta's Llama series, focused on multilinguality, so its general performance doesn't differ much from its predecessors. Still, even quantized down to just 4-bit, it scored ~71%, which is a little better than the unquantized (!) Llama 3.1 70B Instruct and almost on par with gpt-4o-2024-11-20!
Not much else to say here; Llama has been somewhat overshadowed by the other models, especially those from China. So I'm looking forward to what Llama 4 will bring, hopefully soon.
Llama 3.1 Nemotron 70B Instruct is the oldest model in this batch, at 3 months old it's basically ancient in LLM terms. Still solid, though, scoring ~70% at ~4-bit, extremely close to the unquantized Llama 3.1 70B it's based on.
Not reflected in the test is how it feels in actual use - like no other model I know of, it reads more like a multiple-choice dialog than a normal chat. That may be a good or a bad thing, depending on your use case. For something like a customer support bot, this style may be a perfect fit.
As with DeepSeek-V3, I'm surprised (and even disappointed) that QVQ-72B-Preview didn't score much higher. QwQ 32B did much better, and even with 16K max tokens, QVQ 72B didn't gain anything from reasoning longer: only 70%, compared to QwQ 32B's 79% and Qwen2.5 72B's 78%.
But maybe that was to be expected, as QVQ is focused on VISUAL reasoning - which this benchmark doesn't measure. However, considering it's based on Qwen and how well both the QwQ 32B and Qwen2.5 72B models perform, I had hoped that QVQ, being both 72B and a reasoning model, would have had much more of an impact on its general performance.
So we'll have to keep waiting for a QwQ 72B to see if more parameters improve reasoning further - and by how much. But if you have a use case for visual reasoning, this is probably your best (and only) option among local models.
Falcon3 10B Instruct did surprisingly well, scoring 61%. Most small models don't even make it past the 50% threshold required to get onto the chart at all (like IBM Granite 8B, which I also tested but which didn't make the cut).
Falcon3 10B even surpasses Mistral Small, which at 22B is over twice its size. Definitely worth a look if you need something small but capable in English, French, Spanish, or Portuguese.
About the Benchmark
The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. While it is still a multiple-choice test, there are now 10 answer options per question instead of the 4 used in its predecessor MMLU, which drastically reduces the probability of getting answers right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
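To put the chance-guessing point in numbers, here's a quick back-of-the-envelope sketch (plain Python, not part of the benchmark code): with 10 options, a pure guesser lands around 10% instead of 25%, with a fairly tight spread over 410 questions:

```python
# Chance baseline for a pure guesser on the 410-question CS subset.
from math import sqrt

def guess_baseline(n_questions: int, n_options: int) -> tuple[float, float]:
    """Expected score (in %) and its standard deviation for pure random guessing."""
    p = 1 / n_options
    mean = n_questions * p
    std = sqrt(n_questions * p * (1 - p))  # binomial standard deviation
    return 100 * mean / n_questions, 100 * std / n_questions

print(guess_baseline(410, 4))   # old MMLU-style, 4 options: 25.0% +/- ~2.1%
print(guess_baseline(410, 10))  # MMLU-Pro, 10 options: 10.0% +/- ~1.5%
```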
For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, I place particular emphasis on this domain because it matches my usual work environment, where I use these models daily. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.
Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. This comprehensive approach delivers a more accurate and nuanced understanding of each model's true capabilities. By executing at least two benchmark runs per model, I establish a robust assessment of both performance levels and consistency. The results feature error bars that show standard deviation, illustrating how performance varies across different test runs.
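As an illustration of how those error bars come about, here's a minimal sketch that aggregates per-model run scores into mean and standard deviation (the sample scores are taken from the results table below; whether sample or population standard deviation is used is an implementation detail I don't spell out here):

```python
# Minimal sketch: turn per-model run scores into the mean and standard deviation
# that the chart's error bars represent. Scores below are example values from
# the results table.
from statistics import mean, stdev

runs = {
    "claude-3-5-sonnet-20241022": [82.93, 82.44],
    "DeepSeek-V3": [78.05, 78.05, 77.80, 77.80],
}

for model, scores in sorted(runs.items(), key=lambda kv: -mean(kv[1])):
    spread = stdev(scores) if len(scores) > 1 else 0.0  # sample standard deviation
    print(f"{model}: {mean(scores):.2f}% +/- {spread:.2f}% over {len(scores)} runs")
```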
The benchmarks for this study alone required over 88 hours of runtime. With additional categories or runs, the testing duration would have become so long with the available resources that the tested models would have been outdated by the time the study was completed. Therefore, establishing practical framework conditions and boundaries is essential to achieve meaningful results within a reasonable timeframe.
Detailed Results
Here's the complete table, including the previous results from the original report:
Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | TIGER-Lab | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/2 | 31m 50s | 340/410 | 82.93% | ~= 82.44% | 694458 | 362.78 | 97438 | 50.90 | |
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/2 | 31m 39s | 338/410 | 82.44% | == 82.44% | 694458 | 364.82 | 97314 | 51.12 | |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/2 | 31m 7s | 335/410 | 81.71% | > 71.22% | 648675 | 346.82 | 78311 | 41.87 | |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/2 | 30m 40s | 327/410 | 79.76% | > 71.22% | 648675 | 351.73 | 76063 | 41.24 | |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22 | |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | 656716 | 88.52 | 343440 | 46.29 | ||
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/2 | 2h 13m 5s | 326/410 | 79.51% | > 73.41% | 656716 | 82.21 | 142256 | 17.81 | |
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/2 | 2h 14m 53s | 317/410 | 77.32% | > 73.41% | 656716 | 81.11 | 143659 | 17.74 | |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | > 74.88% | 656716 | 58.21 | 139499 | 12.36 | |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | > 74.88% | 656716 | 59.04 | 138135 | 12.42 | |
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 1/4 | 20m 22s | 320/410 | 78.05% | 628029 | 512.38 | 66807 | 54.50 | ||
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 2/4 | 27m 43s | 320/410 | 78.05% | 628029 | 376.59 | 66874 | 40.10 | ||
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 3/4 | 19m 45s | 319/410 | 77.80% | 628029 | 528.39 | 64470 | 54.24 | ||
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 4/4 | 19m 45s | 319/410 | 77.80% | 628029 | 375.73 | 69531 | 41.60 | ||
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 1/2 | 1h 56m 8s | 319/410 | 77.80% | 0/1, 0.00% | 656716 | 94.20 | 374973 | 53.79 | |
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 2/2 | 1h 55m 44s | 318/410 | 77.56% | 656716 | 94.45 | 377638 | 54.31 | ||
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | ~= 78.29% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | ~< 78.29% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 1/4 | 1h 29m 49s | 324/410 | 79.02% | 0/1, 0.00% | 656716 | 121.70 | 229008 | 42.44 | |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 2/4 | 1h 32m 30s | 314/410 | 76.59% | 0/2, 0.00% | 656716 | 118.24 | 239161 | 43.06 | |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 3/4 | 2h 25m 24s | 308/410 | 75.12% | 0/2, 0.00% | 656716 | 75.23 | 232208 | 26.60 | |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 4/4 | 2h 27m 27s | 305/410 | 74.39% | 0/3, 0.00% | 656716 | 74.19 | 235650 | 26.62 | |
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 1/2 | 2h 10m 53s | 310/410 | 75.61% | 656716 | 83.59 | 412512 | 52.51 | ||
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 2/2 | 2h 25m 29s | 310/410 | 75.61% | 656716 | 75.20 | 478590 | 54.80 | ||
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | > 70.24% | 696798 | 287.13 | 79444 | 32.74 | |
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | > 70.24% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | 648580 | 86.11 | 79191 | 10.51 | ||
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | 648580 | 82.90 | 79648 | 10.18 | ||
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69 | |
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57 | |
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 1/4 | 1h 39m 49s | 308/410 | 75.12% | 0/1, 0.00% | 656716 | 109.59 | 243552 | 40.64 | |
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 2/4 | 1h 22m 12s | 304/410 | 74.15% | 656716 | 133.04 | 247314 | 50.10 | ||
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 3/4 | 1h 21m 39s | 296/410 | 72.20% | 656716 | 133.94 | 246020 | 50.18 | ||
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 4/4 | 1h 42m 33s | 294/410 | 71.71% | 656716 | 106.63 | 250222 | 40.63 | ||
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | < 78.29% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | < 78.29% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14 | |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84 | |
🆕 Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 1/2 | 2h 2m 33s | 293/410 | 71.46% | 648580 | 88.15 | 87107 | 11.84 | ||
🆕 Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 2/2 | 1h 33m 59s | 293/410 | 71.46% | 534360 | 94.70 | 89510 | 15.86 | ||
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | > 66.34% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | > 66.34% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
🆕 Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 1/2 | 2h 13m 3s | 290/410 | 70.73% | 640380 | 80.18 | 157235 | 19.69 | ||
🆕 Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 2/2 | 2h 13m 15s | 287/410 | 70.00% | 0/1, 0.00% | 640380 | 80.07 | 157471 | 19.69 | |
🆕 QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 1/2 | 3h 43m 12s | 290/410 | 70.73% | 1/3, 33.33% | 656716 | 49.02 | 441187 | 32.93 | |
🆕 QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 2/2 | 3h 47m 29s | 284/410 | 69.27% | 0/2, 0.00% | 656716 | 48.10 | 450363 | 32.99 | |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | > 63.41% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | > 63.41% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74 | |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | 656716 | 260.29 | 92126 | 36.51 | ||
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | 656716 | 317.48 | 89487 | 43.26 | ||
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | 656716 | 175.98 | 92767 | 24.86 | ||
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | 656716 | 312.66 | 91926 | 43.76 | ||
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 1/2 | 1h 15m 18s | 289/410 | 70.49% | 656716 | 145.23 | 269937 | 59.69 | ||
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 2/2 | 1h 19m 50s | 274/410 | 66.83% | 0/2, 0.00% | 656716 | 137.01 | 291818 | 60.88 | |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40 | |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42 | |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | < 70.24% | 696798 | 170.29 | 66670 | 16.29 | |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | < 70.24% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 1/2 | 1h 5m 30s | 268/410 | 65.37% | 1/3, 33.33% | 656716 | 166.95 | 205218 | 52.17 | |
QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 2/2 | 1h 8m 44s | 266/410 | 64.88% | 656716 | 159.10 | 215616 | 52.24 | ||
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35 | |
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40 | |
🆕 Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 1/2 | 35m 15s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 331.57 | 75501 | 35.63 | |
🆕 Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 2/2 | 35m 21s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 330.66 | 75501 | 35.53 | |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | > 53.66% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | > 53.66% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |
- Model: Model name (with relevant parameter and setting details)
- HF Main Model Name: Full name of the tested model as listed on Hugging Face
- HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
- Size: Parameter count
- Format: Model format type (HF, EXL2, etc.)
- API: Service provider (TabbyAPI indicates local deployment)
- GPU: Graphics card used for this benchmark run
- GPU Mem: VRAM allocated to model and configuration
- Run: Benchmark run sequence number
- Duration: Total runtime of benchmark
- Total: Number of correct answers (determines ranking!)
- %: Percentage of correct answers
- TIGER-Lab: Comparison between the CS benchmark results published by TIGER-Lab (the makers of MMLU-Pro) and mine
- Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it defaults to random guessing and reports both the number of these random guesses and their accuracy (a high proportion of random guessing indicates problems with following the response format - see the sketch after this list)
- Prompt tokens: Token count of input text
- tk/s: Tokens processed per second
- Completion tokens: Token count of generated response
- tk/s: Tokens generated per second
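To make the "Correct Random Guesses" column more concrete, here's a simplified, hedged sketch of the extraction-plus-fallback logic. It's a stand-in for what the MMLU-Pro evaluation script does, not a verbatim copy - the real script tries several regex patterns:

```python
# Simplified stand-in for MMLU-Pro's answer handling: try to extract the chosen
# option from the model's response; if that fails, fall back to a random guess
# and log it so the share (and accuracy) of guesses can be reported separately.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # 10 answer options per question in MMLU-Pro

def extract_answer(response: str) -> str | None:
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    return match.group(1).upper() if match else None

def score(response: str, correct: str, guess_log: list[bool]) -> bool:
    choice = extract_answer(response)
    if choice is None:                       # response format not followed
        choice = random.choice(OPTIONS)      # fall back to a random guess
        guess_log.append(choice == correct)  # track how those guesses turn out
    return choice == correct

guesses: list[bool] = []
print(score("...so the answer is (C).", "C", guesses), guesses)  # True, no guess logged
print(score("I think it's option C.", "C", guesses), guesses)    # extraction fails, guessed
```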
🆕 Update 2025-01-04: Additional Insights After Further Analysis
Inspired by Teortaxes's valuable feedback on X, I performed additional analyses that revealed fascinating insights:
A key discovery emerged when comparing DeepSeek-V3 and Qwen2.5-72B-Instruct: While both models achieved an identical average accuracy of 77.93%, their response patterns differed substantially. Despite matching overall performance, they gave different answers on 101 questions! Moreover, they shared 45 incorrect responses, on top of the errors each made individually.
The analysis of unsolved questions yielded equally interesting results: Among the top local models (Athene-V2-Chat, DeepSeek-V3, Qwen2.5-72B-Instruct, and QwQ-32B-Preview), only 30 out of 410 questions (7.32%) were answered incorrectly by all of them. When expanding the analysis to include Claude and GPT-4o, this number dropped to 23 questions (5.61%) that remained unsolved across all models.
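For anyone who wants to reproduce this kind of overlap analysis, the set arithmetic involved is straightforward. The `wrong` dictionary below is hypothetical example data standing in for the real run logs (counting the 101 differing answers additionally requires the full per-question answer lists, which aren't shown here):

```python
# Minimal sketch of the overlap analysis: given each model's set of incorrectly
# answered question IDs, compute shared errors and questions no model solved.
# The data below is hypothetical example input, not actual benchmark results.
wrong: dict[str, set[int]] = {
    "DeepSeek-V3": {3, 17, 42},
    "Qwen2.5-72B-Instruct": {3, 17, 99},
    "Athene-V2-Chat": {17, 42, 99},
    "QwQ-32B-Preview": {3, 17},
}

# Errors shared by two specific models (both answered these incorrectly):
shared = wrong["DeepSeek-V3"] & wrong["Qwen2.5-72B-Instruct"]
print(f"shared errors: {len(shared)}")

# Questions unsolved by every model in the comparison:
unsolved = set.intersection(*wrong.values())
print(f"unsolved by all {len(wrong)} models: {len(unsolved)} of 410")
```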
This shows that the MMLU-Pro CS benchmark doesn't have a soft ceiling at 78%. If there is a ceiling, it's more likely around 95%, confirming that this benchmark remains a robust and effective tool for evaluating LLMs now and in the foreseeable future.
🆕 Update 2025-01-05: Analyzed All Results For Unsolved Questions
After analyzing ALL results for unsolved questions across my tested models, only 10 out of 410 (2.44%) remained unsolved.
This demonstrates that the MMLU-Pro CS benchmark maintains a high ceiling and remains a valuable tool for evaluating advanced language models.
Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.