🐺🐦‍⬛ LLM Comparison/Test: DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B, Llama 3.3 70B, Nemotron 70B in my updated MMLU-Pro CS benchmark
- 🆕 Update 2025-01-04: Additional Insights After Further Analysis
- 🆕 Update 2025-01-05: Analyzed All Results For Unsolved Questions
Introduction
New year, new benchmarks! Tested some new models (DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B) that came out after my latest report, and some "older" ones (Llama 3.3 70B Instruct, Llama 3.1 Nemotron 70B Instruct) that I had not tested yet.
All of this is an update to my original report from December 2024 where you'll find further details about all the other (25!) models I've tested and compared in this series of MMLU-Pro CS benchmarks: LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs
New Models Tested
DeepSeek-V3 is THE new open-weights star, and it's a heavyweight at 671B, with 37B active parameters in its Mixture-of-Experts architecture. I tested it through the official DeepSeek API and it was quite fast (~50 tokens/s) and super cheap (66¢ for 4 runs at 1.4M tokens total).
Surprisingly, though, it didn't become the #1 local model - at least not in my MMLU-Pro CS benchmark, where it "only" scored 78%, the same as the much smaller Qwen2.5 72B and less than the even smaller QwQ 32B Preview! But it's still a great score and beats GPT-4o, Mistral Large, Llama 3.1 405B and most other models.
Plus, there are a lot of positive reports about this model - so definitely take a closer look at it (if you can run it, locally or through the API) and test it with your own use cases. This advice generally applies to all models and benchmarks!
That said, personally, I'm still on the fence as I've experienced some repetition issues that remind me of the old days of local LLMs. There could be various explanations for this, though, so I'll keep investigating and testing it further as it certainly is a milestone for open LLMs.
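For reference, here's a minimal sketch of how an API-based benchmark run can query DeepSeek-V3. It assumes the OpenAI-compatible endpoint and the `deepseek-chat` model name from DeepSeek's public documentation; the prompt is just a placeholder, not the exact MMLU-Pro template:

```python
# Minimal sketch: query DeepSeek-V3 through its OpenAI-compatible API.
# Base URL and model name are taken from DeepSeek's public docs (assumption);
# DEEPSEEK_API_KEY must be set in the environment.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek-V3 behind the chat endpoint
    messages=[
        {"role": "system", "content": "Think step by step, then finish with 'The answer is (X)'."},
        {"role": "user", "content": "Placeholder MMLU-Pro CS question with options (A) through (J)."},
    ],
    temperature=0.0,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```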
Llama 3.3 70B Instruct, the latest iteration of Meta's Llama series, focused on multilinguality, so its general performance doesn't differ much from its predecessors. Still, even quantized down to just 4-bit, it scored ~71%, which is a little better than the unquantized (!) Llama 3.1 70B Instruct and almost on par with gpt-4o-2024-11-20!
Not much else to say here; Llama has been somewhat overshadowed by the other models, especially those from China. So I'm looking forward to what Llama 4 will bring, hopefully soon.
Llama 3.1 Nemotron 70B Instruct is the oldest model in this batch, at 3 months old it's basically ancient in LLM terms. Still solid, though, scoring ~70% at ~4-bit, extremely close to the unquantized Llama 3.1 70B it's based on.
Not reflected in the test is how it feels in actual use - like no other model I know of, it reads more like a multiple-choice dialog than a normal chat. That may be a good or a bad thing, depending on your use case. For something like a customer support bot, this style may be a perfect fit.
As with DeepSeek-V3, I'm surprised (and even disappointed) that QVQ-72B-Preview didn't score much higher. QwQ 32B did much better, and even with 16K max tokens, QVQ 72B didn't gain anything from reasoning longer: only 70%, compared to QwQ 32B's 79% and Qwen2.5 72B's 78%.
But maybe that was to be expected, as QVQ is focused on VISUAL reasoning - which this benchmark doesn't measure. However, considering it's based on Qwen and how well both the QwQ 32B and Qwen2.5 72B models perform, I had hoped that QVQ, being both 72B and a reasoning model, would have had much more of an impact on its general performance.
So we'll have to keep waiting for a QwQ 72B to see if more parameters improve reasoning further - and by how much. But if you have a use case for visual reasoning, this is probably your best (and only) option among local models.
Falcon3 10B Instruct did surprisingly well, scoring 61%. Most small models don't even make it past the 50% threshold required to get onto the chart at all (like IBM Granite 8B, which I also tested but which didn't make the cut).
Falcon3 10B even surpasses Mistral Small, which at 22B is over twice its size. Definitely worth a look if you need something small but capable in English, French, Spanish, or Portuguese.
About the Benchmark
The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. While it is still a multiple-choice test, there are now 10 answer options per question instead of the 4 used in its predecessor MMLU, which drastically reduces the probability of getting answers right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
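To put the chance-guessing point in numbers, here's a quick back-of-the-envelope sketch (plain Python, not part of the benchmark code): with 10 options, a pure guesser lands around 10% instead of 25%, with a fairly tight spread over 410 questions:

```python
# Chance baseline for a pure guesser on the 410-question CS subset.
from math import sqrt

def guess_baseline(n_questions: int, n_options: int) -> tuple[float, float]:
    """Expected score (in %) and its standard deviation for pure random guessing."""
    p = 1 / n_options
    mean = n_questions * p
    std = sqrt(n_questions * p * (1 - p))  # binomial standard deviation
    return 100 * mean / n_questions, 100 * std / n_questions

print(guess_baseline(410, 4))   # old MMLU-style, 4 options: 25.0% +/- ~2.1%
print(guess_baseline(410, 10))  # MMLU-Pro, 10 options: 10.0% +/- ~1.5%
```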
For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, I place particular emphasis on this domain because it matches my usual work environment, where I use these models daily. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.
Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. This comprehensive approach delivers a more accurate and nuanced understanding of each model's true capabilities. By executing at least two benchmark runs per model, I establish a robust assessment of both performance levels and consistency. The results feature error bars that show standard deviation, illustrating how performance varies across different test runs.
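As an illustration of how those error bars come about, here's a minimal sketch that aggregates per-model run scores into mean and standard deviation (the sample scores are taken from the results table below; whether sample or population standard deviation is used is an implementation detail I don't spell out here):

```python
# Minimal sketch: turn per-model run scores into the mean and standard deviation
# that the chart's error bars represent. Scores below are example values from
# the results table.
from statistics import mean, stdev

runs = {
    "claude-3-5-sonnet-20241022": [82.93, 82.44],
    "DeepSeek-V3": [78.05, 78.05, 77.80, 77.80],
}

for model, scores in sorted(runs.items(), key=lambda kv: -mean(kv[1])):
    spread = stdev(scores) if len(scores) > 1 else 0.0  # sample standard deviation
    print(f"{model}: {mean(scores):.2f}% +/- {spread:.2f}% over {len(scores)} runs")
```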
The benchmarks for this study alone required over 88 hours of runtime. With additional categories or runs, the testing duration would have become so long with the available resources that the tested models would have been outdated by the time the study was completed. Therefore, establishing practical framework conditions and boundaries is essential to achieve meaningful results within a reasonable timeframe.
Detailed Results
Here's the complete table, including the previous results from the original report:
Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | TIGER-Lab | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/2 | 31m 50s | 340/410 | 82.93% | ~= 82.44% | 694458 | 362.78 | 97438 | 50.90 | |
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/2 | 31m 39s | 338/410 | 82.44% | == 82.44% | 694458 | 364.82 | 97314 | 51.12 | |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/2 | 31m 7s | 335/410 | 81.71% | > 71.22% | 648675 | 346.82 | 78311 | 41.87 | |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/2 | 30m 40s | 327/410 | 79.76% | > 71.22% | 648675 | 351.73 | 76063 | 41.24 | |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22 | |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | 656716 | 88.52 | 343440 | 46.29 | ||
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/2 | 2h 13m 5s | 326/410 | 79.51% | > 73.41% | 656716 | 82.21 | 142256 | 17.81 | |
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/2 | 2h 14m 53s | 317/410 | 77.32% | > 73.41% | 656716 | 81.11 | 143659 | 17.74 | |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | > 74.88% | 656716 | 58.21 | 139499 | 12.36 | |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | > 74.88% | 656716 | 59.04 | 138135 | 12.42 | |
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 1/4 | 20m 22s | 320/410 | 78.05% | 628029 | 512.38 | 66807 | 54.50 | ||
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 2/4 | 27m 43s | 320/410 | 78.05% | 628029 | 376.59 | 66874 | 40.10 | ||
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 3/4 | 19m 45s | 319/410 | 77.80% | 628029 | 528.39 | 64470 | 54.24 | ||
🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 4/4 | 19m 45s | 319/410 | 77.80% | 628029 | 375.73 | 69531 | 41.60 | ||
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 1/2 | 1h 56m 8s | 319/410 | 77.80% | 0/1, 0.00% | 656716 | 94.20 | 374973 | 53.79 | |
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 2/2 | 1h 55m 44s | 318/410 | 77.56% | 656716 | 94.45 | 377638 | 54.31 | ||
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | ~= 78.29% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | ~< 78.29% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 1/4 | 1h 29m 49s | 324/410 | 79.02% | 0/1, 0.00% | 656716 | 121.70 | 229008 | 42.44 | |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 2/4 | 1h 32m 30s | 314/410 | 76.59% | 0/2, 0.00% | 656716 | 118.24 | 239161 | 43.06 | |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 3/4 | 2h 25m 24s | 308/410 | 75.12% | 0/2, 0.00% | 656716 | 75.23 | 232208 | 26.60 | |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 4/4 | 2h 27m 27s | 305/410 | 74.39% | 0/3, 0.00% | 656716 | 74.19 | 235650 | 26.62 | |
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 1/2 | 2h 10m 53s | 310/410 | 75.61% | 656716 | 83.59 | 412512 | 52.51 | ||
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 2/2 | 2h 25m 29s | 310/410 | 75.61% | 656716 | 75.20 | 478590 | 54.80 | ||
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | > 70.24% | 696798 | 287.13 | 79444 | 32.74 | |
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | > 70.24% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | 648580 | 86.11 | 79191 | 10.51 | ||
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | 648580 | 82.90 | 79648 | 10.18 | ||
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69 | |
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57 | |
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 1/4 | 1h 39m 49s | 308/410 | 75.12% | 0/1, 0.00% | 656716 | 109.59 | 243552 | 40.64 | |
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 2/4 | 1h 22m 12s | 304/410 | 74.15% | 656716 | 133.04 | 247314 | 50.10 | ||
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 3/4 | 1h 21m 39s | 296/410 | 72.20% | 656716 | 133.94 | 246020 | 50.18 | ||
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 4/4 | 1h 42m 33s | 294/410 | 71.71% | 656716 | 106.63 | 250222 | 40.63 | ||
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | < 78.29% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | < 78.29% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14 | |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84 | |
🆕 Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 1/2 | 2h 2m 33s | 293/410 | 71.46% | 648580 | 88.15 | 87107 | 11.84 | ||
🆕 Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 2/2 | 1h 33m 59s | 293/410 | 71.46% | 534360 | 94.70 | 89510 | 15.86 | ||
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | > 66.34% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | > 66.34% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
🆕 Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 1/2 | 2h 13m 3s | 290/410 | 70.73% | 640380 | 80.18 | 157235 | 19.69 | ||
🆕 Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 2/2 | 2h 13m 15s | 287/410 | 70.00% | 0/1, 0.00% | 640380 | 80.07 | 157471 | 19.69 | |
🆕 QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 1/2 | 3h 43m 12s | 290/410 | 70.73% | 1/3, 33.33% | 656716 | 49.02 | 441187 | 32.93 | |
🆕 QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 2/2 | 3h 47m 29s | 284/410 | 69.27% | 0/2, 0.00% | 656716 | 48.10 | 450363 | 32.99 | |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | > 63.41% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | > 63.41% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74 | |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | 656716 | 260.29 | 92126 | 36.51 | ||
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | 656716 | 317.48 | 89487 | 43.26 | ||
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03 | |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | 656716 | 175.98 | 92767 | 24.86 | ||
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | 656716 | 312.66 | 91926 | 43.76 | ||
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 1/2 | 1h 15m 18s | 289/410 | 70.49% | 656716 | 145.23 | 269937 | 59.69 | ||
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 2/2 | 1h 19m 50s | 274/410 | 66.83% | 0/2, 0.00% | 656716 | 137.01 | 291818 | 60.88 | |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40 | |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42 | |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | < 70.24% | 696798 | 170.29 | 66670 | 16.29 | |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | < 70.24% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 1/2 | 1h 5m 30s | 268/410 | 65.37% | 1/3, 33.33% | 656716 | 166.95 | 205218 | 52.17 | |
QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 2/2 | 1h 8m 44s | 266/410 | 64.88% | 656716 | 159.10 | 215616 | 52.24 | ||
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35 | |
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40 | |
🆕 Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 1/2 | 35m 15s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 331.57 | 75501 | 35.63 | |
🆕 Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 2/2 | 35m 21s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 330.66 | 75501 | 35.53 | |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | > 53.66% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | > 53.66% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |
- Model: Model name (with relevant parameter and setting details)
- HF Main Model Name: Full name of the tested model as listed on Hugging Face
- HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
- Size: Parameter count
- Format: Model format type (HF, EXL2, etc.)
- API: Service provider (TabbyAPI indicates local deployment)
- GPU: Graphics card used for this benchmark run
- GPU Mem: VRAM allocated to model and configuration
- Run: Benchmark run sequence number
- Duration: Total runtime of benchmark
- Total: Number of correct answers (determines ranking!)
- %: Percentage of correct answers
- TIGER-Lab: Comparison between the CS benchmark results published by TIGER-Lab (the makers of MMLU-Pro) and mine
- Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it defaults to random guessing and reports both the number of these random guesses and their accuracy (a high proportion of random guessing indicates problems with following the response format - see the sketch after this list)
- Prompt tokens: Token count of input text
- tk/s: Tokens processed per second
- Completion tokens: Token count of generated response
- tk/s: Tokens generated per second
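To make the "Correct Random Guesses" column more concrete, here's a simplified, hedged sketch of the extraction-plus-fallback logic. It's a stand-in for what the MMLU-Pro evaluation script does, not a verbatim copy - the real script tries several regex patterns:

```python
# Simplified stand-in for MMLU-Pro's answer handling: try to extract the chosen
# option from the model's response; if that fails, fall back to a random guess
# and log it so the share (and accuracy) of guesses can be reported separately.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # 10 answer options per question in MMLU-Pro

def extract_answer(response: str) -> str | None:
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    return match.group(1).upper() if match else None

def score(response: str, correct: str, guess_log: list[bool]) -> bool:
    choice = extract_answer(response)
    if choice is None:                       # response format not followed
        choice = random.choice(OPTIONS)      # fall back to a random guess
        guess_log.append(choice == correct)  # track how those guesses turn out
    return choice == correct

guesses: list[bool] = []
print(score("...so the answer is (C).", "C", guesses), guesses)  # True, no guess logged
print(score("I think it's option C.", "C", guesses), guesses)    # extraction fails, guessed
```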
🆕 Update 2025-01-04: Additional Insights After Further Analysis
Inspired by Teortaxes's valuable feedback on X, I performed additional analyses that revealed fascinating insights:
A key discovery emerged when comparing DeepSeek-V3 and Qwen2.5-72B-Instruct: While both models achieved an identical average accuracy of 77.93%, their response patterns differed substantially. Despite matching overall performance, they gave different answers on 101 questions! Moreover, they shared 45 incorrect responses, on top of the errors each made individually.
The analysis of unsolved questions yielded equally interesting results: Among the top local models (Athene-V2-Chat, DeepSeek-V3, Qwen2.5-72B-Instruct, and QwQ-32B-Preview), only 30 out of 410 questions (7.32%) were answered incorrectly by all of them. When expanding the analysis to include Claude and GPT-4o, this number dropped to 23 questions (5.61%) that remained unsolved across all models.
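For anyone who wants to reproduce this kind of overlap analysis, the set arithmetic involved is straightforward. The `wrong` dictionary below is hypothetical example data standing in for the real run logs (counting the 101 differing answers additionally requires the full per-question answer lists, which aren't shown here):

```python
# Minimal sketch of the overlap analysis: given each model's set of incorrectly
# answered question IDs, compute shared errors and questions no model solved.
# The data below is hypothetical example input, not actual benchmark results.
wrong: dict[str, set[int]] = {
    "DeepSeek-V3": {3, 17, 42},
    "Qwen2.5-72B-Instruct": {3, 17, 99},
    "Athene-V2-Chat": {17, 42, 99},
    "QwQ-32B-Preview": {3, 17},
}

# Errors shared by two specific models (both answered these incorrectly):
shared = wrong["DeepSeek-V3"] & wrong["Qwen2.5-72B-Instruct"]
print(f"shared errors: {len(shared)}")

# Questions unsolved by every model in the comparison:
unsolved = set.intersection(*wrong.values())
print(f"unsolved by all {len(wrong)} models: {len(unsolved)} of 410")
```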
This shows that the MMLU-Pro CS benchmark doesn't have a soft ceiling at 78%. If there is a ceiling, it's more likely around 95%, confirming that this benchmark remains a robust and effective tool for evaluating LLMs now and in the foreseeable future.
🆕 Update 2025-01-05: Analyzed All Results For Unsolved Questions
After analyzing ALL results for unsolved questions across my tested models, only 10 out of 410 (2.44%) remained unsolved.
This demonstrates that the MMLU-Pro CS benchmark maintains a high ceiling and remains a valuable tool for evaluating advanced language models.
Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.