πŸΊπŸ¦β€β¬› LLM Comparison/Test: DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B, Llama 3.3 70B, Nemotron 70B in my updated MMLU-Pro CS benchmark

Community Article Published January 2, 2025


Introduction

New year, new benchmarks! Tested some new models (DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B) that came out after my latest report, and some "older" ones (Llama 3.3 70B Instruct, Llama 3.1 Nemotron 70B Instruct) that I had not tested yet.

All of this is an update to my original report from December 2024 where you'll find further details about all the other (25!) models I've tested and compared in this series of MMLU-Pro CS benchmarks: LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs

New Models Tested

  1. DeepSeek-V3 is THE new open-weights star, and it's a heavyweight at 671B, with 37B active parameters in its Mixture-of-Experts architecture. I tested it through the official DeepSeek API and it was quite fast (~50 tokens/s) and super cheap (66¢ for 4 runs at 1.4M tokens total).

    Surprisingly, though, it didn't become the #1 local model - at least not in my MMLU-Pro CS benchmark, where it "only" scored 78%, the same as the much smaller Qwen2.5 72B and less than the even smaller QwQ 32B Preview! But it's still a great score and beats GPT-4o, Mistral Large, Llama 3.1 405B and most other models.

    Plus, there are a lot of positive reports about this model - so definitely take a closer look at it (if you can run it, locally or through the API) and test it with your own use cases. This advice generally applies to all models and benchmarks!

    That said, personally, I'm still on the fence as I've experienced some repetition issues that remind me of the old days of local LLMs. There could be various explanations for this, though, so I'll keep investigating and testing it further as it certainly is a milestone for open LLMs.

  2. Llama 3.3 70B Instruct, the latest iteration of Meta's Llama series, focuses on multilinguality, so its general performance doesn't differ much from its predecessors'. Still, even quantized down to just 4-bit, it scored ~71%, which is a little bit better than the unquantized (!) Llama 3.1 70B Instruct and almost on par with gpt-4o-2024-11-20!

    Not much else to say here; Llama has been somewhat overshadowed by the other models, especially those from China. So I'm looking forward to what Llama 4 will bring - hopefully soon.

  3. Llama 3.1 Nemotron 70B Instruct is the oldest model in this batch - at 3 months old, it's basically ancient in LLM terms. Still solid, though, scoring ~70% at ~4-bit, extremely close to the unquantized Llama 3.1 70B it's based on.

    Not reflected in the test is how it feels when using it - like no other model I know of, it feels more like a multiple-choice dialog than a normal chat. Which may be a good or bad thing, depending on your use case. For something like a customer support bot, this style may be a perfect fit.

  4. Like with DeepSeek-V3, I'm surprised (and even disappointed) that QVQ-72B-Preview didn't score much higher. QwQ 32B did so much better, and even with 16K max tokens, QVQ 72B didn't improve by reasoning more: only 70%, compared to QwQ 32B's 79% and Qwen2.5 72B's 78%.

    But maybe that was to be expected, as QVQ is focused on VISUAL reasoning - which this benchmark doesn't measure. However, considering it's based on Qwen and how great both the QwQ 32B and Qwen 72B models perform, I had hoped QVQ being both 72B and reasoning would have had much more of an impact on its general performance.

    So we'll have to keep waiting for a QwQ 72B to see if more parameters improve reasoning further - and by how much. But if you have a use case for visual reasoning, this is probably your best (and only) option among local models.

  5. Falcon3 10B Instruct did surprisingly well, scoring 61%. Most small models don't even make it past the 50% threshold required to get onto the chart at all (like IBM Granite 8B, which I also tested but which didn't make the cut).

    Falcon3 10B even surpasses Mistral Small, which at 22B is more than twice its size. Definitely worth a look if you need something small but capable in English, French, Spanish or Portuguese.

About the Benchmark

The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. While it is still a multiple-choice test, there are now 10 answer options per question instead of the 4 in its predecessor MMLU, which drastically reduces the probability of getting answers right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
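
To make the effect of 10 instead of 4 options concrete, here's a minimal illustrative sketch (my own back-of-the-envelope Python, not part of the benchmark harness) of the score a pure random guesser would be expected to get on 410 questions:

```python
# Expected accuracy from pure random guessing: why 10 options per question
# (MMLU-Pro) are much harder to "luck into" than 4 (original MMLU).
# Illustrative sketch only, not part of the benchmark code.
from math import sqrt

def random_guess_baseline(num_options: int, num_questions: int = 410) -> tuple[float, float]:
    """Return expected accuracy and its standard deviation, both in percent."""
    p = 1 / num_options
    expected = 100 * p
    std_dev = 100 * sqrt(p * (1 - p) / num_questions)
    return expected, std_dev

for options in (4, 10):
    acc, sd = random_guess_baseline(options)
    print(f"{options} options: {acc:.1f}% ± {sd:.1f}% expected from guessing")
```

With 4 options a guesser lands around 25%, with 10 options only around 10%, so high scores are much harder to reach by luck alone.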

For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, computer science is my usual work environment, so I place particular emphasis on how models respond in the context I use them in daily. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.

Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. This comprehensive approach delivers a more accurate and nuanced understanding of each model's true capabilities. By executing at least two benchmark runs per model, I establish a robust assessment of both performance levels and consistency. The results feature error bars that show standard deviation, illustrating how performance varies across different test runs.
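
As a concrete example of how the reported numbers and error bars come together, here's a minimal sketch using Python's statistics module; the four example scores are the DeepSeek-V3 runs from the table below:

```python
# Aggregate multiple benchmark runs into the mean and standard deviation
# that the error bars visualize. Example scores: the four DeepSeek-V3 runs
# from the results table below (percent correct).
from statistics import mean, stdev

runs = [78.05, 78.05, 77.80, 77.80]
print(f"mean = {mean(runs):.2f}%, std dev = {stdev(runs):.2f}% across {len(runs)} runs")
```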

The benchmarks for this study alone required over 70 hours of runtime. With additional categories or runs, the testing duration would have become so long with the available resources that the tested models would have been outdated by the time the study was completed. Therefore, establishing practical framework conditions and boundaries is essential to achieve meaningful results within a reasonable timeframe.

Detailed Results

Here's the complete table, including the previous results from the original report:

| Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | TIGER-Lab | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/2 | 31m 50s | 340/410 | 82.93% | ~= 82.44% | | 694458 | 362.78 | 97438 | 50.90 |
| claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/2 | 31m 39s | 338/410 | 82.44% | == 82.44% | | 694458 | 364.82 | 97314 | 51.12 |
| gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/2 | 31m 7s | 335/410 | 81.71% | > 71.22% | | 648675 | 346.82 | 78311 | 41.87 |
| gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/2 | 30m 40s | 327/410 | 79.76% | > 71.22% | | 648675 | 351.73 | 76063 | 41.24 |
| QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22 |
| QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | | | 656716 | 88.52 | 343440 | 46.29 |
| Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/2 | 2h 13m 5s | 326/410 | 79.51% | > 73.41% | | 656716 | 82.21 | 142256 | 17.81 |
| Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/2 | 2h 14m 53s | 317/410 | 77.32% | > 73.41% | | 656716 | 81.11 | 143659 | 17.74 |
| Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | > 74.88% | | 656716 | 58.21 | 139499 | 12.36 |
| Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | > 74.88% | | 656716 | 59.04 | 138135 | 12.42 |
| 🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 1/4 | 20m 22s | 320/410 | 78.05% | | | 628029 | 512.38 | 66807 | 54.50 |
| 🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 2/4 | 27m 43s | 320/410 | 78.05% | | | 628029 | 376.59 | 66874 | 40.10 |
| 🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 3/4 | 19m 45s | 319/410 | 77.80% | | | 628029 | 528.39 | 64470 | 54.24 |
| 🆕 DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 4/4 | 19m 45s | 319/410 | 77.80% | | | 628029 | 375.73 | 69531 | 41.60 |
| QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 1/2 | 1h 56m 8s | 319/410 | 77.80% | | 0/1, 0.00% | 656716 | 94.20 | 374973 | 53.79 |
| QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 2/2 | 1h 55m 44s | 318/410 | 77.56% | | | 656716 | 94.45 | 377638 | 54.31 |
| gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | ~= 78.29% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
| gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | ~< 78.29% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
| QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 1/4 | 1h 29m 49s | 324/410 | 79.02% | | 0/1, 0.00% | 656716 | 121.70 | 229008 | 42.44 |
| QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 2/4 | 1h 32m 30s | 314/410 | 76.59% | | 0/2, 0.00% | 656716 | 118.24 | 239161 | 43.06 |
| QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 3/4 | 2h 25m 24s | 308/410 | 75.12% | | 0/2, 0.00% | 656716 | 75.23 | 232208 | 26.60 |
| QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 4/4 | 2h 27m 27s | 305/410 | 74.39% | | 0/3, 0.00% | 656716 | 74.19 | 235650 | 26.62 |
| QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 1/2 | 2h 10m 53s | 310/410 | 75.61% | | | 656716 | 83.59 | 412512 | 52.51 |
| QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 2/2 | 2h 25m 29s | 310/410 | 75.61% | | | 656716 | 75.20 | 478590 | 54.80 |
| mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | > 70.24% | | 696798 | 287.13 | 79444 | 32.74 |
| mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | > 70.24% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
| Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | | | 648580 | 86.11 | 79191 | 10.51 |
| Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | | | 648580 | 82.90 | 79648 | 10.18 |
| mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69 |
| mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57 |
| QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 1/4 | 1h 39m 49s | 308/410 | 75.12% | | 0/1, 0.00% | 656716 | 109.59 | 243552 | 40.64 |
| QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 2/4 | 1h 22m 12s | 304/410 | 74.15% | | | 656716 | 133.04 | 247314 | 50.10 |
| QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 3/4 | 1h 21m 39s | 296/410 | 72.20% | | | 656716 | 133.94 | 246020 | 50.18 |
| QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 4/4 | 1h 42m 33s | 294/410 | 71.71% | | | 656716 | 106.63 | 250222 | 40.63 |
| chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | < 78.29% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
| chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | < 78.29% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
| gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14 |
| gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84 |
| 🆕 Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 1/2 | 2h 2m 33s | 293/410 | 71.46% | | | 648580 | 88.15 | 87107 | 11.84 |
| 🆕 Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 2/2 | 1h 33m 59s | 293/410 | 71.46% | | | 534360 | 94.70 | 89510 | 15.86 |
| Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | > 66.34% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
| Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | > 66.34% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
| 🆕 Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 1/2 | 2h 13m 3s | 290/410 | 70.73% | | | 640380 | 80.18 | 157235 | 19.69 |
| 🆕 Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 2/2 | 2h 13m 15s | 287/410 | 70.00% | | 0/1, 0.00% | 640380 | 80.07 | 157471 | 19.69 |
| 🆕 QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 1/2 | 3h 43m 12s | 290/410 | 70.73% | | 1/3, 33.33% | 656716 | 49.02 | 441187 | 32.93 |
| 🆕 QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 2/2 | 3h 47m 29s | 284/410 | 69.27% | | 0/2, 0.00% | 656716 | 48.10 | 450363 | 32.99 |
| gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | > 63.41% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
| gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | > 63.41% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
| Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74 |
| Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | | | 656716 | 260.29 | 92126 | 36.51 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | | | 656716 | 317.48 | 89487 | 43.26 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | | | 656716 | 175.98 | 92767 | 24.86 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | | | 656716 | 312.66 | 91926 | 43.76 |
| QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 1/2 | 1h 15m 18s | 289/410 | 70.49% | | | 656716 | 145.23 | 269937 | 59.69 |
| QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 2/2 | 1h 19m 50s | 274/410 | 66.83% | | 0/2, 0.00% | 656716 | 137.01 | 291818 | 60.88 |
| Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40 |
| Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42 |
| Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | < 70.24% | | 696798 | 170.29 | 66670 | 16.29 |
| Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | < 70.24% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
| QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 1/2 | 1h 5m 30s | 268/410 | 65.37% | | 1/3, 33.33% | 656716 | 166.95 | 205218 | 52.17 |
| QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 2/2 | 1h 8m 44s | 266/410 | 64.88% | | | 656716 | 159.10 | 215616 | 52.24 |
| Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35 |
| Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40 |
| 🆕 Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 1/2 | 35m 15s | 251/410 | 61.22% | | 2/5, 40.00% | 702578 | 331.57 | 75501 | 35.63 |
| 🆕 Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 2/2 | 35m 21s | 251/410 | 61.22% | | 2/5, 40.00% | 702578 | 330.66 | 75501 | 35.53 |
| mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | > 53.66% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
| mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | > 53.66% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |

  • Model: Model name (with relevant parameter and setting details)
  • HF Main Model Name: Full name of the tested model as listed on Hugging Face
  • HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
  • Size: Parameter count
  • Format: Model format type (HF, EXL2, etc.)
  • API: Service provider (TabbyAPI indicates local deployment)
  • GPU: Graphics card used for this benchmark run
  • GPU Mem: VRAM allocated to model and configuration
  • Run: Benchmark run sequence number
  • Duration: Total runtime of benchmark
  • Total: Number of correct answers (determines ranking!)
  • %: Percentage of correct answers
  • TIGER-Lab: Comparison between TIGER-Lab (the makers of MMLU-Pro)'s CS benchmark results and mine
  • Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it defaults to random guessing and reports both the number of these random guesses and their accuracy - a high proportion of random guessing indicates problems with following the response format (a simplified sketch of this fallback follows after this list)
  • Prompt tokens: Token count of input text
  • tk/s: Tokens processed per second
  • Completion tokens: Token count of generated response
  • tk/s: Tokens generated per second
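
To illustrate the random-guess fallback mentioned above, here's a minimal sketch in Python. The regular expression and option handling are simplifications for illustration, not the exact MMLU-Pro evaluation code:

```python
import random
import re

OPTIONS = "ABCDEFGHIJ"  # 10 answer options per MMLU-Pro question

def extract_or_guess(response: str) -> tuple[str, bool]:
    """Return (chosen_option, was_guessed) for one model response.

    Simplified illustration of the fallback behind the 'Correct Random
    Guesses' column: if no answer letter can be parsed from the response,
    a random option is chosen instead.
    """
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    if match:
        return match.group(1).upper(), False
    # No parsable answer: fall back to a random guess,
    # which is what the "Correct Random Guesses" column counts.
    return random.choice(OPTIONS), True
```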

🆕 Update 2025-01-04: Additional Insights After Further Analysis

Inspired by Teortaxes's valuable feedback on X, I performed additional analyses that revealed fascinating insights:

A key discovery emerged when comparing DeepSeek-V3 and Qwen2.5-72B-Instruct: While both models achieved identical accuracy scores of 77.93%, their response patterns differed substantially. Despite matching overall performance, they provided different answers on 101 questions! Moreover, they shared 45 incorrect responses, in addition to their individual errors.

The analysis of unanswered questions yielded equally interesting results: Among the top local models (Athene-V2-Chat, DeepSeek-V3, Qwen2.5-72B-Instruct, and QwQ-32B-Preview), only 30 out of 410 questions (7.32%) received incorrect answers from all models. When expanding the analysis to include Claude and GPT-4, this number dropped to 23 questions (5.61%) that remained unsolved across all models.

This proves that the MMLU-Pro CS benchmark doesn't have a soft ceiling at 78%. If there is one, it's more likely around 95%, confirming that this benchmark remains a robust and effective tool for evaluating LLMs now and in the foreseeable future.
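
For anyone who wants to reproduce this kind of analysis on their own benchmark logs, here's a minimal sketch. It assumes the per-question answers of each run are already loaded into dicts mapping question IDs to the chosen option (plus a dict of correct options); this data layout is a hypothetical simplification, not the actual MMLU-Pro output format:

```python
# Sketch of the overlap analysis: compare two models' per-question answers
# and find questions that no model in a group solved. The dict-based data
# layout (question_id -> option) is an assumed simplification.

def compare_models(answers_a: dict, answers_b: dict, correct: dict) -> None:
    """Report disagreements and shared errors between two models."""
    common = answers_a.keys() & answers_b.keys()
    disagreements = [q for q in common if answers_a[q] != answers_b[q]]
    shared_errors = [q for q in common
                     if answers_a[q] != correct[q] and answers_b[q] != correct[q]]
    print(f"different answers on {len(disagreements)} questions, "
          f"{len(shared_errors)} questions both got wrong")

def unsolved_by_all(model_answers: list[dict], correct: dict) -> set:
    """Questions that every model in the list answered incorrectly."""
    unsolved = set(correct)
    for answers in model_answers:
        unsolved &= {q for q in correct if answers.get(q) != correct[q]}
    return unsolved
```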



🆕 Update 2025-01-05: Analyzed All Results For Unsolved Questions

After analyzing ALL results for unsolved questions across my tested models, only 10 out of 410 (2.44%) remained unsolved.

This demonstrates that the MMLU-Pro CS benchmark maintains a high ceiling and remains a valuable tool for evaluating advanced language models.


Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.