Brinebreath-Llama-3.1-70B
I made this merge after running into some problems with Cathallama. It has behaved well over a few days of testing.
Notable Performance
- 7 percentage point higher overall success rate on MMLU-PRO than Llama 3.1 70B Instruct at Q4_0 (49.0% vs. 42.0%)
- Higher score than the base model in all 14 MMLU-PRO categories
- Passed 11 of 14 manual test cases, versus 7 of 14 for the base model
Creation workflow
Models merged
- meta-llama/Meta-Llama-3.1-70B-Instruct
- NousResearch/Hermes-3-Llama-3.1-70B
- abacusai/Dracarys-Llama-3.1-70B-Instruct
- VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
flowchart TD
A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
B -->| | E[Merge]
D -->| | E[Merge]
G[SauerkrautLM] -->|Merge with| E[Merge]
E[Merge] -->| | F[Brinebreath]
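As a rough sketch, the flowchart above can be written down as a dependency graph and sorted into a valid merge order. This is only my reading of the diagram: the intermediate names `hermes_llama_merge` and `dracarys_llama_merge` are hypothetical labels, and the merge method and weights are not stated here, so the code models the ordering only.

```python
from graphlib import TopologicalSorter

# Each key is a merge result; the value is the set of inputs it is built from.
# Intermediate names are hypothetical; only the ordering comes from the flowchart.
merge_inputs = {
    "hermes_llama_merge": {"Hermes 3", "Meta-Llama-3.1"},
    "dracarys_llama_merge": {"Dracarys", "Meta-Llama-3.1"},
    "Merge": {"hermes_llama_merge", "dracarys_llama_merge", "SauerkrautLM"},
    "Brinebreath": {"Merge"},
}

# static_order() yields every node after all of its inputs.
build_order = list(TopologicalSorter(merge_inputs).static_order())

# Keep only the nodes that are merge results (drop the source models).
merge_steps = [name for name in build_order if name in merge_inputs]

for step in merge_steps:
    print(f"{step} <- {sorted(merge_inputs[step])}")
```

The two first-stage merges have no dependency on each other, so they can be done in either order before the final merge.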
Testing
Hyperparameters
- Temperature: 0.0 for automated tests, 0.9 for manual tests
- Penalize repeat sequence: 1.05
- Consider N tokens for penalize: 256
- Penalize repetition of newlines
- Top-K sampling: 40
- Top-P sampling: 0.95
- Min-P sampling: 0.05
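For anyone who wants to reproduce these settings programmatically, here is a minimal sketch of how they might map to a llama.cpp server `/completion` request body. The field names are my assumption about the server API in builds from around this version (`penalize_nl` in particular may not exist in newer builds), and `sampling_params` is a hypothetical helper, not the harness actually used for the tests.

```python
import json


def sampling_params(manual: bool) -> dict:
    """Return the sampling settings listed above as a request-body fragment.

    Field names are assumed to match the llama.cpp server /completion API.
    """
    return {
        "temperature": 0.9 if manual else 0.0,  # 0.0 for automated, 0.9 for manual
        "repeat_penalty": 1.05,                 # Penalize repeat sequence
        "repeat_last_n": 256,                   # Consider N tokens for penalize
        "penalize_nl": True,                    # Penalize repetition of newlines
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
    }


if __name__ == "__main__":
    # Example payload for an automated (deterministic) run; the prompt is a placeholder.
    payload = {"prompt": "Hello", **sampling_params(manual=False)}
    print(json.dumps(payload, indent=2))
```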
llama.cpp Version
- b3600-1-g2339a0be
- -fa -ngl -1 -ctk f16 --no-mmap
Tested Files
- Brinebreath-Llama-3.1-70B.Q4_0.gguf
- Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
Manual testing
Category | Test Case | Brinebreath-Llama-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
---|---|---|---|
Common Sense | Ball on cup | OK | OK |
Common Sense | Big duck small horse | OK | OK |
Common Sense | Killers | OK | OK |
Common Sense | Strawberry r's | KO | KO |
Common Sense | 9.11 or 9.9 bigger | KO | KO |
Common Sense | Dragon or lens | KO | KO |
Common Sense | Shirts | OK | KO |
Common Sense | Sisters | OK | KO |
Common Sense | Jane faster | OK | OK |
Programming | JSON | OK | OK |
Programming | Python snake game | OK | KO |
Math | Door window combination | OK | KO |
Smoke | Poem | OK | OK |
Smoke | Story | OK | OK |
Note: See sample_generations.txt in the root folder of the repo for the raw generations.
MMLU-PRO
Model | Success % |
---|---|
Brinebreath-3.1-70B.Q4_0.gguf | 49.0% |
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 42.0% |
MMLU-PRO category | Brinebreath-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
---|---|---|
Business | 45.0% | 40.0% |
Law | 40.0% | 35.0% |
Psychology | 85.0% | 80.0% |
Biology | 80.0% | 75.0% |
Chemistry | 50.0% | 45.0% |
History | 65.0% | 60.0% |
Other | 55.0% | 50.0% |
Health | 70.0% | 65.0% |
Economics | 80.0% | 75.0% |
Math | 35.0% | 30.0% |
Physics | 45.0% | 40.0% |
Computer Science | 60.0% | 55.0% |
Philosophy | 50.0% | 45.0% |
Engineering | 45.0% | 40.0% |
Note: MMLU-PRO overall tested with 100 questions. Categories tested with 20 questions from each category.
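Since the overall run used only 100 questions, it may help to see how large the sampling noise is next to the 7 point gap. The sketch below uses the overall scores from the table (49 and 42 correct out of 100) and a plain binomial standard error. It assumes the two runs are independent samples, which is probably pessimistic if both models saw the same questions, so treat it as a rough guide rather than a proper significance test.

```python
import math


def accuracy_and_stderr(correct: int, total: int) -> tuple[float, float]:
    """Return accuracy and its binomial standard error, both as fractions."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


# Overall MMLU-PRO results from the table above (100 questions each).
brine_p, brine_se = accuracy_and_stderr(49, 100)
llama_p, llama_se = accuracy_and_stderr(42, 100)

# Absolute gap in percentage points and gap relative to the base model.
diff_points = (brine_p - llama_p) * 100
relative_gain = (brine_p - llama_p) / llama_p * 100

# Standard error of the gap, assuming independent samples (an assumption).
diff_se_points = math.sqrt(brine_se**2 + llama_se**2) * 100

print(f"gap: {diff_points:.1f} points, relative: {relative_gain:.1f}%")
print(f"standard error of the gap: {diff_se_points:.1f} points")
```

With samples this small the standard error of the gap is of the same order as the gap itself, which is why the per-category and PubmedQA runs are worth reading alongside the overall number.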
PubmedQA
Model | Success % |
---|---|
Brinebreath-3.1-70B.Q4_0.gguf | 71.00% |
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 68.00% |
Note: PubmedQA tested with 100 questions.
Request
If you are hiring in the EU or can sponsor a visa, PM me :D
PS. Thank you mradermacher for the GGUFs!
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 36.29 |
IFEval (0-Shot) | 55.33 |
BBH (3-Shot) | 55.46 |
MATH Lvl 5 (4-Shot) | 29.98 |
GPQA (0-shot) | 12.86 |
MuSR (0-shot) | 17.49 |
MMLU-PRO (5-shot) | 46.62 |
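The "Avg." row is consistent with a plain arithmetic mean of the six benchmark scores in the table. A quick check, using only the values listed above:

```python
# Scores copied from the Open LLM Leaderboard table above.
scores = {
    "IFEval (0-Shot)": 55.33,
    "BBH (3-Shot)": 55.46,
    "MATH Lvl 5 (4-Shot)": 29.98,
    "GPQA (0-shot)": 12.86,
    "MuSR (0-shot)": 17.49,
    "MMLU-PRO (5-shot)": 46.62,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```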