Pinkstack committed
Commit d3ed0cf · verified · 1 Parent(s): 06261f4

Adding Evaluation Results (#3)


- Adding Evaluation Results (2611afea36acb2288320b1e0bbceb35a31a9784d)

Files changed (1)
  1. README.md +114 -1
README.md CHANGED
@@ -24,6 +24,105 @@ datasets:
 - amphora/QwQ-LongCoT-130K
 base_model:
 - microsoft/phi-4
+model-index:
+- name: SuperThoughts-CoT-14B-16k-o1-QwQ
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: IFEval (0-Shot)
+      type: wis-k/instruction-following-eval
+      split: train
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: inst_level_strict_acc and prompt_level_strict_acc
+      value: 5.15
+      name: averaged accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: BBH (3-Shot)
+      type: SaylorTwift/bbh
+      split: test
+      args:
+        num_few_shot: 3
+    metrics:
+    - type: acc_norm
+      value: 52.85
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MATH Lvl 5 (4-Shot)
+      type: lighteval/MATH-Hard
+      split: test
+      args:
+        num_few_shot: 4
+    metrics:
+    - type: exact_match
+      value: 40.79
+      name: exact match
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GPQA (0-shot)
+      type: Idavidrein/gpqa
+      split: train
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 19.02
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MuSR (0-shot)
+      type: TAUR-Lab/MuSR
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 21.79
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU-PRO (5-shot)
+      type: TIGER-Lab/MMLU-Pro
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 47.43
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ
+      name: Open LLM Leaderboard
 ---
 
 gguf/final version: https://huggingface.co/Pinkstack/PARM-V2-phi-4-16k-CoT-o1-gguf
@@ -75,4 +174,18 @@ All generated locally and pretty quickly too! 😲 Due to our very limited resou
 - **License:** MIT
 - **Finetuned from model :** microsoft/phi-4
 
-This phi-4 model was trained with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
+This phi-4 model was trained with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/Pinkstack__SuperThoughts-CoT-14B-16k-o1-QwQ-details)!
+Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!
+
+| Metric             | Value (%) |
+|--------------------|----------:|
+| **Average**        |     31.17 |
+| IFEval (0-Shot)    |      5.15 |
+| BBH (3-Shot)       |     52.85 |
+| MATH Lvl 5 (4-Shot)|     40.79 |
+| GPQA (0-shot)      |     19.02 |
+| MuSR (0-shot)      |     21.79 |
+| MMLU-PRO (5-shot)  |     47.43 |
+
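A note on the summary table added at the end of the card: the leaderboard's **Average** row is the unweighted mean of the six benchmark scores, which can be checked directly from the values in this commit. A minimal Python sketch (the score values are copied from the table; nothing else here is part of the commit):

```python
# Recompute the Open LLM Leaderboard "Average" from the six per-benchmark
# scores reported in the table above.
scores = {
    "IFEval (0-Shot)": 5.15,
    "BBH (3-Shot)": 52.85,
    "MATH Lvl 5 (4-Shot)": 40.79,
    "GPQA (0-shot)": 19.02,
    "MuSR (0-shot)": 21.79,
    "MMLU-PRO (5-shot)": 47.43,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 31.17, matching the table's Average row
```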
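More generally, the `model-index` block this commit adds to the front matter is plain YAML, so the reported results can be read programmatically rather than scraped from the table. A minimal sketch, assuming the card has been saved locally as `README.md` and PyYAML is installed; the file path and the printed layout are illustrative, not part of the commit:

```python
# Parse the card's YAML front matter and list the reported scores.
# Assumes README.md begins with a `---`-delimited front-matter block,
# as this model card does.
import yaml

with open("README.md", encoding="utf-8") as f:
    text = f.read()

# The front matter sits between the first two `---` delimiters.
front_matter = text.split("---")[1]
meta = yaml.safe_load(front_matter)

for entry in meta["model-index"]:
    print(entry["name"])
    for result in entry["results"]:
        dataset = result["dataset"]["name"]
        for metric in result["metrics"]:
            print(f"  {dataset}: {metric['type']} = {metric['value']}")
```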