Commit 78b49d0 by wdevazelhes (verified) · 1 parent: e4c6425

add scores

Files changed (1)
  1. README.md +153 -1
README.md CHANGED
@@ -108,7 +108,159 @@ Use the code below to get started with the model.
 
  ## Evaluation
 
- <!-- This section describes the evaluation protocols and provides the results. -->
+ <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
+ <colgroup>
+ <col style="width: 10%;">
+ <col style="width: 10%;">
+ <col style="width: 7%;">
+ <col style="width: 7%;">
+ <col style="width: 7%;">
+ <col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
+ </colgroup>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Benchmark</th>
+ <th>Llama-3.2-1B</th>
+ <th>Qwen2.5-1.5B</th>
+ <th>SmolLM2-1.7B</th>
+ <th>Falcon3-1B-Instruct</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="3">General</td>
+ <td>MMLU (5-shot)</td>
+ <td>23.4</td>
+ <td><b>58.4</b></td>
+ <td>48.4</td>
+ <td>43.9</td>
+ </tr>
+ <tr>
+ <td>MMLU-PRO (5-shot)</td>
+ <td>11.3</td>
+ <td><b>21.3</b></td>
+ <td>17.2</td>
+ <td>18.6</td>
+ </tr>
+ <tr>
+ <td>IFEval</td>
+ <td><b>55.8</b></td>
+ <td>44.4</td>
+ <td>53.0</td>
+ <td>54.4</td>
+ </tr>
+ <tr>
+ <td rowspan="3">Math</td>
+ <td>GSM8K (5-shot)</td>
+ <td>37.4</td>
+ <td><b>57.2</b></td>
+ <td>43.4</td>
+ <td>38.6</td>
+ </tr>
+ <tr>
+ <td>GSM8K (8-shot, COT)</td>
+ <td>35.6</td>
+ <td><b>62.2</b></td>
+ <td>47.2</td>
+ <td>41.8</td>
+ </tr>
+ <tr>
+ <td>MATH Lvl-5 (4-shot)</td>
+ <td><b>3.9</b></td>
+ <td>0.2</td>
+ <td>0.1</td>
+ <td>1.0</td>
+ </tr>
+ <tr>
+ <td rowspan="6">Reasoning</td>
+ <td>Arc Challenge (25-shot)</td>
+ <td>34.1</td>
+ <td>47.0</td>
+ <td><b>47.6</b></td>
+ <td>45.9</td>
+ </tr>
+ <tr>
+ <td>GPQA (0-shot)</td>
+ <td>25.3</td>
+ <td><b>29.6</b></td>
+ <td>28.7</td>
+ <td>26.5</td>
+ </tr>
+ <tr>
+ <td>GPQA (0-shot, COT)</td>
+ <td>13.2</td>
+ <td>9.2</td>
+ <td>16.0</td>
+ <td><b>21.3</b></td>
+ </tr>
+ <tr>
+ <td>MUSR (0-shot)</td>
+ <td>32.4</td>
+ <td>36.8</td>
+ <td>33.0</td>
+ <td><b>40.7</b></td>
+ </tr>
+ <tr>
+ <td>BBH (3-shot)</td>
+ <td>30.3</td>
+ <td><b>38.5</b></td>
+ <td>33.1</td>
+ <td>35.1</td>
+ </tr>
+ <tr>
+ <td>BBH (3-shot, COT)</td>
+ <td>0.0</td>
+ <td>20.3</td>
+ <td>0.8</td>
+ <td><b>30.5</b></td>
+ </tr>
+ <tr>
+ <td rowspan="5">CommonSense Understanding</td>
+ <td>PIQA (0-shot)</td>
+ <td>72.1</td>
+ <td>73.2</td>
+ <td><b>74.4</b></td>
+ <td>72.0</td>
+ </tr>
+ <tr>
+ <td>SciQ (0-shot)</td>
+ <td>61.8</td>
+ <td>69.5</td>
+ <td>71.4</td>
+ <td><b>86.8</b></td>
+ </tr>
+ <tr>
+ <td>Winogrande (0-shot)</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td><b>60.2</b></td>
+ </tr>
+ <tr>
+ <td>OpenbookQA (0-shot)</td>
+ <td>40.2</td>
+ <td>40.4</td>
+ <td><b>42.8</b></td>
+ <td>40.0</td>
+ </tr>
+ <tr>
+ <td>MT-Bench (avg)</td>
+ <td>5.4</td>
+ <td><b>7.1</b></td>
+ <td>6.1</td>
+ <td>5.5</td>
+ </tr>
+ <tr>
+ <td rowspan="1">Instruction following</td>
+ <td>Alpaca (WC)</td>
+ <td><b>8.6</b></td>
+ <td><b>8.6</b></td>
+ <td>5.4</td>
+ <td>6.1</td>
+ </tr>
+ </tbody>
+ </table>
 
  ### Testing Data, Factors & Metrics
 