AtakanTekparmak committed
Commit bf1a56c · verified · 1 Parent(s): 1a4e62f

fix: Updated DPAB score

Files changed (1)
README.md +1 -1
README.md CHANGED
@@ -229,7 +229,7 @@ and the MMLU-Pro and DPAB results:
 | Benchmark Name | Qwen2.5-Coder-3B-Instruct | Dria-Agent-α-3B |
 |----------------|---------------------------|-----------------|
 | MMLU-Pro | 35.2 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 29.8* |
-| DPAB (Pythonic, Strict) | 28 | 72 |
+| DPAB (Pythonic, Strict) | 26 | 72 |

 **\*Note:** The model tends to use Pythonic function calling for many of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in their [GitHub repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script, and leave that for future iterations of this model. However, based on qualitative analysis of the model responses, we suspect that the model's score would increase rather than suffer a ~6% decrease.