AtakanTekparmak committed
Commit edcb18e · verified · 1 Parent(s): f6f544b

feat: Updated benchmark results

Files changed (1): README.md +20 -3
README.md CHANGED
@@ -206,12 +206,29 @@ We evaluate the model on the following benchmarks:
206    2. MMLU-Pro
207    3. **Dria-Pythonic-Agent-Benchmark (DPAB):** The benchmark we curated with a pipeline of synthetic data generation, model-based validation, filtering, and manual selection to evaluate LLMs on their Pythonic function calling ability, spanning multiple scenarios and tasks. More detailed information about the benchmark and the GitHub repo will be released soon.
208
209 -  Below are the evaluation results for Qwen2.5-Coder-3B-Instruct and Dria-Agent-α-3B
210
211    | Benchmark Name | Qwen2.5-Coder-3B-Instruct | Dria-Agent-α-3B |
212    |----------------|---------------------------|-----------------|
213 -  | BFCL | TBD | TBD |
214    | MMLU-Pro | 35.2 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 29.8* |
215 -  | DPAB | 24 | 51 |
216
217    **\*Note:** The model tends to use Pythonic function calling for many of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in their [GitHub repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script and leave that for future iterations of this model. However, based on a qualitative analysis of the model responses, we suspect that the model's score would increase rather than suffer a ~6% decrease.
 
206    2. MMLU-Pro
207    3. **Dria-Pythonic-Agent-Benchmark (DPAB):** The benchmark we curated with a pipeline of synthetic data generation, model-based validation, filtering, and manual selection to evaluate LLMs on their Pythonic function calling ability, spanning multiple scenarios and tasks. More detailed information about the benchmark and the GitHub repo will be released soon.
208
209 +  Below are the BFCL evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B*** and ***gpt-4o-2024-11-20***:
210 +
211 +  | Metric | Qwen/Qwen2.5-3B-Instruct | Dria-Agent-a-3B | gpt-4o-2024-11-20 (Prompt) |
212 +  |---------------------------------------|-----------|-----------|-----------|
213 +  | **Non-Live Simple AST** | 75.50% | 75.08% | 79.42% |
214 +  | **Non-Live Multiple AST** | 90.00% | 93.00% | 95.50% |
215 +  | **Non-Live Parallel AST** | 80.00% | 85.00% | 94.00% |
216 +  | **Non-Live Parallel Multiple AST** | 78.50% | 79.00% | 83.50% |
217 +  | **Non-Live Simple Exec** | 82.07% | 87.57% | 100.00% |
218 +  | **Non-Live Multiple Exec** | 86.00% | 85.14% | 94.00% |
219 +  | **Non-Live Parallel Exec** | 82.00% | 90.00% | 86.00% |
220 +  | **Non-Live Parallel Multiple Exec** | 80.00% | 88.00% | 77.50% |
221 +  | **Live Simple AST** | 68.22% | 70.16% | 83.72% |
222 +  | **Live Multiple AST** | 66.00% | 67.14% | 79.77% |
223 +  | **Live Parallel AST** | 62.50% | 50.00% | 87.50% |
224 +  | **Live Parallel Multiple AST** | 66.67% | 70.83% | 70.83% |
225 +  | **Relevance Detection** | 88.89% | 100.00% | 83.33% |
226 +
227 +  and the MMLU-Pro and DPAB results:
228
229    | Benchmark Name | Qwen2.5-Coder-3B-Instruct | Dria-Agent-α-3B |
230    |----------------|---------------------------|-----------------|
231    | MMLU-Pro | 35.2 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 29.8* |
232 +  | DPAB (Pythonic, Strict) | 24 | 51 |
233
234    **\*Note:** The model tends to use Pythonic function calling for many of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in their [GitHub repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script and leave that for future iterations of this model. However, based on a qualitative analysis of the model responses, we suspect that the model's score would increase rather than suffer a ~6% decrease.
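
For readers unfamiliar with the term, "Pythonic function calling" (the ability DPAB measures, and what the note above refers to) means the model answers by writing executable Python that calls the provided functions, rather than emitting a JSON tool-call object. The DPAB tasks and harness have not been released yet, so the snippet below is only a minimal, hypothetical sketch of the idea; the tool names and the `exec`-based checking are illustrative assumptions, not the actual benchmark format.

```python
# Minimal, hypothetical sketch of "Pythonic function calling" (not the DPAB format).
# Toy stand-ins for tools that would be described to the model:

def get_weather(city: str) -> dict:
    """Pretend weather lookup."""
    return {"city": city, "temp_c": 21}

def send_email(to: str, subject: str, body: str) -> bool:
    """Pretend email sender."""
    print(f"email to {to}: {subject} | {body}")
    return True

# Instead of a JSON tool call such as
#   {"name": "get_weather", "arguments": {"city": "Istanbul"}}
# a Pythonic-function-calling model responds with executable Python:
model_response = """
weather = get_weather("Istanbul")
send_email(
    to="alice@example.com",
    subject="Weather update",
    body=f"It is {weather['temp_c']} C in {weather['city']} today.",
)
"""

# A harness can then execute the response against the provided tools and
# verify which calls were made and what they returned.
namespace = {"get_weather": get_weather, "send_email": send_email}
exec(model_response, namespace)
```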
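
The note above attributes part of the MMLU-Pro gap to answer extraction: the evaluation scripts look for an explicit answer-letter statement in the response, so an answer computed inside Python code yields nothing to extract. Below is a rough sketch of that failure mode, assuming a simplified regex extractor in the spirit of the official scripts (the exact pattern they use may differ).

```python
import re

# Simplified answer extractor assumed to mirror the spirit of the MMLU-Pro
# scripts: look for an explicit "answer is (X)" statement in the response.
def extract_choice(response: str) -> str | None:
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    return match.group(1).upper() if match else None

# A conventional chain-of-thought response is extracted correctly...
plain = "Using F = m * a gives a = 12 / 4 = 3 m/s^2, so the answer is (C)."
assert extract_choice(plain) == "C"

# ...while a Pythonic response that computes the same result as code yields
# no extractable letter and is therefore scored as wrong by the harness.
pythonic = """
force, mass = 12, 4
acceleration = force / mass  # 3 m/s^2, i.e. option C
print(acceleration)
"""
assert extract_choice(pythonic) is None
```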