AtakanTekparmak
committed on
feat: Updated benchmark results
README.md CHANGED
@@ -206,12 +206,29 @@ We evaluate the model on the following benchmarks:
 2. MMLU-Pro
 3. **Dria-Pythonic-Agent-Benchmark (DPAB):** The benchmark we curated through a pipeline of synthetic data generation, model-based validation, filtering, and manual selection to evaluate LLMs on their Pythonic function-calling ability, spanning multiple scenarios and tasks. More detailed information about the benchmark and its GitHub repo will be released soon.
 
-Below are the evaluation results for Qwen2.5-Coder-3B-Instruct
+Below are the BFCL evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B*** and ***gpt-4o-2024-11-20***:
+
+| Metric                              | Qwen/Qwen2.5-3B-Instruct | Dria-Agent-α-3B | gpt-4o-2024-11-20 (Prompt) |
+|-------------------------------------|--------------------------|-----------------|----------------------------|
+| **Non-Live Simple AST**             | 75.50%                   | 75.08%          | 79.42%                     |
+| **Non-Live Multiple AST**           | 90.00%                   | 93.00%          | 95.50%                     |
+| **Non-Live Parallel AST**           | 80.00%                   | 85.00%          | 94.00%                     |
+| **Non-Live Parallel Multiple AST**  | 78.50%                   | 79.00%          | 83.50%                     |
+| **Non-Live Simple Exec**            | 82.07%                   | 87.57%          | 100.00%                    |
+| **Non-Live Multiple Exec**          | 86.00%                   | 85.14%          | 94.00%                     |
+| **Non-Live Parallel Exec**          | 82.00%                   | 90.00%          | 86.00%                     |
+| **Non-Live Parallel Multiple Exec** | 80.00%                   | 88.00%          | 77.50%                     |
+| **Live Simple AST**                 | 68.22%                   | 70.16%          | 83.72%                     |
+| **Live Multiple AST**               | 66.00%                   | 67.14%          | 79.77%                     |
+| **Live Parallel AST**               | 62.50%                   | 50.00%          | 87.50%                     |
+| **Live Parallel Multiple AST**      | 66.67%                   | 70.83%          | 70.83%                     |
+| **Relevance Detection**             | 88.89%                   | 100.00%         | 83.33%                     |
+
+and the MMLU-Pro and DPAB results:
 
 | Benchmark Name | Qwen2.5-Coder-3B-Instruct | Dria-Agent-α-3B |
 |----------------|---------------------------|-----------------|
-| BFCL | TBD | TBD |
 | MMLU-Pro | 35.2 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 29.8* |
-| DPAB | 24 | 51 |
+| DPAB (Pythonic, Strict) | 24 | 51 |
 
 **\*Note:** The model tends to use Pythonic function calling for many of the STEM-related test cases (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in the benchmark's [GitHub repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script, and leave that for future iterations of this model. However, based on a qualitative analysis of the model's responses, we suspect the score would increase rather than suffer a ~6% decrease.
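For readers unfamiliar with the term, the "Pythonic function calling" that DPAB measures refers to the model emitting executable Python that calls the provided tools directly, rather than a JSON tool-call object that a harness must dispatch. The sketch below is purely illustrative and not taken from DPAB; the tool `get_weather` and both model outputs are made up:

```python
import json

# A hypothetical tool exposed to the model (not part of any real API).
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}

# JSON-style function calling: the model emits a structured object,
# and the evaluation harness looks up and dispatches the call.
json_call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(json_call)
result_json = {"get_weather": get_weather}[call["name"]](**call["arguments"])

# Pythonic function calling: the model emits Python source that invokes
# the tool directly, so it can compose calls, loop, or post-process freely.
pythonic_call = "result = get_weather('Paris')"
namespace = {"get_weather": get_weather}
exec(pythonic_call, namespace)
result_pythonic = namespace["result"]

print(result_json == result_pythonic)  # → True: both styles reach the same tool
```

The practical difference shows up in multi-step tasks: a Pythonic model can chain several tool calls inside one snippet, while a JSON-calling model needs one harness round-trip per call.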
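The scoring mismatch described in the note can be made concrete: MMLU-Pro's scripts look for a multiple-choice letter in the response text, so a response that instead computes the result in Python is scored as unanswered even when it reaches the correct option. The extractor below only mimics that behavior in spirit; its regex is a simplified stand-in, not copied from the official repository, and both responses are invented:

```python
import re

def extract_choice(response: str):
    # Simplified letter extractor in the spirit of MMLU-Pro's answer
    # parsing (the official regexes differ; this is illustrative only).
    m = re.search(r"answer is \(?([A-J])\)?", response)
    return m.group(1) if m else None

# Letter-style response: the extractor finds the chosen option.
letter_style = "Computing 3 * 7 + 4 = 25, so the answer is (B)."

# Pythonic response: correct reasoning, but no extractable letter.
pythonic_style = "def solve():\n    return 3 * 7 + 4  # 25, option B\nprint(solve())"

print(extract_choice(letter_style))    # → B  (counted as answered)
print(extract_choice(pythonic_style))  # → None  (scored as incorrect)
```

This is why the note argues the reported 29.8 likely understates the model: responses of the second kind are dropped by the extractor regardless of whether the computed value matches the correct option.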