andthattoo committed
Commit 3b1d2c3 · verified · 1 parent: a9cc635

Update README.md

Files changed (1): README.md (+1 −1)
README.md CHANGED
@@ -225,7 +225,7 @@ and the MMLU-Pro and DPAB results:
 | Benchmark Name | Qwen2.5-Coder-7B-Instruct | Dria-Agent-α-7B |
 |----------------|---------------------------|-----------------|
 | MMLU-Pro | 45.6 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 42.54 |
-| DPAB (Pythonic, Strict) | 30.0 | 51.0 |
+| DPAB (Pythonic, Strict) | 44.0 | 70.0 |
 
 **\*Note:** The model tends to use Pythonic function calling for a lot of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in their [Github repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the script for evaluation, and leave it for the future iterations of this model. However, by performing qualitative analysis on the model responses, we suspect that the model's score will increase instead of suffering a ~3% decrease.
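As context for the note in the diff: if, as the note suggests, the MMLU-Pro evaluation scripts extract the final choice by matching phrases like "answer is (X)", then a response that instead ends in a Pythonic function call goes unscored. Below is a minimal, hypothetical sketch of the kind of fallback extraction the note hints at; the `final_answer(...)` call shape and both regexes are illustrative assumptions, not the model's documented output format or the MMLU-Pro repository's actual code.

```python
import re

# Hypothetical sketch: answer extraction with a fallback for Pythonic
# function-call responses. Both patterns and the `final_answer(...)` call
# shape are illustrative assumptions, not MMLU-Pro's actual code.

PHRASE_RE = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)
# Fallback: a single option letter passed to some function call,
# e.g. final_answer(choice="B") or submit('D').
CALL_RE = re.compile(r"""\w+\s*\(\s*(?:\w+\s*=\s*)?["']([A-J])["']\s*\)""")

def extract_choice(response: str) -> str | None:
    """Return the predicted option letter, or None if nothing parses."""
    match = PHRASE_RE.search(response) or CALL_RE.search(response)
    return match.group(1).upper() if match else None

# A plain "answer is (X)" matcher would miss the last two of these:
print(extract_choice("Therefore, the answer is (C)."))  # -> C
print(extract_choice('final_answer(choice="B")'))       # -> B
print(extract_choice("submit('D')"))                    # -> D
```

A fallback like this would let such responses count toward the score rather than as parse failures, which is consistent with the note's expectation that the reported MMLU-Pro number understates the model by roughly 3%.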