MT-Bench Scores

#1
by leonardlin - opened

Since I noticed OmniBeagle and still had my setup for NeuralBeagle open:

# First Turn
omnibeagle-7b             1     8.32500 
# Second Turn
omnibeagle-7b             2     7.587500
# Average
omnibeagle-7b              7.956250                                  
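
For anyone checking the arithmetic, the average is just the mean of the two per-turn scores. A minimal sketch, assuming both turns have the same number of judged answers (if some judgments failed and were dropped, the overall mean could differ slightly); the variable names are illustrative only:

```python
# Per-turn MT-Bench scores for omnibeagle-7b, copied from the output above.
turn_scores = {1: 8.325, 2: 7.5875}

# Assumption: each turn contributes the same number of judged answers,
# so the overall average equals the mean of the two per-turn means.
average = sum(turn_scores.values()) / len(turn_scores)

print(f"{average:.6f}")
```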

In context average:

gpt-4                        8.990625                                  
omnibeagle-7b                7.956250                                  
gpt-3.5-turbo                7.943750                                  
claude-instant-v1            7.905660                                  
claude-v1                    7.900000                                  
neuralbeagle14-7b            7.628125                                  
orion-14b-chat               7.415625                                  

[Attached plot: newplot(11).png]

Great work guys! Needs some more coding/math related merges.

Those two categories are the hardest for GPT-4 to judge on MT-Bench, though, so take it with a grain of salt
(as discussed in the MT-bench paper).

@leonardlin how many MT-benches have you run?
Do you keep the model outputs and judge annotations?

I was thinking of setting up a public & crowdsourced dataset for those because, unlike Alpaca where they publish it all,
MT-bench details are hard to come by.
Usually it's just the final numbers.

Honestly, code should be eval'd with EvalPlus https://evalplus.github.io/leaderboard.html - there are probably some specialized math benchmarks as well. While MT-Bench has been found to have the best correlation (0.89) w/ Chatbot Arena rankings, I still have my doubts about the reasoning capabilities of 7B-parameter models. I guess others will have to play around with it and give their feedback.

fully agree @gblazex ! I'd be happy to share our model outputs & judge annotations, we've run a few models: mistral7binstructv0.2, zephyr7B-beta, Notus7B, OpenHermes, and our latest CapybaraHermes

@gblazex I've run... a lot of MT-bench (and JA MT-bench), and I have basically all the outputs and annotations. I have some other fish to fry (new training run, etc.), but it's very high on my list to publish a refactored llm-as-judge codebase as a public repo that lets people easily merge results/metadata, which will hopefully help build a dataset. It'll also have much better inference flexibility, YAML config files (since prompting/templating can have a huge effect on benchmark scores), and ideally be a lot more push-button for the whole process.

@dvilasuero one of the things just higher on my priority list is getting argilla properly set up for improving the datasets for our next training run :)

We (anyone interested in better MT-bench, compiling scores) should definitely try to coordinate soon!

@dvilasuero that's a great offer Daniel, thank you!
I know @abacaj @xDAN2099 @SanjiWatsuki have private MT-bench runs too, they might contribute their outputs.

I'll set up a space where people can easily upload files.

@leonardlin how does that sound?

Feel free to use. : )
@gblazex @dvilasuero
https://huggingface.co/xDAN-AI/xDAN-L1-Chat-RL-v1

Wow thanks @leonardlin ! Always happy to see numbers go up :)

Also very curious to see an exhaustive MT-Bench leaderboard. I'm planning to add it to LLM AutoEval.

@gblazex @dvilasuero Here's the best site I've found with benchmarks for the latest models: https://llm.extractum.io/model/mlabonne%2FOmniBeagle-7B,6F7N4LPPCLWWbfV9hmJ3Va

@leonardlin Is it possible to share the commands you used to evaluate the model? I tried running MT-Bench and got significantly worse results, but I'm not sure I used the correct parameters. Thanks in advance!

More specifically, can you detail how you handled the second turn of MT-Bench: did you use the ID of another model (and if so, which one)?

From FastChat/fastchat/llm_judge

I generate answers via:

time python gen_model_answer.py --bench-name mt_bench --model-path mlabonne/OmniBeagle-7B --model-id omnibeagle14-7b --num-gpus-total 2

And judge via:

time python gen_judgment.py --bench-name mt_bench --model-list omnibeagle14-7b --judge-file data/judge_prompts.jsonl --parallel 2

And show results:

python show_result.py --bench-name mt_bench
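
If it helps, the three commands above can be chained from one script. This is only a sketch under a few assumptions: it is run from FastChat/fastchat/llm_judge, the GPT-4 judge credentials are already configured in the environment, and `build_commands` / `run_pipeline` are hypothetical helper names (not part of FastChat). The flags are exactly the ones shown above.

```python
import subprocess
import sys


def build_commands(model_path, model_id, num_gpus=2, parallel=2):
    """Return the three MT-Bench steps as argument lists (hypothetical helper)."""
    py = sys.executable  # use the same interpreter that runs this script
    return [
        # 1. Generate model answers for both turns.
        [py, "gen_model_answer.py", "--bench-name", "mt_bench",
         "--model-path", model_path, "--model-id", model_id,
         "--num-gpus-total", str(num_gpus)],
        # 2. Have the judge score the answers.
        [py, "gen_judgment.py", "--bench-name", "mt_bench",
         "--model-list", model_id,
         "--judge-file", "data/judge_prompts.jsonl",
         "--parallel", str(parallel)],
        # 3. Print first-turn, second-turn and average scores.
        [py, "show_result.py", "--bench-name", "mt_bench"],
    ]


def run_pipeline(model_path, model_id, **kwargs):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(model_path, model_id, **kwargs):
        subprocess.run(cmd, check=True)


# Example call (commented out so importing this file does not start a run):
# run_pipeline("mlabonne/OmniBeagle-7B", "omnibeagle14-7b")
```

Stdin is left attached to the terminal, so any interactive confirmation a step asks for can still be answered by hand.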

I'm looking at my scripts and I believe that's it (no changes to conversation.py either). For some of my testing I have my own codebase (part of that rewrite) where I apply custom chat_templates (either those in tokenizer_config.json or ones I determine myself) to make sure models produce optimal output, but here I just ran the default omnibeagle and got a decent score.

2nd turn evaluation should be automatic.

Notably, the model name is used to determine the prompt that you use. I forked MT-Bench to let my models use their proper prompts during answer generation. If you used just Omnibeagle14-7B as the name, it probably used the zero-shot template which is ... probably fine? Most of these merges have no coherent prompting format due to having so many pieces merged in, unfortunately.

Second turn shouldn’t need special treatment I think.

Or do only the second-turn scores differ significantly?

As stated above, the key here is to make sure the right chat template is applied. If we don't use the right one, second turn is the most affected.

@SanjiWatsuki very cool! Is the fork accessible?

It's nothing special, I just edited fastchat/model/model_adapter.py to apply the Alpaca prompt whenever the model name included the word "maid". It's not something really worth uploading.
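
To illustrate the idea of name-based template selection, here is a rough standalone sketch. It is not the actual fork and does not use FastChat's adapter classes: `pick_template` and `alpaca_prompt` are hypothetical names, the "zero_shot" fallback just mirrors the guess earlier in the thread (the real FastChat default may differ), and the multi-turn layout is one plausible way to extend the single-turn Alpaca format.

```python
ALPACA_SYSTEM = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def pick_template(model_name):
    """Hypothetical dispatch: choose a template label from the model name."""
    if "maid" in model_name.lower():
        return "alpaca"
    # Assumed fallback label, taken from the discussion above.
    return "zero_shot"


def alpaca_prompt(turns):
    """Build an Alpaca-style prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; the reply is
    None for the turn the model should answer next.
    """
    parts = [ALPACA_SYSTEM]
    for user_message, assistant_reply in turns:
        parts.append(f"### Instruction:\n{user_message}")
        if assistant_reply is None:
            parts.append("### Response:\n")  # left open for generation
        else:
            parts.append(f"### Response:\n{assistant_reply}")
    return "\n\n".join(parts)


# Second-turn prompt: the first answer is fed back in as context.
prompt = alpaca_prompt([
    ("Write a haiku about benchmarks.", "Scores rise and then fall"),
    ("Now rewrite it as a limerick.", None),
])
print(prompt)
```

The second turn is where a wrong template hurts most, because the model has to parse its own earlier answer out of the prompt before responding.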

@leonardlin Thanks for the info, I replaced the default conv template with alpaca (https://github.com/mlabonne/FastChat) and managed to reproduce your results.

Here's what I got with NeuralOmniBeagle-v2:

########## First turn ##########
                                    score
model                       turn         
gpt-4                       1     8.95625
OmniBeagle-7B               1     8.31250
NeuralOmniBeagle-7B-v2      1     8.24375

########## Second turn ##########
                                     score
model                       turn          
gpt-4                       2     9.025000
OmniBeagle-7B               2     7.837500
NeuralOmniBeagle-7B-v2      2     7.825000

########## Average ##########
                                score
model                                
gpt-4                        8.990625
OmniBeagle-7B                8.075000
NeuralOmniBeagle-7B-v2       8.034375
