From original readme

gemma-2-9b-it fine-tuned with hybrid WPO, using two types of data:

1. On-policy sampled Gemma outputs based on UltraFeedback prompts.
2. GPT-4-turbo outputs based on UltraFeedback prompts.

Unlike the preference data construction method in our paper, we score the outputs with RLHFlow/ArmoRM-Llama3-8B-v0.1 and pair the highest-scoring output with the lowest-scoring output to form a preference pair.
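
The pair selection described above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual pipeline: the function name `build_preference_pair`, the `chosen`/`rejected` keys, and the tie handling are assumptions, and the scores are assumed to have been computed beforehand by the reward model.

```python
from typing import Dict, List


def build_preference_pair(outputs: List[str], scores: List[float]) -> Dict[str, str]:
    """Pick the max-score and min-score outputs for one prompt.

    `scores[i]` is assumed to be the reward-model score of `outputs[i]`.
    Raising on ties is an illustrative choice, not taken from the README.
    """
    if len(outputs) != len(scores):
        raise ValueError("outputs and scores must have the same length")
    if len(outputs) < 2:
        raise ValueError("need at least two candidate outputs")

    indices = range(len(scores))
    best = max(indices, key=lambda i: scores[i])   # highest-scoring output
    worst = min(indices, key=lambda i: scores[i])  # lowest-scoring output

    if best == worst:
        # Happens only when every score is identical.
        raise ValueError("all scores are equal; no preference pair can be formed")

    return {"chosen": outputs[best], "rejected": outputs[worst]}


# Made-up candidates and scores for a single prompt.
pair = build_preference_pair(
    ["answer A", "answer B", "answer C"],
    [0.12, 0.87, 0.45],
)
```

Here `pair["chosen"]` is the output with the maximum score and `pair["rejected"]` is the one with the minimum score.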

We provide our training data at wzhouad/gemma-2-ultrafeedback-hybrid.