Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V7

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified (the card records it as None). Evaluation-set results for each logged checkpoint appear in the table at the end of this card.

Model description

More information needed

The hyperparameters used during training are not listed. The table below reports training and evaluation metrics, logged every 60 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6759	0.3023	60	0.6970	0.0373	0.0397	0.6000	-0.0024	-149.7405	-120.8940	0.4429	0.4532
0.6811	0.6045	120	0.6723	-0.0412	-0.0677	0.5000	0.0265	-150.8149	-121.6795	0.4688	0.4793
0.5824	0.9068	180	0.6747	0.0390	-0.0060	0.8000	0.0450	-150.1981	-120.8773	0.4537	0.4631
0.3049	1.2091	240	0.5606	-0.3769	-0.6960	0.7000	0.3191	-157.0981	-125.0365	0.3873	0.3966
0.3915	1.5113	300	0.5289	-0.4550	-0.8493	0.9000	0.3943	-158.6304	-125.8171	0.3314	0.3395
0.4760	1.8136	360	0.5109	-0.7144	-1.1970	0.9000	0.4826	-162.1081	-128.4113	0.2160	0.2235
0.1137	2.1159	420	0.5121	-1.1098	-1.6334	0.8000	0.5236	-166.4716	-132.3654	0.0934	0.1001
0.3063	2.4181	480	0.4482	-1.9206	-2.8102	0.9000	0.8895	-178.2394	-140.4735	-0.0433	-0.0370
0.2409	2.7204	540	0.4540	-1.9538	-2.8279	0.9000	0.8741	-178.4166	-140.8054	-0.0659	-0.0598
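
The table can be sanity-checked with a few lines of Python. The sketch below assumes the reward columns follow the usual DPO convention, in which Rewards/margins is Rewards/chosen minus Rewards/rejected and the per-pair loss is -log(sigmoid(margin)); the card itself does not state this. The names ROWS, margin_mismatches, best_checkpoint, and dpo_loss_at_margin are hypothetical helpers introduced for illustration, and the values are copied from the table above.

```python
import math

# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins),
# copied from the table above.
ROWS = [
    (60, 0.6970, 0.0373, 0.0397, -0.0024),
    (120, 0.6723, -0.0412, -0.0677, 0.0265),
    (180, 0.6747, 0.0390, -0.0060, 0.0450),
    (240, 0.5606, -0.3769, -0.6960, 0.3191),
    (300, 0.5289, -0.4550, -0.8493, 0.3943),
    (360, 0.5109, -0.7144, -1.1970, 0.4826),
    (420, 0.5121, -1.1098, -1.6334, 0.5236),
    (480, 0.4482, -1.9206, -2.8102, 0.8895),
    (540, 0.4540, -1.9538, -2.8279, 0.8741),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where chosen - rejected differs from the logged margin.

    The tolerance allows for the four-decimal rounding used in the table.
    """
    return [
        step
        for step, _loss, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_checkpoint(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])


def dpo_loss_at_margin(margin):
    """Assumed per-pair DPO loss, -log(sigmoid(margin)), for a given margin."""
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    print("margin mismatches:", margin_mismatches(ROWS))
    step, loss, *_ = best_checkpoint(ROWS)
    print("lowest validation loss:", loss, "at step", step)
    print("loss at zero margin:", round(dpo_loss_at_margin(0.0), 4))
```

Under these assumptions, a margin of zero corresponds to a loss of ln 2 (about 0.6931), which is close to the validation loss logged at step 60. The lowest validation loss in the table is at step 480, not at the final logged step.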