Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V7

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified. Evaluation-set results at each logged step are listed in the table at the end of this card.

Model description

More information needed

The training hyperparameters are not listed in this card. Training and evaluation metrics were logged every 63 steps (0.3 epochs) for 3 epochs (630 steps):

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7133	0.3	63	0.6946	0.0868	0.0595	0.6000	0.0274	-111.1324	-112.8988	0.5159	0.4902
0.5044	0.6	126	0.6814	0.2402	0.0924	0.6000	0.1478	-110.8034	-111.3656	0.5007	0.4738
0.6555	0.9	189	0.6392	-0.0496	-0.2815	0.7000	0.2319	-114.5420	-114.2632	0.5375	0.5056
0.2983	1.2	252	0.6671	-0.8670	-1.3823	0.5000	0.5153	-125.5504	-122.4372	0.4453	0.4053
0.2870	1.5	315	0.6743	-1.0040	-1.5229	0.4000	0.5189	-126.9560	-123.8071	0.3434	0.2980
0.3130	1.8	378	0.7727	-1.1663	-1.4516	0.4000	0.2853	-126.2434	-125.4304	0.3244	0.2767
0.1026	2.1	441	0.8556	-1.5616	-1.8026	0.4000	0.2410	-129.7528	-129.3835	0.2187	0.1675
0.1738	2.4	504	1.1593	-2.7915	-2.8593	0.4000	0.0677	-140.3199	-141.6827	0.0630	0.0046
0.2095	2.7	567	1.1725	-2.9060	-2.9579	0.4000	0.0519	-141.3057	-142.8270	0.0427	-0.0158
0.0235	3.0	630	1.1519	-2.8728	-2.9359	0.4000	0.0631	-141.0865	-142.4955	0.0425	-0.0160
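
The table can be sanity-checked with a few lines of Python. The sketch below assumes the usual DPO logging convention, in which Rewards/margins is the mean of Rewards/chosen minus Rewards/rejected; this card does not state that convention itself. The rows are copied from the table above, and the helper names (`max_margin_gap`, `best_checkpoint`) are hypothetical, chosen only for illustration.

```python
# Rows copied from the results table:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (63, 0.6946, 0.0868, 0.0595, 0.0274),
    (126, 0.6814, 0.2402, 0.0924, 0.1478),
    (189, 0.6392, -0.0496, -0.2815, 0.2319),
    (252, 0.6671, -0.8670, -1.3823, 0.5153),
    (315, 0.6743, -1.0040, -1.5229, 0.5189),
    (378, 0.7727, -1.1663, -1.4516, 0.2853),
    (441, 0.8556, -1.5616, -1.8026, 0.2410),
    (504, 1.1593, -2.7915, -2.8593, 0.0677),
    (567, 1.1725, -2.9060, -2.9579, 0.0519),
    (630, 1.1519, -2.8728, -2.9359, 0.0631),
]


def max_margin_gap(rows):
    """Largest absolute difference between the logged margin and chosen - rejected."""
    return max(abs((chosen - rejected) - margin)
               for _, _, chosen, rejected, margin in rows)


def best_checkpoint(rows):
    """Return (step, validation loss) for the row with the lowest validation loss."""
    step, val_loss, *_ = min(rows, key=lambda row: row[1])
    return step, val_loss


if __name__ == "__main__":
    # Assumed convention: margin == chosen - rejected, up to 4-decimal rounding.
    print("max margin gap:", max_margin_gap(ROWS))
    print("lowest validation loss at (step, loss):", best_checkpoint(ROWS))
```

Under that assumption the logged margins match chosen minus rejected to within rounding. Validation loss is lowest at step 189 (0.6392) and rises to 1.1519 by step 630 while training loss keeps falling, which is the pattern one would expect from overfitting after roughly the first epoch.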