Update README.md
README.md CHANGED
@@ -56,10 +56,11 @@ license: mit
# Model Card for nano-phi-115M-v0.1

Inspired by [Phi2](https://huggingface.co/microsoft/phi-2) and open-source small language model attempts such as [smol_llama-101M-GQA](https://huggingface.co/BEE-spoke-data/smol_llama-101M-GQA).
-Pre-trained from scratch on 7B training tokens.
-The control is [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1), where the full dataset (0.6B tokens) is used.
-Not much degradation in performance despite only using 42% of the data.
-
+Pre-trained **from scratch** on 7B training tokens, with a quality filter applied to the datasets that reduces them to 0.26B tokens.
+The control is [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1), where the full dataset (0.6B tokens) is used.
+Not much degradation in performance despite only using **42%** of the data, thanks to the effective quality filter.
+In fact, upon inspection, the 6000-step checkpoint achieves similar performance to this model, signaling **effective training due to high-quality data**.
+It took only 1 day to train in Colab on an A100 40GB (**< USD 50**).
It achieves quite competitive evaluation results given its training token count and training data size.
Yet, there are still large gaps (particularly in ARC, HellaSwag, MMLU and GSM8K) between nano-phi-115M-v0.1 and phi-2, which the author will attempt to narrow in the future.
No alignment has been done yet.
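Since no alignment or instruction tuning has been applied, the checkpoint is best prompted as a plain text-completion model. The snippet below is a minimal usage sketch, assuming the repository loads through the standard transformers `AutoModelForCausalLM` / `AutoTokenizer` interface (add `trust_remote_code=True` if the checkpoint requires it); the prompt and sampling settings are arbitrary.

```python
# Minimal usage sketch; assumes the standard transformers AutoModel interface.
# Add trust_remote_code=True to both from_pretrained calls if the repo requires it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kenhktsui/nano-phi-115M-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# No alignment and no chat template: treat it as a plain text-completion model.
prompt = "Once upon a time, a tiny language model"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```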
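The quality filter itself is not published in this card, so the following is only a hypothetical sketch of the kind of document-level filtering described above: score every document, keep the high-quality fraction, and pre-train on what survives. The dataset id and the `quality_score` heuristic are placeholders rather than the author's actual pipeline.

```python
# Hypothetical sketch of a pre-training quality filter; the dataset id and the
# quality_score heuristic are placeholders, not the pipeline actually used here.
from datasets import load_dataset

def quality_score(text: str) -> float:
    """Toy heuristic: favour documents with reasonably long, non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    avg_len = sum(len(line) for line in lines) / len(lines)
    return min(avg_len / 80.0, 1.0)

# Placeholder corpus; assumes a "text" column holding one document per row.
raw = load_dataset("your/pretraining-corpus", split="train")

# Keep only documents above a threshold; on the real corpus this is the step that
# would cut the ~0.6B-token dataset down to the ~0.26B tokens quoted above.
filtered = raw.filter(lambda example: quality_score(example["text"]) > 0.5)
print(f"Kept {len(filtered):,} of {len(raw):,} documents")
```

In practice the scoring function would more likely be a trained quality classifier or a perplexity-based filter than a simple length heuristic.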
@@ -79,25 +80,25 @@ No alignment has been done yet.

## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

-| Config | kenhktsui/nano-phi-115M-v0.1 | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
-|---|---|---|---|
-| Model Params | 115M | 115M | 2.7B |
-| Dataset Size (tokens) | 0.26B | 0.6B | 250B |
-| Training Tokens | 7B | 7B | 1.4T |
-| Context Length | 1024 | 1024 | 2048 |
-| Device | 1xA100-40G | 1xA100-40G | 96xA100-80G |
-| Training Time | 2d4h | 2d4h | 14d |
-
-
-| Metric | kenhktsui/nano-phi-115M-v0.1 | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (Reproduced) |
-|---|---|---|---|
-| Avg. | 28.68 | 28.75 | 61.53 |
-| ARC (25-shot) | 21.93 | 21.67 | 61.52 |
-| HellaSwag (10-shot) | 27.87 | 26.89 | 75.13 |
-| MMLU (5-shot) | 25.30 | 24.76 | 58.23 |
-| TruthfulQA (0-shot) | 46.01 | 47.69 | 44.46 |
-| Winogrande (5-shot) | 50.99 | 51.46 | 74.51 |
-| GSM8K (5-shot) | 0.0 | 0.0 | 55.34 |
+| Config | kenhktsui/nano-phi-115M-v0.1 | kenhktsui/nano-phi-115M-v0.1 (6000 steps) | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
+|---|---|---|---|---|
+| Model Params | 115M | 115M | 115M | 2.7B |
+| Dataset Size (tokens) | 0.26B | 0.26B | 0.6B | 250B |
+| Training Tokens | 7B | 3B | 7B | 1.4T |
+| Context Length | 1024 | 1024 | 1024 | 2048 |
+| Device | 1xA100-40G | 1xA100-40G | 1xA100-40G | 96xA100-80G |
+| Training Time | 2d4h | 1d | 2d4h | 14d |
+
+
+| Metric | kenhktsui/nano-phi-115M-v0.1 | kenhktsui/nano-phi-115M-v0.1 (6000 steps) | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (Reproduced) |
+|---|---|---|---|---|
+| Avg. | 28.68 | 29.03 | 28.75 | 61.53 |
+| ARC (25-shot) | 21.93 | 22.27 | 21.67 | 61.52 |
+| HellaSwag (10-shot) | 27.87 | 26.88 | 26.89 | 75.13 |
+| MMLU (5-shot) | 25.30 | 25.01 | 24.76 | 58.23 |
+| TruthfulQA (0-shot) | 46.01 | 48.03 | 47.69 | 44.46 |
+| Winogrande (5-shot) | 50.99 | 52.01 | 51.46 | 74.51 |
+| GSM8K (5-shot) | 0.0 | 0.0 | 0.0 | 55.34 |
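To sanity-check leaderboard-style numbers locally, one option is EleutherAI's lm-evaluation-harness, the backend behind the Open LLM Leaderboard. The sketch below assumes lm-eval >= 0.4 and reuses the task name and few-shot count from the table above; the leaderboard's exact harness revision and settings may differ, so small deviations from the reported scores are expected.

```python
# Rough local reproduction sketch with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). Assumes lm-eval >= 0.4; the leaderboard's exact harness
# revision and configuration may differ from this.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=kenhktsui/nano-phi-115M-v0.1,dtype=float32",
    tasks=["hellaswag"],  # reported 10-shot in the table above
    num_fewshot=10,
    batch_size=8,
)

# Per-task scores (e.g. acc / acc_norm) live under results["results"].
print(results["results"]["hellaswag"])
```

Each metric would be run with its own few-shot setting (25 for ARC, 10 for HellaSwag, 5 for MMLU, Winogrande and GSM8K, 0 for TruthfulQA), as in the table.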

Details: