Update README.md
README.md CHANGED
@@ -56,10 +56,11 @@ license: mit
# Model Card for nano-phi-115M-v0.1

Inspired by [Phi2](https://huggingface.co/microsoft/phi-2) and open-source small language model attempts such as [smol_llama-101M-GQA](https://huggingface.co/BEE-spoke-data/smol_llama-101M-GQA).
-Pre-trained from scratch on 7B training tokens.
-The control is [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1), where the full dataset (0.6B tokens) is used.
-Not much degradation in performance despite only using 42% of the data.
-
+Pre-trained **from scratch** on 7B training tokens, with a quality filter applied to the datasets that reduces them to 0.26B tokens.
+The control is [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1), where the full dataset (0.6B tokens) is used.
+Not much degradation in performance despite only using **42%** of the data, thanks to the effective quality filter.
+In fact, upon inspection, the 6000-step checkpoint achieves similar performance to this model, signaling **effective training due to high-quality data**.
+It took only 1 day to train in Colab on an A100 40GB (**< USD 50**).
It achieves quite competitive evaluation results given its training token count and training data size.
Yet, there are still large gaps (particularly in ARC, HellaSwag, MMLU and GSM8K) between nano-phi-115M-v0.1 and phi-2, which the author will attempt to narrow in the future.
No alignment has been done yet.
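Since no alignment or instruction tuning has been applied, the checkpoint is best prompted as a plain text-completion model. The snippet below is a minimal usage sketch, assuming the repository loads through the standard transformers `AutoModelForCausalLM` / `AutoTokenizer` interface (add `trust_remote_code=True` if the checkpoint requires it); the prompt and sampling settings are arbitrary.

```python
# Minimal usage sketch; assumes the standard transformers AutoModel interface.
# Add trust_remote_code=True to both from_pretrained calls if the repo requires it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kenhktsui/nano-phi-115M-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# No alignment and no chat template: treat it as a plain text-completion model.
prompt = "Once upon a time, a tiny language model"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```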
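The quality filter itself is not published in this card, so the following is only a hypothetical sketch of the kind of document-level filtering described above: score every document, keep the high-quality fraction, and pre-train on what survives. The dataset id and the `quality_score` heuristic are placeholders rather than the author's actual pipeline.

```python
# Hypothetical sketch of a pre-training quality filter; the dataset id and the
# quality_score heuristic are placeholders, not the pipeline actually used here.
from datasets import load_dataset

def quality_score(text: str) -> float:
    """Toy heuristic: favour documents with reasonably long, non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    avg_len = sum(len(line) for line in lines) / len(lines)
    return min(avg_len / 80.0, 1.0)

# Placeholder corpus; assumes a "text" column holding one document per row.
raw = load_dataset("your/pretraining-corpus", split="train")

# Keep only documents above a threshold; on the real corpus this is the step that
# would cut the ~0.6B-token dataset down to the ~0.26B tokens quoted above.
filtered = raw.filter(lambda example: quality_score(example["text"]) > 0.5)
print(f"Kept {len(filtered):,} of {len(raw):,} documents")
```

In practice the scoring function would more likely be a trained quality classifier or a perplexity-based filter than a simple length heuristic.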
@@ -79,25 +80,25 @@ No alignment has been done yet.

## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

-| Config | kenhktsui/nano-phi-115M-v0.1 | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
-|---|---|---|---|
-| Model Params | 115M | 115M | 2.7B |
-| Dataset Size (tokens) | 0.26B | 0.6B | 250B |
-| Training Tokens | 7B | 7B | 1.4T |
-| Context Length | 1024 | 1024 | 2048 |
-| Device | 1xA100-40G | 1xA100-40G | 96xA100-80G |
-| Training Time | 2d4h | 2d4h | 14d |
-
-
-| Metric | kenhktsui/nano-phi-115M-v0.1 | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (Reproduced) |
-|---|---|---|---|
-| Avg. | 28.68 | 28.75 | 61.53 |
-| ARC (25-shot) | 21.93 | 21.67 | 61.52 |
-| HellaSwag (10-shot) | 27.87 | 26.89 | 75.13 |
-| MMLU (5-shot) | 25.30 | 24.76 | 58.23 |
-| TruthfulQA (0-shot) | 46.01 | 47.69 | 44.46 |
-| Winogrande (5-shot) | 50.99 | 51.46 | 74.51 |
-| GSM8K (5-shot) | 0.0 | 0.0 | 55.34 |
+| Config | kenhktsui/nano-phi-115M-v0.1 | kenhktsui/nano-phi-115M-v0.1 (6000 steps) | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
+|---|---|---|---|---|
+| Model Params | 115M | 115M | 115M | 2.7B |
+| Dataset Size (tokens) | 0.26B | 0.26B | 0.6B | 250B |
+| Training Tokens | 7B | 3B | 7B | 1.4T |
+| Context Length | 1024 | 1024 | 1024 | 2048 |
+| Device | 1xA100-40G | 1xA100-40G | 1xA100-40G | 96xA100-80G |
+| Training Time | 2d4h | 1d | 2d4h | 14d |
+
+
+| Metric | kenhktsui/nano-phi-115M-v0.1 | kenhktsui/nano-phi-115M-v0.1 (6000 steps) | [kenhktsui/nano-phi-115M-control-v0.1](https://huggingface.co/kenhktsui/nano-phi-115M-control-v0.1) | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (Reproduced) |
+|---|---|---|---|---|
+| Avg. | 28.68 | 29.03 | 28.75 | 61.53 |
+| ARC (25-shot) | 21.93 | 22.27 | 21.67 | 61.52 |
+| HellaSwag (10-shot) | 27.87 | 26.88 | 26.89 | 75.13 |
+| MMLU (5-shot) | 25.30 | 25.01 | 24.76 | 58.23 |
+| TruthfulQA (0-shot) | 46.01 | 48.03 | 47.69 | 44.46 |
+| Winogrande (5-shot) | 50.99 | 52.01 | 51.46 | 74.51 |
+| GSM8K (5-shot) | 0.0 | 0.0 | 0.0 | 55.34 |
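To sanity-check leaderboard-style numbers locally, one option is EleutherAI's lm-evaluation-harness, the backend behind the Open LLM Leaderboard. The sketch below assumes lm-eval >= 0.4 and reuses the task name and few-shot count from the table above; the leaderboard's exact harness revision and settings may differ, so small deviations from the reported scores are expected.

```python
# Rough local reproduction sketch with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). Assumes lm-eval >= 0.4; the leaderboard's exact harness
# revision and configuration may differ from this.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=kenhktsui/nano-phi-115M-v0.1,dtype=float32",
    tasks=["hellaswag"],  # reported 10-shot in the table above
    num_fewshot=10,
    batch_size=8,
)

# Per-task scores (e.g. acc / acc_norm) live under results["results"].
print(results["results"]["hellaswag"])
```

Each metric would be run with its own few-shot setting (25 for ARC, 10 for HellaSwag, 5 for MMLU, Winogrande and GSM8K, 0 for TruthfulQA), as in the table.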

Details: