This is a MoE-ification of TinyLlama/TinyLlama-1.1B-Chat-v1.0 using the Mixtral branch of mergekit.

The goal was to MoE-fy the TinyLlama model and then use the result as a base model to finetune from. The intuition is that finetuning an 8x1B mixture of experts should give better performance than finetuning a single 1B model on its own.
More work coming!
Chat Template
# TinyLlama-1.1B-Chat-v1.0 uses the ChatML prompt format.
def make_prompt(instruction):
    return f"<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"

# `llm` is an already-loaded inference engine (see the sketch below).
llm.generate(make_prompt('What is quantum tunneling?'))
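One way to construct `llm` is with vLLM, which accepts raw prompt strings. This is a minimal sketch rather than the exact setup used here: the repo id is a placeholder for this merge and the sampling parameters are illustrative.

```python
# Minimal sketch, assuming vLLM; the model id is a placeholder for this merge.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/TinyLlama-8x1B-Chat-MoE")  # placeholder repo id
params = SamplingParams(temperature=0.7, max_tokens=256)

def make_prompt(instruction):
    return f"<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"

outputs = llm.generate([make_prompt("What is quantum tunneling?")], params)
print(outputs[0].outputs[0].text)
```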
Mergekit Config
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
gate_mode: hidden
dtype: bfloat16
experts:
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts: [""]
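To reproduce the merge, here is a minimal sketch, assuming the config above is saved as `moe-config.yaml` and that mergekit's Mixtral branch is installed with its `mergekit-moe` entry point; the output directory name is a placeholder and exact options may differ between versions.

```python
# Minimal sketch: run mergekit-moe on the config above.
# Assumes mergekit's Mixtral branch is installed; `moe-config.yaml` and the
# output directory are placeholders.
import subprocess

subprocess.run(
    ["mergekit-moe", "moe-config.yaml", "./tinyllama-8x1b-moe"],
    check=True,
)
```

Because all eight experts start from the same checkpoint and the positive prompts are empty, the router has nothing to specialize on at merge time; the merged model is intended as a starting point for finetuning rather than a finished model.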
Eval
Thanks to u/mhenrichsen for the HellaSwag score
| Tasks     | Version | Filter | n-shot | Metric   |  Value |    | Stderr |
|-----------|---------|--------|-------:|----------|-------:|----|-------:|
| hellaswag | Yaml    | none   |      0 | acc      | 0.4657 | ±  | 0.0050 |
|           |         | none   |      0 | acc_norm | 0.6042 | ±  | 0.0049 |
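The table follows the lm-evaluation-harness output format. A sketch for reproducing the zero-shot HellaSwag numbers, assuming `lm_eval` >= 0.4 and a placeholder repo id; scores can vary slightly with harness version and dtype.

```python
# Sketch: zero-shot HellaSwag via lm-evaluation-harness (lm_eval >= 0.4).
# The pretrained id is a placeholder for this merge.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/TinyLlama-8x1B-Chat-MoE,dtype=bfloat16",
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```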