amanrangapur committed (verified) · Commit 5dbe404 · 1 parent: ae29fed

Update README.md

Files changed (1): README.md (+17, -20)

README.md CHANGED
@@ -154,22 +154,22 @@ Both stages contribute equally to the final performance of the OLMo model. After

 ### Architecture

- OLMo 13B architecture with peer models for comparison.

- | | **OLMo 13B** | [OLMo 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
 |------------------------|-------------------|-------------------|---------------------|--------------------|--------------------|------------------|
 | d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
- | num heads | 32 | 32 | 32 | 32 | 71 | 16 |
- | num layers | 32 | 32 | 32 | 32 | 32 | 32 |
 | MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
- | LayerNorm type | non-parametric LN | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
 | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
 | attention variant | full | full | GQA | full | MQA | MQA |
 | biases | none | none | none | in LN only | in LN only | none |
 | block type | sequential | sequential | sequential | sequential | parallel | parallel |
 | activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
- | sequence length | 4096 | 2048 | 4096 | 2048 | 2048 | 2048 |
- | batch size (instances) | 1024 | 2160 | 1024 | 2048 | 2304 | 512 |
 | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
 | weight tying | no | no | no | no | no | yes |

@@ -180,36 +180,33 @@ AdamW optimizer parameters are shown below.

 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
- | 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
- | 13B | 9.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |

 Optimizer settings comparison with peer models.

- | | **OLMo2 7B** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|------------------|---------------------|--------------------|--------------------|
- | warmup steps | 2000 | 5000 | 2000 | 2000 | 1000 |
- | peak LR | 4.0E-04 | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
 | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
 | weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
 | beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
 | beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
- | epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
- | LR schedule | cosine | linear | cosine | cosine | cosine |
 | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
 | gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
 | optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |


-
 ## Bias, Risks, and Limitations

- Like any base language model or fine-tuned model without safety filtering, it is relatively easy for a user to prompt these models to generate harmful and generally sensitive content.
- Such content can also be produced unintentionally, especially in the case of bias, so we recommend users consider the risks of applications of this technology.

- Otherwise, many facts from OLMo or any LLM will often not be true, so they should be checked.


- ## Citation

 **BibTeX:**

@@ -225,7 +222,7 @@ Otherwise, many facts from OLMo or any LLM will often not be true, so they shoul
 **APA:**

 Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Dodge, J., Lo, K., Soldaini, L., Smith, N., & Hajishirzi, H. (2024). OLMo: Accelerating the Science of Language Models. Preprint.
-
 ## Model Card Contact

 
 

 ### Architecture

+ OLMo 7B architecture with peer models for comparison.

+ | | **OLMo2 7B** | [OLMo2 13B](https://huggingface.co/allenai/OLMo2-13B-1124) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
 |------------------------|-------------------|-------------------|---------------------|--------------------|--------------------|------------------|
 | d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
+ | num heads | 32 | 42 | 32 | 32 | 71 | 16 |
+ | num layers | 32 | 40 | 32 | 32 | 32 | 32 |
 | MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
+ | LayerNorm type | RMS Norm | RMS Norm | RMSNorm | parametric LN | parametric LN | parametric LN |
 | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
 | attention variant | full | full | GQA | full | MQA | MQA |
 | biases | none | none | none | in LN only | in LN only | none |
 | block type | sequential | sequential | sequential | sequential | parallel | parallel |
 | activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
+ | sequence length | 4096 | 4096 | 4096 | 2048 | 2048 | 2048 |
+ | batch size (instances) | 1024 | 2048 | 1024 | 2048 | 2304 | 512 |
 | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
 | weight tying | no | no | no | no | no | yes |

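A quick way to sanity-check the updated architecture table is to read the corresponding fields off the config shipped with a checkpoint. The sketch below is illustrative only and not part of the model card: it uses the OLMo2 13B repository linked in the table, assumes a transformers release with OLMo2 support, and assumes the usual transformers field names (anything a given config class names differently simply prints as None).

```python
# Illustrative sketch: compare the architecture table above with the config
# shipped alongside the checkpoint. Repo id taken from the table's OLMo2 13B
# link; swap in the 7B repository to check the 7B column.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("allenai/OLMo2-13B-1124")

# Standard transformers field names (assumed); missing fields print as None.
for field in (
    "hidden_size",              # d_model (4096)
    "num_attention_heads",      # num heads (42 for the 13B column)
    "num_hidden_layers",        # num layers (40 for the 13B column)
    "intermediate_size",        # SwiGLU MLP width, ~8/3 * d_model
    "max_position_embeddings",  # training sequence length (4096)
    "hidden_act",               # activation gate (SwiGLU)
    "tie_word_embeddings",      # weight tying (no)
):
    print(f"{field}: {getattr(config, field, None)}")
```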
 
 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
+ | 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-8 | 0.1 |
+ | 13B | 9.0E-4 | (0.9, 0.95) | 1.0E-8 | 0.1 |

 Optimizer settings comparison with peer models.

+ | | **OLMo2 7B** | [OLMo2 13B](https://huggingface.co/allenai/OLMo2-13B-1124) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|------------------|---------------------|--------------------|--------------------|
+ | warmup steps | 2000 | 2000 | 2000 | 2000 | 1000 |
+ | peak LR | 3.0E-04 | 9.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
 | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
 | weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
 | beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
 | beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
+ | epsilon | 1.0E-08 | 1.0E-08 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
+ | LR schedule | cosine | cosine | cosine | cosine | cosine |
 | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
 | gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
 | optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |

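For readers who want to mirror the optimizer setup, the OLMo2 7B column of the tables above maps onto plain PyTorch roughly as follows. This is a sketch rather than the project's training code: the stand-in model and the total step count are placeholders, and the warmup-plus-cosine schedule is written out by hand to decay from the peak LR to the listed minimum LR, with global gradient clipping at 1.0.

```python
# Rough PyTorch equivalent of the OLMo2 7B optimizer settings listed above.
# The tiny stand-in model and total_steps are placeholders, not real values.
import math
import torch

model = torch.nn.Linear(4096, 4096)          # stand-in for the actual model
peak_lr, min_lr = 3.0e-4, 3.0e-5             # peak LR / minimum LR (7B column)
warmup_steps, total_steps = 2000, 100_000    # 2000-step warmup; total is assumed

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    eps=1.0e-8,
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per training step (gradients clipped to a global norm of 1.0):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```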
 ## Bias, Risks, and Limitations

+ Like any base language model or fine-tuned model without safety filtering, these models can easily be prompted by users to generate harmful and sensitive content. Such content may also be produced unintentionally, especially in cases involving bias, so we recommend that users consider the risks when applying this technology. Additionally, many statements from OLMo or any LLM are often inaccurate, so facts should be verified.



+ <!-- ## Citation

 **BibTeX:**

 **APA:**

 Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Dodge, J., Lo, K., Soldaini, L., Smith, N., & Hajishirzi, H. (2024). OLMo: Accelerating the Science of Language Models. Preprint.
+ -->
 ## Model Card Contact
