amanrangapur committed (verified) · Commit 5dbe404 · 1 parent: ae29fed

Update README.md

Files changed (1): README.md (+17, -20)

README.md CHANGED
@@ -154,22 +154,22 @@ Both stages contribute equally to the final performance of the OLMo model. After

 ### Architecture

- OLMo 13B architecture with peer models for comparison.

- | | **OLMo 13B** | [OLMo 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
 |------------------------|-------------------|-------------------|---------------------|--------------------|--------------------|------------------|
 | d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
- | num heads | 32 | 32 | 32 | 32 | 71 | 16 |
- | num layers | 32 | 32 | 32 | 32 | 32 | 32 |
 | MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
- | LayerNorm type | non-parametric LN | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
 | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
 | attention variant | full | full | GQA | full | MQA | MQA |
 | biases | none | none | none | in LN only | in LN only | none |
 | block type | sequential | sequential | sequential | sequential | parallel | parallel |
 | activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
- | sequence length | 4096 | 2048 | 4096 | 2048 | 2048 | 2048 |
- | batch size (instances) | 1024 | 2160 | 1024 | 2048 | 2304 | 512 |
 | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
 | weight tying | no | no | no | no | no | yes |

@@ -180,36 +180,33 @@ AdamW optimizer parameters are shown below.

 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
- | 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
- | 13B | 9.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |

 Optimizer settings comparison with peer models.

- | | **OLMo2 7B** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|------------------|---------------------|--------------------|--------------------|
- | warmup steps | 2000 | 5000 | 2000 | 2000 | 1000 |
- | peak LR | 4.0E-04 | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
 | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
 | weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
 | beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
 | beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
- | epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
- | LR schedule | cosine | linear | cosine | cosine | cosine |
 | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
 | gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
 | optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |


-
 ## Bias, Risks, and Limitations

- Like any base language model or fine-tuned model without safety filtering, it is relatively easy for a user to prompt these models to generate harmful and generally sensitive content.
- Such content can also be produced unintentionally, especially in the case of bias, so we recommend users consider the risks of applications of this technology.

- Otherwise, many facts from OLMo or any LLM will often not be true, so they should be checked.


- ## Citation

 **BibTeX:**

@@ -225,7 +222,7 @@ Otherwise, many facts from OLMo or any LLM will often not be true, so they shoul
 **APA:**

 Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Dodge, J., Lo, K., Soldaini, L., Smith, N., & Hajishirzi, H. (2024). OLMo: Accelerating the Science of Language Models. Preprint.
-
 ## Model Card Contact

 
 

 ### Architecture

+ OLMo 7B architecture with peer models for comparison.

+ | | **OLMo2 7B** | [OLMo2 13B](https://huggingface.co/allenai/OLMo2-13B-1124) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
 |------------------------|-------------------|-------------------|---------------------|--------------------|--------------------|------------------|
 | d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
+ | num heads | 32 | 42 | 32 | 32 | 71 | 16 |
+ | num layers | 32 | 40 | 32 | 32 | 32 | 32 |
 | MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
+ | LayerNorm type | RMS Norm | RMS Norm | RMSNorm | parametric LN | parametric LN | parametric LN |
 | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
 | attention variant | full | full | GQA | full | MQA | MQA |
 | biases | none | none | none | in LN only | in LN only | none |
 | block type | sequential | sequential | sequential | sequential | parallel | parallel |
 | activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
+ | sequence length | 4096 | 4096 | 4096 | 2048 | 2048 | 2048 |
+ | batch size (instances) | 1024 | 2048 | 1024 | 2048 | 2304 | 512 |
 | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
 | weight tying | no | no | no | no | no | yes |

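A quick way to sanity-check the updated architecture table is to read the corresponding fields off the config shipped with a checkpoint. The sketch below is illustrative only and not part of the model card: it uses the OLMo2 13B repository linked in the table, assumes a transformers release with OLMo2 support, and assumes the usual transformers field names (anything a given config class names differently simply prints as None).

```python
# Illustrative sketch: compare the architecture table above with the config
# shipped alongside the checkpoint. Repo id taken from the table's OLMo2 13B
# link; swap in the 7B repository to check the 7B column.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("allenai/OLMo2-13B-1124")

# Standard transformers field names (assumed); missing fields print as None.
for field in (
    "hidden_size",              # d_model (4096)
    "num_attention_heads",      # num heads (42 for the 13B column)
    "num_hidden_layers",        # num layers (40 for the 13B column)
    "intermediate_size",        # SwiGLU MLP width, ~8/3 * d_model
    "max_position_embeddings",  # training sequence length (4096)
    "hidden_act",               # activation gate (SwiGLU)
    "tie_word_embeddings",      # weight tying (no)
):
    print(f"{field}: {getattr(config, field, None)}")
```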
 
 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
+ | 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-8 | 0.1 |
+ | 13B | 9.0E-4 | (0.9, 0.95) | 1.0E-8 | 0.1 |

 Optimizer settings comparison with peer models.

+ | | **OLMo2 7B** | [OLMo2 13B](https://huggingface.co/allenai/OLMo2-13B-1124) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|------------------|---------------------|--------------------|--------------------|
+ | warmup steps | 2000 | 2000 | 2000 | 2000 | 1000 |
+ | peak LR | 3.0E-04 | 9.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
 | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
 | weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
 | beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
 | beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
+ | epsilon | 1.0E-08 | 1.0E-08 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
+ | LR schedule | cosine | cosine | cosine | cosine | cosine |
 | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
 | gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
 | optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |

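For readers who want to mirror the optimizer setup, the OLMo2 7B column of the tables above maps onto plain PyTorch roughly as follows. This is a sketch rather than the project's training code: the stand-in model and the total step count are placeholders, and the warmup-plus-cosine schedule is written out by hand to decay from the peak LR to the listed minimum LR, with global gradient clipping at 1.0.

```python
# Rough PyTorch equivalent of the OLMo2 7B optimizer settings listed above.
# The tiny stand-in model and total_steps are placeholders, not real values.
import math
import torch

model = torch.nn.Linear(4096, 4096)          # stand-in for the actual model
peak_lr, min_lr = 3.0e-4, 3.0e-5             # peak LR / minimum LR (7B column)
warmup_steps, total_steps = 2000, 100_000    # 2000-step warmup; total is assumed

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    eps=1.0e-8,
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per training step (gradients clipped to a global norm of 1.0):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```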
 ## Bias, Risks, and Limitations

+ Like any base language model or fine-tuned model without safety filtering, these models can easily be prompted by users to generate harmful and sensitive content. Such content may also be produced unintentionally, especially in cases involving bias, so we recommend that users consider the risks when applying this technology. Additionally, many statements from OLMo or any LLM are often inaccurate, so facts should be verified.



+ <!-- ## Citation

 **BibTeX:**

 **APA:**

 Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Dodge, J., Lo, K., Soldaini, L., Smith, N., & Hajishirzi, H. (2024). OLMo: Accelerating the Science of Language Models. Preprint.
+ -->
 ## Model Card Contact
