---
language:
- en
library_name: transformers
license: cc-by-4.0
tags:
- kl3m
- kl3m-002
- legal
- financial
- enterprise
- slm
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- text: "Medical devices are regulated by"
inference:
  parameters:
    temperature: 0.3
    do_sample: true
---

# kl3m-002-520m (Draft) Model

**This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model. We are making this model public for historical reference and research, but you should probably consider using other models for production purposes.**

kl3m-520m is a (very) small language model (SLM) trained on clean, legally permissible data. Originally developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai), kl3m-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications) for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows, with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-520m is best suited either for SLM fine-tuning or for use in training larger models without relying on unethical data or models.

The model was originally trained between November 2023 and January 2024 on a single node with 12 RTX 4090 GPUs using DDP. A similar model is being provided with complete source and data replication as part of the `kl3m-004` family, to be released in Q4 2024.

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)

## Training Data

While the original data collection and training infrastructure rely on software that was not donated by 273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication code and an API:

[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

Data is currently available upon request via S3 under a Requester Pays model. We are actively working on a zero-cost distribution model and will adopt it as soon as we can obtain additional support.

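Requester Pays objects must be fetched with an explicit request-payer flag. Below is a minimal `boto3` sketch; the bucket and key names are placeholders, since the actual S3 paths are provided on request:

```python
import boto3

# Placeholder bucket and key: the real S3 paths are provided upon request.
s3 = boto3.client("s3")
response = s3.get_object(
    Bucket="kl3m-data-bucket-placeholder",
    Key="dataset/example-document.jsonl",
    RequestPayer="requester",  # the requester, not the bucket owner, pays transfer costs
)
payload = response["Body"].read()
```
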
This model, the original `kl3m-002-520m`, was trained on a US-only subset of the Kelvin Legal DataPack that we believe is 100% public domain material. However, to provide maximum transparency to all downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.

## Model Details

### Summary

- **Architecture**: Mixtral (`num_local_experts=4`, `num_experts_per_tok=2`)
- **Parameters**: 520 million
- **Context Window**: 1,024 tokens (`sliding_window=256`)
- **Language(s)**: Primarily English
- **Tokenizer**: kl3m-001-32k BPE tokenizer (32,768-entry vocabulary with unorthodox whitespace handling)
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to the [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs in real time in fp32 on CPU/M1+

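These values can be checked against the published configuration using the standard `transformers` auto classes; a quick sketch, with the values expected per the summary above shown as inline comments:

```python
from transformers import AutoConfig, AutoTokenizer

# Inspect the architecture parameters listed in the summary above.
config = AutoConfig.from_pretrained("alea-institute/kl3m-002-520m")
print(config.model_type)           # mixtral
print(config.num_local_experts)    # 4
print(config.num_experts_per_tok)  # 2
print(config.sliding_window)       # 256

# The kl3m-001-32k BPE tokenizer has a 32,768-entry vocabulary.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")
print(tokenizer.vocab_size)        # 32768
```
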
## Performance Metrics

N/A

## Key Features

- **Clean Training Data**: Built on what was originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible.
- **Low Toxicity**: [Empirically lower toxicity and bias](https://github.com/alea-institute/kl3m-toxicity)
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Use Cases

- Basic regulatory question answering
- Contract provision drafting
- Structured JSON information extraction (see the sketch after the contract example below)
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning

## Getting Started

```python
import json
from transformers import pipeline

# Load the model and tokenizer
p = pipeline('text-generation', 'alea-institute/kl3m-002-520m', device='cpu')

# Example usage on CPU
text = "Under this"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```

```json
[
  "Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
  "Under this proposed rule, the Department is proposing to amend the regulations in \u00a7\u00a7\u200951.2 ",
  "Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
]
```

## Contract Example

```python
text = "Governing Law."
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```

```json
[
  "Governing Law.\n (a) No provision of this Agreement shall be interpreted or construed to confer ",
  "Governing Law.\nThe law of the United States shall be interpreted and enforced in accordance",
  "Governing Law.\n (a) The validity of any contract or agreement to which the \nUnited States is "
]
```

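## JSON Extraction Example

The same pipeline can be pointed at the structured JSON information extraction use case listed above. This is a sketch only: the prompt and field names are hypothetical, and since this is a base model without instruction tuning, it simply continues a partial JSON object rather than following instructions.

```python
# Hypothetical prompt: a partial JSON object for the model to complete.
text = '{"document_type": "contract", "governing_law": "'
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.3, num_return_sequences=3, max_new_tokens=16)
        ],
        indent=2
    )
)
```
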
## Technical Implementation

The following techniques were used during training:

- Hybrid next-token prediction (NTP) and supervised fine-tuning (SFT) co-training
- Dynamic, document-aware segmentation
- Randomized padding (sketched below)
- Traditional fixed-attention mechanisms

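The original training code was not donated, so as an illustration only, here is a minimal sketch of one plausible reading of randomized padding: the padding needed to reach the context window is split randomly between the left and right of the sequence rather than always applied to one side. The helper, pad token id, and length below are assumptions, not the actual KL3M implementation.

```python
import random

def randomized_padding(token_ids: list[int], max_length: int = 1024, pad_token_id: int = 0) -> list[int]:
    """Hypothetical sketch: split the padding needed to reach max_length
    into a random left/right allocation rather than a fixed side."""
    token_ids = token_ids[:max_length]
    total_pad = max_length - len(token_ids)
    left_pad = random.randint(0, total_pad)
    return [pad_token_id] * left_pad + token_ids + [pad_token_id] * (total_pad - left_pad)
```
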
## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 license.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai

## Acknowledgments

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the ALEA Institute.

## Citation

Tokenizer, dataset, and model publications are pending.

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)