LightGPT

LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the people! Built using pure PyTorch, LightGPT can answer questions, summarize documents, chat, and more. A unique feature of LightGPT is that you can train larger models on smaller hardware by progressively enabling memory-saving features at train time such as activation checkpointing, mixed precision, and zero-redundancy (ZeRO) distributed pre-training using fully-sharded data parallel (FSDP).

Features

  • Parameter-efficiency: LightGPT aims to be a more parsimonious model by only training parameters that are absolutely necessary. As such, biases and positional embeddings have been completely removed from the architecture. In addition, the token embeddings and output layer share a weight matrix, resulting in a buy-one-get-one-free deal on trainable parameters (see the sketch after this list).

  • Low Memory Utilization: LightGPT employs a number of training-time optimizations that conserve precious GPU memory. With zero-redundancy distributed pre-training using fully-sharded data-parallel (FSDP), activation checkpointing, and automatic mixed precision, you'll be able to train larger models by accepting a relatively small amount of overhead.

  • Fully Open-source: Unlike closed-source LLMs, LightGPT provides both the model weights and the source code to train, fine-tune, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize AI and continually improve the models.
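
To make the weight-sharing trick concrete, here is a minimal sketch (the class and attribute names are hypothetical, not LightGPT's actual modules) of tying the token embedding and output projection weights in PyTorch:

```python
import torch
import torch.nn as nn

class TiedOutputHead(nn.Module):
    """Toy example of tying the token embedding and output projection weights."""

    def __init__(self, vocab_size: int, embedding_dimensions: int):
        super().__init__()

        self.token_embeddings = nn.Embedding(vocab_size, embedding_dimensions)

        # No bias, matching the bias-free design described above.
        self.output_layer = nn.Linear(embedding_dimensions, vocab_size, bias=False)

        # Point both modules at the same weight tensor so the vocab_size x
        # embedding_dimensions matrix is only stored (and trained) once.
        self.output_layer.weight = self.token_embeddings.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.token_embeddings(token_ids)

        # ... the Transformer blocks would run here in the real model ...

        return self.output_layer(x)
```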

Default Configurations

Below is a table of recommended default model training configurations, but feel free to experiment with settings on your own. See the model_sizing.ipynb notebook to estimate the memory and compute requirements for your model configuration.

| Name | Vocab. Size | Block Size | Embedding Dim. | Attn. Heads | Layers | Params | Train Tokens |
|---|---|---|---|---|---|---|---|
| Small | 50,257 | 1024 | 1024 | 16 | 32 | 454M | 10B |
| Medium | 50,257 | 1024 | 2048 | 32 | 32 | 1.7B | 20B |
| Large | 100,275 | 2048 | 4096 | 64 | 32 | 6.8B | 100B |
| X-large | 100,275 | 2048 | 4096 | 64 | 64 | 13B | 350B |
| XX-large | 200,017 | 4096 | 8192 | 128 | 64 | 53B | 1T |
| XXX-large | 200,017 | 4096 | 8192 | 128 | 128 | 105B | 3T |
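
As a rough cross-check of the table above, the sketch below estimates the parameter count of a bias-free, weight-tied decoder-only Transformer, assuming a 4x feed-forward expansion and ignoring the small contribution of the normalization layers; the model_sizing.ipynb notebook remains the authoritative calculator.

```python
def estimate_parameters(vocab_size: int, embedding_dim: int, num_layers: int) -> int:
    """Rough parameter count for a bias-free decoder-only Transformer
    with tied input/output embeddings and a 4x feed-forward expansion."""
    embedding = vocab_size * embedding_dim          # shared with the output layer
    attention = 4 * embedding_dim**2                # Q, K, V, and output projections
    mlp = 2 * (4 * embedding_dim) * embedding_dim   # up and down projections

    return embedding + num_layers * (attention + mlp)

# "Small" configuration from the table: prints roughly 454M.
print(f"{estimate_parameters(50_257, 1024, 32) / 1e6:.0f}M")
```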

Install Project Dependencies

Project dependencies are specified in the requirements.txt file. You can install them with pip using the following command from the project root. We recommend using a virtual environment such as venv to keep package dependencies on your system tidy.

python -m venv ./.venv

source ./.venv/bin/activate

pip install -r requirements.txt

Pre-training

For the pre-training corpus we use the FineWeb dataset, which consists of about 15T high-quality tokens gathered from the web. The dataset has been split into three subsets (the 10BT, 100BT, and 350BT samples) for training smaller models. If you'd like to start training right away, the default settings should work on most single-GPU systems with 12GB of VRAM or more.

python pre-train.py

Note that it will take a while to download and pre-process the dataset the first time that the training script is run.
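
For a sense of what that first run involves, here is an illustrative sketch (not necessarily the code in pre-train.py) that streams a FineWeb subset with the Hugging Face datasets library and tokenizes it with tiktoken using the default r50k_base encoding:

```python
import tiktoken
from datasets import load_dataset

# Stream the 10B-token sample so the corpus never has to fit in memory at once.
dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

encoder = tiktoken.get_encoding("r50k_base")

for sample in dataset.take(1):
    tokens = encoder.encode_ordinary(sample["text"])

    print(len(tokens), tokens[:10])
```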

To customize the default "lightgpt-small" architecture you can adjust the block_size, embedding_dimensions, num_hidden_layers, and num_attention_heads arguments of the pre-training script. Refer to the model_sizing.ipynb notebook for an estimation of the memory and compute requirements for your chosen architecture.

python pre-train.py --block_size=2048 --embedding_dimensions=4096 --num_hidden_layers=64 --num_attention_heads=64

You can also adjust the batch_size, learning_rate, and gradient_accumulation_steps to suit your training setup.

python pre-train.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=128

For distributed training, use PyTorch's torchrun utility to launch a distributed data-parallel session. The example below runs the training script on a single node with 8 GPUs.

torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16 --gradient_accumulation_steps=128
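
For reference, the memory-saving features described earlier map onto standard PyTorch APIs roughly as in the hedged sketch below; the Block module is a placeholder rather than LightGPT's actual model, and the mapping of --ddp_sharding_level to sharding strategies is our assumption. Launch it with torchrun as shown above.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Placeholder block standing in for LightGPT's real attention/MLP blocks."""

    def __init__(self, dim: int):
        super().__init__()

        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim, bias=False),
            nn.GELU(),
            nn.Linear(4 * dim, dim, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation checkpointing: recompute activations in the backward pass
        # instead of storing them, trading runtime for memory.
        return x + checkpoint(self.mlp, x, use_reentrant=False)

dist.init_process_group("nccl")  # torchrun provides the rank/world-size env vars

torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[Block(1024) for _ in range(4)]).cuda()

model = FSDP(
    model,
    # Assumption: SHARD_GRAD_OP for partial sharding (level 2);
    # ShardingStrategy.FULL_SHARD would correspond to full sharding (level 3).
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
)
```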

Note that when training in data-parallel mode it's important that gradient_accumulation_steps is evenly divisible by the world size for maximum performance. For example, on an 8-GPU cluster we could perform 32 gradient accumulation steps in exactly 4 passes over the network.
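
A tiny sketch of that arithmetic, assuming the accumulation steps are spread evenly across the ranks:

```python
def passes_per_update(gradient_accumulation_steps: int, world_size: int) -> int:
    """Forward/backward passes each rank performs before a weight update,
    assuming the accumulation steps are spread evenly across the ranks."""
    assert gradient_accumulation_steps % world_size == 0, "should divide evenly"

    return gradient_accumulation_steps // world_size

print(passes_per_update(32, 8))  # 4, matching the 8-GPU example above
```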

Text Generation

After training, you can generate text from the model by running the generate.py script from the command line. This inference script samples tokens from the model one at a time, conditioned on the prompt and any previously generated tokens, which together are referred to as the context window. In the example below we only sample from the top_k most probable candidate tokens, further restricted to those within the top_p cumulative probability mass when ordered by descending probability.

python generate.py --top_k=500 --top_p=0.9
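
The filtering described above can be sketched as follows; this is an illustration of combined top-k and nucleus (top-p) sampling over a single logits vector, not necessarily the exact implementation in generate.py:

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 1.0,
                 top_k: int = 500, top_p: float = 0.9) -> int:
    """Sample one token id from a 1D logits vector using top-k then top-p filtering."""
    logits = logits / temperature

    # Keep only the top_k highest-scoring candidates.
    top_logits, top_indices = torch.topk(logits, k=min(top_k, logits.numel()))

    probabilities = torch.softmax(top_logits, dim=-1)

    # Drop candidates that fall outside the top_p cumulative probability mass.
    cumulative = torch.cumsum(probabilities, dim=-1)
    probabilities[cumulative - probabilities > top_p] = 0.0
    probabilities /= probabilities.sum()

    choice = torch.multinomial(probabilities, num_samples=1)

    return int(top_indices[choice])

print(sample_token(torch.randn(50_257)))
```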

We also provide a script, beam_search.py, that searches over entire sequences rather than sampling tokens one at a time. Beam search maintains a list of the top beam_width candidate sequences and outputs the num_candidates completed sequences with the highest overall score. It is a form of greedy search that works well for tasks such as text summarization and translation, but it often produces less natural responses since natural language follows a more stochastic process.

python beam_search.py --beam_width=16 --num_candidates=3
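
To illustrate the candidate bookkeeping, here is a stripped-down sketch of beam search over a toy next-token distribution; next_token_log_probs stands in for the model and the stopping rule is simplified to a fixed number of steps, so this is not the logic of beam_search.py itself:

```python
import torch

def beam_search(next_token_log_probs, prompt: list[int], beam_width: int = 16,
                num_candidates: int = 3, max_tokens: int = 8) -> list[tuple[list[int], float]]:
    """Keep the beam_width highest-scoring sequences at every step and
    return the num_candidates best ones after max_tokens extensions."""
    beams = [(prompt, 0.0)]  # (token ids, cumulative log probability)

    for _ in range(max_tokens):
        candidates = []

        for tokens, score in beams:
            log_probs = next_token_log_probs(tokens)

            top_log_probs, top_ids = torch.topk(log_probs, k=beam_width)

            for log_prob, token_id in zip(top_log_probs.tolist(), top_ids.tolist()):
                candidates.append((tokens + [token_id], score + log_prob))

        # Prune back down to the beam_width best partial sequences.
        beams = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:beam_width]

    return beams[:num_candidates]

# Toy "model": a fixed random distribution over a 100-token vocabulary.
vocabulary_logits = torch.randn(100)

completions = beam_search(lambda tokens: torch.log_softmax(vocabulary_logits, dim=-1), prompt=[0])

for tokens, score in completions:
    print(tokens, round(score, 2))
```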

Instruction-tuning

Soon ...

Pre-training Arguments

| Argument | Default | Type | Description |
|---|---|---|---|
| --dataset_subset | "sample-10BT" | str | The subset of the FineWeb dataset to train on. Options are sample-10BT, sample-100BT, and sample-350BT. Set to None to train on the full 15T token dataset. |
| --token_encoding | "r50k_base" | str | The encoding scheme to use when tokenizing the dataset. Options include r50k_base, cl100k_base, and o200k_base. |
| --dataset_path | "./dataset" | str | The path to the preprocessed dataset files on disk. |
| --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
| --batch_size | 1 | int | The number of samples to pass through the network at a time. |
| --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
| --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
| --learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
| --max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
| --num_epochs | 2384 | int | The number of epochs to train for. |
| --eval_interval | 10 | int | Evaluate the model on the testing set after this many epochs. |
| --block_size | 1024 | int | The number of tokens within the context window for every sample. |
| --embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
| --num_attention_heads | 16 | int | The number of attention heads within every block. |
| --num_hidden_layers | 32 | int | The number of attention/MLP blocks within the hidden layer of the network. |
| --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization at the cost of about 30% more runtime per epoch. |
| --ddp_sharding_level | 2 | int | The level of sharding to use for DDP training. Options are 2 or 3 for partial and full sharding respectively, or 0 for no sharding. |
| --checkpoint_interval | 20 | int | Save the model parameters to disk every this many epochs. |
| --checkpoint_path | "./out/checkpoint.pt" | str | The path to the checkpoint file on disk. |
| --resume | False | bool | Should we resume training from the last checkpoint? |
| --device | "cuda" | str | The device to run the computation on. |
| --seed | None | int | The seed for the random number generator. |

Instruction-tuning Arguments

| Argument | Default | Type | Description |
|---|---|---|---|
| --base_model_path | "./out/checkpoint.pt" | str | The path to the pre-trained model. |
| --batch_size | 1 | int | The number of samples to pass through the network at a time. |
| --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
| --learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
| --mask_input | False | bool | Should we mask the input part of the sample, i.e. only train on the output? |
| --rank | 8 | int | The rank of the LoRA decomposition matrices. |
| --alpha | 1.0 | float | The strength of the LoRA signal. |
| --dropout | 0.05 | float | The proportion of signals to send to zero during training as regularization. |
| --num_epochs | 4 | int | The number of epochs to train for. |
| --eval_interval | 1 | int | Evaluate the model on the testing set after this many epochs. |
| --checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
| --checkpoint_path | "./out/lora_instruction.pt" | str | The path to the checkpoint file on disk. |
| --resume | False | bool | Should we resume training from the last checkpoint? |
| --device | "cuda" | str | The device to run the computation on. |
| --seed | None | int | The seed for the random number generator. |
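
The rank, alpha, and dropout arguments above refer to low-rank adaptation (LoRA) of the pre-trained weights. Below is a minimal sketch of a LoRA-adapted linear layer; the exact scaling convention and initialization are assumptions on our part and may differ from the project's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: W x + alpha * B(A x)."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 1.0, dropout: float = 0.05):
        super().__init__()

        self.linear = linear
        self.linear.weight.requires_grad = False  # only the LoRA matrices are trained

        self.lora_a = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(linear.out_features, rank))  # zero init so training starts at the base model

        self.alpha = alpha  # note: some implementations scale by alpha / rank instead
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = self.dropout(x) @ self.lora_a.t() @ self.lora_b.t()

        return self.linear(x) + self.alpha * update
```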

Generation Arguments

| Argument | Default | Type | Description |
|---|---|---|---|
| --checkpoint_path | "./out/checkpoint.pt" | str | The path to the checkpoint file on disk. |
| --lora_path | None | str | The path to the LoRA checkpoint. |
| --max_tokens | 500 | int | The maximum number of tokens that the model should generate per sample. |
| --temperature | 1.0 | float | The temperature used to scale the candidate token probabilities. Values below 1.0 sharpen the distribution; values above 1.0 flatten it. |
| --top_k | 500 | int | Only sample from this many candidate tokens with the highest probabilities. |
| --top_p | 0.9 | float | Of the top_k tokens, drop all but the top_p portion of the cumulative probability distribution. |
| --device | "cuda" | str | The device to run the computation on. |
| --seed | None | int | The seed for the random number generator. |

Beam Search Arguments

| Argument | Default | Type | Description |
|---|---|---|---|
| --checkpoint_path | "./out/checkpoint.pt" | str | The path to the checkpoint file on disk. |
| --lora_path | None | str | The path to the LoRA checkpoint. |
| --max_tokens | 200 | int | The maximum number of tokens that the model should generate per sample. |
| --num_candidates | 3 | int | The number of candidate sequences to output. |
| --beam_width | 16 | int | The number of candidate sequences to keep track of during search. |
| --device | "cuda" | str | The device to run the computation on. |
| --seed | None | int | The seed for the random number generator. |

References:

  • A. Radford, et al. Language Models are Unsupervised Multitask Learners, OpenAI, 2019.
  • T. Brown, et al. Language Models are Few-Shot Learners. OpenAI, 2020.
  • A. Kazemnejad, et al. The Impact of Positional Encoding on Length Generalization in Transformers, 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
  • S. Rajbhandari, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020.
  • J. R. Hermans, et al. Accumulated Gradient Normalization, JMLR: Workshop and Conference Proceedings, 2017.
  • T. Chen, et al. Training Deep Nets with Sublinear Memory Cost. MIT, 2019.
  • B. Zhang, et al. Root Mean Square Layer Normalization. 33rd Conference on Neural Information Processing Systems, NeurIPS 2019.