ipt-350m

ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).

It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases (ALiBi).
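To make the ALiBi idea concrete, here is a minimal pure-Python sketch of the bias described in the ALiBi paper: each head gets a fixed slope, and the attention score between a query at position i and a key at position j is penalized in proportion to their distance. The function names are illustrative, and the model's own remote code may lay out the bias tensor differently; folding the causal mask into the same matrix is a simplification made here.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes for a power-of-two head count, as in the ALiBi paper:
    a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Bias added to the attention scores of one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i).
    Future keys (j > i) get -inf, which folds the causal mask into the matrix.
    """
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(16)                # 16 heads, matching n_heads of ipt-350m
bias = alibi_bias(slopes[0], seq_len=4)  # tiny sequence for readability
```

Because the bias depends only on the distance i - j and not on any learned position table, the same function can be evaluated for any `seq_len`.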

ipt-350m is:


How to Use

import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'efederici/ipt-350m',
  trust_remote_code=True
)

Note: This model requires that trust_remote_code=True be passed to the from_pretrained method.
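Once the model is loaded, text generation could look like the sketch below. It assumes that the efederici/ipt-350m repository also ships a tokenizer loadable through AutoTokenizer, which this card does not state. The helper names load_model_and_tokenizer and generate_text are illustrative, not part of any library.

```python
def load_model_and_tokenizer(name: str = "efederici/ipt-350m"):
    """Load model and tokenizer (downloads weights on first use)."""
    import transformers  # imported lazily so generate_text works without it

    # Assumption: the repository provides tokenizer files.
    tokenizer = transformers.AutoTokenizer.from_pretrained(name)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        name,
        trust_remote_code=True,
    )
    return model, tokenizer


def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Tokenize the prompt, generate a continuation, and decode it to text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    # output_ids has shape (batch, tokens); decode the single sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would be `model, tokenizer = load_model_and_tokenizer()` followed by `print(generate_text(model, tokenizer, "La capitale d'Italia è"))`.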

To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0) with attn_impl='triton' and with bfloat16 precision:

import torch
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True
)

Although the model was trained with a sequence length of 2048, ALiBi allows the maximum sequence length to be increased during finetuning and/or inference:

import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True
)
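Since the limit applies to input and output tokens together, it can help to compute how many tokens are left for generation before calling the model. The helper below is a hypothetical convenience function, not part of transformers.

```python
def max_new_tokens_budget(prompt_len: int, max_seq_len: int = 4096) -> int:
    """Number of tokens that can still be generated for a prompt of
    prompt_len tokens when input + output must fit in max_seq_len."""
    if prompt_len < 0 or max_seq_len <= 0:
        raise ValueError("prompt_len must be >= 0 and max_seq_len must be > 0")
    return max(0, max_seq_len - prompt_len)


# A 3000-token prompt fits only after raising the limit from 2048 to 4096.
budget_default = max_new_tokens_budget(3000, max_seq_len=2048)
budget_extended = max_new_tokens_budget(3000, max_seq_len=4096)
```

The result can be passed as `max_new_tokens` to `generate`, clamped to whatever output length is actually wanted.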

Model Description

The architecture is a modification of a standard decoder-only transformer. As described above, positional embeddings are replaced with Attention with Linear Biases (ALiBi). The hyperparameters are:

| Hyperparameter  | Value |
|-----------------|-------|
| n_parameters    | 350M  |
| n_layers        | 24    |
| n_heads         | 16    |
| d_model         | 1024  |
| vocab size      | 50432 |
| sequence length | 2048  |
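As a sanity check, the 350M figure can be approximately reproduced from the other hyperparameters. The estimate below is a back-of-the-envelope sketch under assumptions this card does not state: an MLP expansion ratio of 4, no bias terms, input and output embeddings sharing weights, and layer-norm parameters ignored.

```python
def approx_param_count(
    n_layers: int,
    d_model: int,
    vocab_size: int,
    expansion_ratio: int = 4,  # assumption: MLP hidden size is 4 * d_model
) -> int:
    """Rough parameter count for a decoder-only transformer."""
    embedding = vocab_size * d_model               # token embedding (assumed tied with the output head)
    attention = 4 * d_model * d_model              # Q, K, V and output projections
    mlp = 2 * expansion_ratio * d_model * d_model  # up- and down-projection
    return embedding + n_layers * (attention + mlp)


total = approx_param_count(n_layers=24, d_model=1024, vocab_size=50432)
print(f"{total:,}")  # 353,632,256
```

Under these assumptions the estimate is about 354M, consistent with the nominal 350M.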

Dataset

The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on OSCAR-2301. Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
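The packing step described above can be sketched as follows: tokenized documents are concatenated into a running buffer, and fixed-length examples are cut from it. This is an illustration, not the actual preprocessing code. The optional `eos_id` separator and the decision to drop leftover tokens at the end are assumptions.

```python
from typing import Iterable, Iterator, List, Optional


def pack_sequences(
    tokenized_docs: Iterable[List[int]],
    seq_len: int = 2048,
    eos_id: Optional[int] = None,  # hypothetical separator appended after each document
) -> Iterator[List[int]]:
    """Concatenate documents and yield examples of exactly seq_len tokens."""
    buffer: List[int] = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        if eos_id is not None:
            buffer.append(eos_id)
        # Emit as many full examples as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Tokens left in the buffer (fewer than seq_len) are dropped in this sketch.


# Toy run with seq_len=4 so the result is easy to inspect.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
packed = list(pack_sequences(docs, seq_len=4))
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A single example may therefore span several documents, and a long document may be split across several examples.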

The vocabulary size is 50432, a multiple of 128 as suggested in MEGATRON-LM; with this choice, model FLOP utilization (MFU) increased by up to four percentage points.
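Rounding a vocabulary size up to a multiple of 128 is a one-line computation. The sketch below uses a hypothetical raw size of 50400 purely for illustration, since this card does not give the tokenizer's unpadded size.

```python
def pad_vocab_size(raw_size: int, multiple: int = 128) -> int:
    """Round raw_size up to the nearest multiple of `multiple`."""
    if raw_size <= 0 or multiple <= 0:
        raise ValueError("raw_size and multiple must be positive")
    return ((raw_size + multiple - 1) // multiple) * multiple


padded = pad_vocab_size(50400)  # hypothetical raw size, for illustration only
print(padded, padded % 128)     # 50432 0
```

The extra embedding rows correspond to token ids the tokenizer never produces.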
