Roberta Zinc 480m
This is a RoBERTa-style masked language model trained on ~480M SMILES strings from the ZINC database. The model has ~102M parameters and was trained for 150,000 iterations with a batch size of 4,096, reaching a validation loss of ~0.122. It is useful for generating embeddings from SMILES strings.
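As a rough sanity check on the training figures above, the number of sequences seen during training can be estimated from the iteration count and batch size. This is only a back-of-the-envelope sketch: it assumes every iteration consumed one full batch and treats the approximate dataset size as exactly 480 million.

```python
# Figures quoted in the model description
iterations = 150_000
batch_size = 4_096
dataset_size = 480_000_000  # assumption: "~480M" treated as an exact count

# Total sequences processed, assuming one full batch per iteration
sequences_seen = iterations * batch_size

# Approximate number of passes over the dataset
approx_epochs = sequences_seen / dataset_size

print(f"sequences seen: {sequences_seen:,}")
print(f"approximate epochs: {approx_epochs:.2f}")
```

Under these assumptions the model saw each SMILES string a little more than once on average.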
import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM, DataCollatorWithPadding

tokenizer = RobertaTokenizerFast.from_pretrained("entropy/roberta_zinc_480m", max_len=128)
model = RobertaForMaskedLM.from_pretrained('entropy/roberta_zinc_480m')
collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors='pt')

smiles = ['Brc1cc2c(NCc3ccccc3)ncnc2s1',
          'Brc1cc2c(NCc3ccccn3)ncnc2s1',
          'Brc1cc2c(NCc3cccs3)ncnc2s1',
          'Brc1cc2c(NCc3ccncc3)ncnc2s1',
          'Brc1cc2c(Nc3ccccc3)ncnc2s1']

# Tokenize the SMILES strings and pad them to a common length
inputs = collator(tokenizer(smiles))

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last-layer hidden states, shape (batch, sequence_length, hidden_size)
full_embeddings = outputs.hidden_states[-1]

# Mean-pool over real tokens only; the attention mask zeroes out padding positions
mask = inputs['attention_mask']
embeddings = ((full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1))
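The last line of the snippet above averages the token vectors of each sequence while ignoring padded positions. To make that arithmetic easy to verify, here is a minimal sketch of the same masked mean pooling on toy data. It uses plain Python lists instead of tensors so it runs without any dependencies; the function name `masked_mean_pool` and the toy values are illustrative assumptions, not part of the model's API.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping positions where the mask is 0.

    hidden_states: list of sequences, each a list of equal-length token vectors
    attention_mask: list of sequences of 0/1 flags (1 = real token, 0 = padding)
    """
    pooled = []
    for token_vectors, mask_row in zip(hidden_states, attention_mask):
        totals = [0.0] * len(token_vectors[0])
        for vector, keep in zip(token_vectors, mask_row):
            if keep:
                for i, value in enumerate(vector):
                    totals[i] += value
        # Divide by the number of real tokens, not the padded length
        real_tokens = sum(mask_row)
        pooled.append([total / real_tokens for total in totals])
    return pooled


# Toy batch: 2 sequences, 3 positions each, hidden size 2.
# The last position of the first sequence is padding (deliberately non-zero).
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],
]
attention_mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)  # [[2.0, 3.0], [4.0, 6.0]]
```

The padded vector `[9.0, 9.0]` does not affect the first result, which is the reason for multiplying by the mask and dividing by the mask sum rather than taking a plain mean over the sequence axis.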
Decoder
There is also a decoder model trained to reconstruct inputs from embeddings.
license: mit