# ModernBERT-base-zeroshot-v2.0

## Model description
This model is answerdotai/ModernBERT-base fine-tuned on the same dataset mix as the zeroshot-v2.0 models in the Zeroshot Classifiers Collection.
General takeaways:
- The model is very fast and memory efficient. It is several times faster and consumes several times less memory than DeBERTa-v3. The memory efficiency enables larger batch sizes, and I got a ~2x speed increase by enabling bf16 (instead of fp16); see the usage sketch after this list.
- It performs slightly worse than DeBERTa-v3 on average on the tasks tested below.
- I am in the process of preparing a newer version trained on better synthetic data to make full use of the 8k context window and to update the training mix of the older zeroshot-v2.0 models.
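For context, here is a minimal sketch of how a zeroshot-v2.0-style model is typically called through the transformers zero-shot classification pipeline. The Hub repo id, example text, and candidate labels below are illustrative assumptions, not taken from this card.

```python
# Minimal zero-shot classification sketch. The repo id, text, and labels
# are assumptions for illustration only.
import torch
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/ModernBERT-base-zeroshot-v2.0",  # assumed Hub repo id
    torch_dtype=torch.bfloat16,  # bf16 gave the ~2x speedup mentioned above
    device_map="auto",
)

text = "The new graphics card delivers great performance but runs hot under load."
candidate_labels = ["hardware", "software", "sports", "politics"]

result = classifier(text, candidate_labels, multi_label=False)
print(result["labels"][0], round(result["scores"][0], 3))
```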
## Training results
| Datasets | Accuracy | F1 macro | Inference text/sec (A100 40GB GPU, batch=32) |
|---|---|---|---|
| Mean | 0.85 | 0.834 | 1116.0 |
| Mean w/o NLI | 0.851 | 0.835 | 1104.0 |
| mnli_m | 0.942 | 0.935 | 1039.0 |
| mnli_mm | 0.944 | 0.938 | 1241.0 |
| fevernli | 0.894 | 0.882 | 1138.0 |
| anli_r1 | 0.812 | 0.795 | 1102.0 |
| anli_r2 | 0.717 | 0.688 | 1124.0 |
| anli_r3 | 0.716 | 0.676 | 1133.0 |
| wanli | 0.836 | 0.823 | 1251.0 |
| lingnli | 0.909 | 0.898 | 1240.0 |
| wellformedquery | 0.815 | 0.814 | 1263.0 |
| rottentomatoes | 0.899 | 0.899 | 1231.0 |
| amazonpolarity | 0.964 | 0.964 | 1054.0 |
| imdb | 0.951 | 0.951 | 559.0 |
| yelpreviews | 0.984 | 0.984 | 795.0 |
| hatexplain | 0.814 | 0.77 | 1238.0 |
| massive | 0.8 | 0.753 | 1312.0 |
| banking77 | 0.744 | 0.763 | 1285.0 |
| emotiondair | 0.752 | 0.69 | 1273.0 |
| emocontext | 0.802 | 0.805 | 1268.0 |
| empathetic | 0.544 | 0.533 | 992.0 |
| agnews | 0.899 | 0.899 | 1222.0 |
| yahootopics | 0.735 | 0.729 | 894.0 |
| biasframes_sex | 0.934 | 0.925 | 1176.0 |
| biasframes_offensive | 0.864 | 0.864 | 1194.0 |
| biasframes_intent | 0.877 | 0.877 | 1197.0 |
| financialphrasebank | 0.913 | 0.901 | 1206.0 |
| appreviews | 0.953 | 0.953 | 1166.0 |
| hateoffensive | 0.921 | 0.855 | 1227.0 |
| trueteacher | 0.821 | 0.821 | 541.0 |
| spam | 0.989 | 0.983 | 1199.0 |
| wikitoxic_toxicaggregated | 0.901 | 0.901 | 1045.0 |
| wikitoxic_obscene | 0.927 | 0.927 | 1054.0 |
| wikitoxic_identityhate | 0.931 | 0.931 | 1020.0 |
| wikitoxic_threat | 0.959 | 0.952 | 1005.0 |
| wikitoxic_insult | 0.911 | 0.911 | 1063.0 |
| manifesto | 0.497 | 0.362 | 1214.0 |
| capsotu | 0.73 | 0.662 | 1220.0 |
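As a reminder of what the two metric columns measure, here is a toy sketch of computing accuracy and macro-averaged F1 with scikit-learn. The labels are made up, and scikit-learn is an assumption for illustration, not necessarily the tooling used for the evaluation above.

```python
# Toy illustration of the reported metrics (not the actual evaluation data).
from sklearn.metrics import accuracy_score, f1_score

y_true = ["positive", "negative", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "neutral"]

print("Accuracy:", accuracy_score(y_true, y_pred))
# Macro F1 averages the per-class F1 scores with equal weight per class.
print("F1 macro:", f1_score(y_true, y_pred, average="macro"))
```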
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 9e-06
- train_batch_size: 16
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.06
- num_epochs: 2
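These settings map roughly onto transformers `TrainingArguments` as sketched below; the output directory and the bf16 flag are assumptions for illustration, and model/data loading is omitted.

```python
# Sketch of TrainingArguments matching the hyperparameters above.
# output_dir and bf16 are assumptions, not stated in this card.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="modernbert-zeroshot-v2.0",  # hypothetical path
    learning_rate=9e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,  # total train batch size 16 * 2 = 32
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    optim="adamw_torch",
    seed=42,
    bf16=True,  # bf16 is reported above to give a ~2x speedup over fp16
)
```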
## Framework versions
- Transformers 4.48.0.dev0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0