# Northern Frisian translation model
This is an NLLB-200-distilled-600M model fine-tuned for translating between German and the Northern Frisian dialects Mooringer Frasch and Wiringhiirder Freesk, following this great blog post.
While the additional data introduced with the new dialect has improved the model's performance for German <-> Mooring translations compared to nllb-deu-moo, the extended training has at the same time degraded its performance for other languages. For example, translating English to Mooring still works relatively well, while the reverse direction, Mooring to English, does not (see the example below).
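For instance, the English -> Mooring direction can be exercised with the `translate` helper from the Usage section below (an illustration only; `eng_Latn` is NLLB's built-in code for English, and the sample sentence is made up):

```python
# Uses the tokenizer, model and `translate` helper set up in the Usage
# section below; the English input is an illustrative example.
translate("The children are playing on the beach.",
          tokenizer=tokenizer, model=model,
          src_lang='eng_Latn', tgt_lang='moo_Latn')
```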
## Data
Mooring <-> German:
The Mooring dataset for fine-tuning consisted of 9339 sentence pairs. Most examples (roughly 5100) were taken directly from "Rüm Hart", published by the Nordfriisk Instituut. The Python sentence-splitter library was used for sentence splitting (see the sketch at the end of this section). The splitting wasn't perfect, especially in cases of direct speech, so manual re-alignment and further splitting were necessary. In addition, the texts about larks from Föögle önj Nordfraschlönj (Marie Tångeberg, 1992) and a translation of Theodor Storm's story Bulemanns Haus were added, as well as roughly 3000 examples taken from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988). Finally, a little under 180 very simple self-written examples were used as the evaluation data set.

Wiringhiirder <-> German:
The Wiringhiirder dataset consisted of 7529 sentence pairs taken from the books "Di muon fuon e halie" and "Di tofel" by Peter Jensen, published by the Nordfriisk Instituut. The same preprocessing steps were applied as for Rüm Hart above. For evaluation, sentences were collected from Wikipedia; however, the evaluation set remains very small and is barely enough to detect overfitting.
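For reference, the sentence splitting step could have looked roughly like this (a minimal sketch assuming the sentence-splitter PyPI package mentioned above; the sample text is made up):

```python
# pip install sentence-splitter
from sentence_splitter import SentenceSplitter

# Split German book text into sentences; direct speech is where this
# tends to go wrong, hence the manual re-alignment described above.
splitter = SentenceSplitter(language='de')
sentences = splitter.split(text='Momme wohnt in Niebüll. Er sagte: "Ich gehe jetzt heim."')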
## Usage
How to use the model:
```python
!pip install transformers==4.33
```

```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer


def create_tokenizer_with_new_langs(model_id, new_langs):
    # Register the new language codes (e.g. moo_Latn, wir_Latn) with the
    # NLLB tokenizer so they can be used like the built-in codes.
    tokenizer = NllbTokenizer.from_pretrained(model_id)
    for new_lang in new_langs:
        old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
        new_token_id = old_len - 1
        if new_lang in tokenizer.added_tokens_encoder:
            new_token_id = tokenizer.added_tokens_encoder[new_lang] - 1
        tokenizer.lang_code_to_id[new_lang] = new_token_id
        tokenizer.id_to_lang_code[new_token_id] = new_lang
        # always move "mask" to the last position
        tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
        tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
        tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
        if new_lang not in tokenizer._additional_special_tokens:
            tokenizer._additional_special_tokens.append(new_lang)
        # clear the added token encoder; otherwise a new token may end up there by mistake
        tokenizer.added_tokens_encoder = {}
        tokenizer.added_tokens_decoder = {}
    return tokenizer
```
```python
def translate(
    text,
    tokenizer,
    model,
    src_lang='moo_Latn',
    tgt_lang='deu_Latn',
    a=32,
    b=3,
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)
```
path = "CmdCody/nllb-deu-frr"
tokenizer = create_tokenizer_with_new_langs(path, ['moo_Latn', 'wir_Latn'])
model = AutoModelForSeq2SeqLM.from_pretrained(path)
translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
## Training
The model was trained in a Google Colab notebook for 4 epochs with a batch size of 16, following the above-mentioned blog post, with two notable adaptations:
- The data iteration was changed to make sure that the model sees each example in the dataset exactly once per epoch.
- After tokenization and batching, the complete data set is shuffled before each epoch so that all translation directions are mixed. However, each batch only contains examples for one direction (see the sketch below).
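A minimal sketch of that batching scheme (an assumed illustration, not the actual training code; `batches_by_direction` is a hypothetical mapping from direction name to pre-built single-direction batches):

```python
import random

def epoch_batches(batches_by_direction, seed):
    # Flatten the per-direction batch lists and shuffle the batches
    # themselves: directions end up mixed across the epoch, but every
    # individual batch still holds only one translation direction.
    batches = [batch for direction_batches in batches_by_direction.values()
               for batch in direction_batches]
    random.Random(seed).shuffle(batches)
    return batches

# for epoch in range(4):
#     for batch in epoch_batches(batches_by_direction, seed=epoch):
#         ...  # one training step per single-direction batch
```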
## Evaluation
Metrics on the evaluation data sets:
| | BLEU | chrF++ |
|---|---|---|
| Moo -> Deu | 55.78 | 70.73 |
| Deu -> Moo | 50.19 | 67.76 |
| Wir -> Deu | 67.22 | 80.16 |
| Deu -> Wir | 42.35 | 61.08 |
Note: As mentioned above, the Wiringhiirder evaluation set is very small, so the resulting metrics should not be directly compared with the Mooring metrics.
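For context, scores like these are typically computed with sacrebleu; a minimal sketch (assuming hypothetical `hypotheses` and `references` lists for one direction, not the exact evaluation code used here):

```python
import sacrebleu

# `hypotheses` is a list of model outputs, `references` the matching
# gold translations (hypothetical placeholders for one direction).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}, chrF++: {chrf.score:.2f}")
```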