SkitCon/gec-spanish-BETO-TOKEN-COWS-L2H

This model was trained on 80% of the COWS-L2H dataset for grammatical error correction of Spanish text. The corpus was split into sentences before training, so the model is fine-tuned for SENTENCE CORRECTION and will likely not perform well on an entire paragraph. To correct a paragraph, split it into sentences and run the model on each sentence.
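The paragraph workflow described above can be sketched as follows. This is a minimal illustration, not part of the synth_gec_es repository: `split_sentences` is a hypothetical helper that splits naively on terminal punctuation, so it will mis-split abbreviations such as "Sr." and a proper sentence segmenter is likely a better choice in practice.

```python
import re
from typing import List

# Hypothetical helper (an assumption, not from the synth_gec_es repo):
# split after ., !, ? or … whenever whitespace follows.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Naively split a paragraph into sentences on terminal punctuation."""
    parts = _SENTENCE_BOUNDARY.split(paragraph.strip())
    return [part.strip() for part in parts if part.strip()]


paragraph = "Yo va al tienda. Espero que tú ganas. ¿Dónde está la biblioteca?"
sentences = split_sentences(paragraph)

for sentence in sentences:
    print(sentence)
```

The resulting list can then be passed to the tokenizer and model in place of a hand-written list of input sentences.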

BLEU: 0.735 on COWS-L2H

This model requires imports from a GitHub repository because it has not yet been integrated with the Transformers library. Clone the repository: https://github.com/SkitCon/synth_gec_es. All imports in the example code are relative to the top level of this repo.

This model uses a custom token-level transformation schema to correct text, so correction is framed as a token-classification task. The forward pass of the model returns two logit vectors, which are used to predict token labels.
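To make the two-output idea concrete, here is a framework-free sketch of turning two sets of per-token logits into label ids. It assumes one set scores the transformation type and the other scores its parameter, which is inferred only from the names `type_logits` and `param_logits` in the usage example. The label counts, the toy values, and the `predict_labels` helper are all hypothetical; the real label inventory and decoding logic live in the repository.

```python
from typing import List, Tuple


def argmax(scores: List[float]) -> int:
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def predict_labels(
    type_logits: List[List[float]],
    param_logits: List[List[float]],
    attention_mask: List[int],
) -> List[Tuple[int, int]]:
    """Pick the best (type id, parameter id) pair for each non-padding token."""
    predictions = []
    for type_row, param_row, keep in zip(type_logits, param_logits, attention_mask):
        if keep:  # attention_mask == 0 marks padding, which is skipped
            predictions.append((argmax(type_row), argmax(param_row)))
    return predictions


# Toy logits for one sentence with three positions; the last one is padding.
# Sizes are illustrative only: 3 transformation types, 4 parameter values.
type_logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.0, 0.0]]
param_logits = [[0.1, 0.2, 0.3, 0.0], [1.2, 0.4, -0.5, 0.9], [0.0, 0.0, 0.0, 0.0]]
attention_mask = [1, 1, 0]

labels = predict_labels(type_logits, param_logits, attention_mask)
print(labels)
```

In the actual model, the mapping from these ids to text labels and the application of each transformation are handled by the repository's decoder rather than by code like this.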

WARNING: This model needs to be improved. In general, the model is good at recognizing when a sentence is grammatically correct and at correcting simple errors (e.g., gender disagreement, subject-verb disagreement), but in more complicated cases (e.g., incorrect mood) it may fail to detect an error. Additionally, the model almost always detects when a word should be replaced, but it fails to choose the correct replacement. As a result, the logit output for the replacement usually selects [MASK].

Example usage:

from transformers import AutoTokenizer
from models.model import BETOTokenLevelGECModel
from utils.utils import load_modified_nlp, load_vocab, load_morpho_dict

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = BETOTokenLevelGECModel.from_pretrained("SkitCon/gec-spanish-BETO-TOKEN-COWS-L2H")

lemma_to_morph = load_morpho_dict("lang_def/morpho_dict_updated.json")
vocab = load_vocab("lang_def/vocab.txt")
nlp = load_modified_nlp()

input_sentences = ["Yo va al tienda.", "Espero que tú ganas."]

tokenized_text = tokenizer(input_sentences, max_length=128, padding="max_length", truncation=True, return_tensors="pt")

# Keep the batch dimension; .squeeze() would drop it for a single-sentence batch
input_ids = tokenized_text["input_ids"]
attention_mask = tokenized_text["attention_mask"]

# Get logits for both label parts
type_logits, param_logits = model(input_ids=input_ids, attention_mask=attention_mask)

# Transform type_logits and param_logits into label vectors, convert them to text labels, and apply the decoder from the GitHub repository
correct_sentences = model.batch_decode_from_logits(input_sentences, type_logits, param_logits, lemma_to_morph, vocab, nlp)

for sentence in correct_sentences:
    print(sentence)
