julian-schelb committed on
Commit 1b53e0f · 1 Parent(s): 1733f36

Update README.md

Files changed (1)
  1. README.md +92 -60
README.md CHANGED
@@ -1,23 +1,54 @@
1
  ---
2
- language:
3
  - en
4
  - de
5
  - fr
6
- - zh
7
  - ne
8
- - multilingual
9
  widget:
10
- - text: "In December 1903 in France the Royal Swedish Academy of Sciences awarded Pierre Curie, Marie Curie, and Henri Becquerel the Nobel Prize in Physics."
11
- - text: "Für Richard Phillips Feynman war es immer wichtig in New York, die unanschaulichen Gesetzmäßigkeiten der Quantenphysik Laien und Studenten nahezubringen und verständlich zu machen."
12
- - text: "My name is Julian and I live in Constance"
13
- - text: "Terence David John Pratchett est né le 28 avril 1948 à Beaconsfield dans le Buckinghamshire, en Angleterre."
14
- - text: "北京市,通称北京(汉语拼音:Běijīng;邮政式拼音:Peking),简称“京”,是中华人民共和国的首都及直辖市,是该国的政治、文化、科技、教育、军事和国际交往中心,是一座全球城市,是世界人口第三多的城市和人口最多的首都,具有重要的国际影响力,同時也是目前世界唯一的“双奥之城”,即唯一既主办过夏季"
15
- - text: "काठमाडौँ नेपालको सङ्घीय राजधानी र नेपालको सबैभन्दा बढी जनसङ्ख्या भएको सहर हो।"
16
  tags:
17
  - roberta
 
 
18
  license: mit
19
  datasets:
20
  - wikiann
21
  ---
22
 
23
  # RoBERTa for Multilingual Named Entity Recognition
@@ -30,70 +61,55 @@ This model detects entities by classifying every token according to the IOB form
30
  ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
31
  ```
32
 
33
- **Languages:**
34
-
35
- TBD
36
-
37
  ## Training data
38
 
39
- This model was trained using a subset of the [wikiann](https://huggingface.co/datasets/wikiann) dataset.
40
-
41
- ## Evaluation results
42
-
43
- This model achieves the following results (measured using the validation portion of the [wikiann](https://huggingface.co/datasets/wikiann) dataset):
44
 
45
  ```python
46
- {'LOC': {'f1': 0.8541617653978262,
47
- 'number': 42016,
48
- 'precision': 0.8444273885942878,
49
- 'recall': 0.8641231911652704},
50
- 'ORG': {'f1': 0.7504633739856393,
51
- 'number': 31226,
52
- 'precision': 0.7305394669011736,
53
- 'recall': 0.7715045154678793},
54
- 'PER': {'f1': 0.8639735635284596,
55
- 'number': 29647,
56
- 'precision': 0.863711444463172,
57
- 'recall': 0.8642358417377812},
58
- 'overall_accuracy': 0.926459490605155,
59
- 'overall_f1': 0.8250250567072849,
60
- 'overall_precision': 0.814290312198262,
61
- 'overall_recall': 0.8360466133405904}
62
-
63
  ```
64
 
65
- **Per Entity Type:**
66
-
67
- TBD
68
-
69
- **Per Language:**
70
-
71
- TBD
72
 
 
73
 
74
- ## About RoBERTa
75
-
76
- This model is a fine-tuned version of [XLM-RoBERTa](https://huggingface.co/xlm-roberta-large). The original model was pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
77
-
78
- RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
79
-
80
- More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
81
-
82
- This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.
83
 
84
- #### Limitations and bias
85
 
86
- This model is limited by its training dataset of entity-annotated news articles from a specific span of time. This may not generalize well for all use cases in different domains.
87
 
88
  ## Usage
89
 
90
- You can use this model via the `AutoTokenizer` and `AutoModelForTokenClassification` classes:
91
 
92
  ```python
93
  from transformers import AutoTokenizer, AutoModelForTokenClassification
94
 
95
- tokenizer = AutoTokenizer.from_pretrained("julian-schelb/roberta-ner-multilingual/", add_prefix_space=True)
96
- model = AutoModelForTokenClassification.from_pretrained("julian-schelb/roberta-ner-multilingual/")
97
 
98
  text = "In December 1903 in France the Royal Swedish Academy of Sciences awarded Pierre Curie, Marie Curie, and Henri Becquerel the Nobel Prize in Physics."
99
 
@@ -115,8 +131,24 @@ predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_t
115
  predicted_tokens_classes
116
  ```
117
 
118
- ### BibTeX entry and citation info
119
 
120
- ```bibtex
121
- TBD
122
- ```
1
  ---
2
+ language:
3
  - en
4
  - de
5
  - fr
6
+ - zh
7
+ - it
8
+ - es
9
+ - hi
10
+ - bn
11
+ - ar
12
+ - ru
13
+ - uk
14
+ - pt
15
+ - ur
16
+ - id
17
+ - ja
18
  - ne
19
+ - nl
20
+ - tr
21
+ - ca
22
+ - bg
23
+ - yue
24
+
25
  widget:
26
+ - text: >-
27
+ In December 1903 in France the Royal Swedish Academy of Sciences awarded
28
+ Pierre Curie, Marie Curie, and Henri Becquerel the Nobel Prize in Physics.
29
+ - text: >-
30
+ Für Richard Phillips Feynman war es immer wichtig in New York, die
31
+ unanschaulichen Gesetzmäßigkeiten der Quantenphysik Laien und Studenten
32
+ nahezubringen und verständlich zu machen.
33
+ - text: My name is Julian and I live in Constance.
34
+ - text: >-
35
+ Terence David John Pratchett est né le 28 avril 1948 à Beaconsfield dans le
36
+ Buckinghamshire, en Angleterre.
37
+ - text: >-
38
+ 北京市,通称北京(汉语拼音:Běijīng;邮政式拼音:Peking),简称“京”,是中华人民共和国的首都及直辖市,是该国的政治、文化、科技、教育、军事和国际交往中心,是一座全球城市,是世界人口第三多的城市和人口最多的首都,具有重要的国际影响力,同時也是目前世界唯一的“双奥之城”,即唯一既主办过夏季
39
+
40
  tags:
41
  - roberta
42
+ - ner
43
+ - nlp
44
  license: mit
45
  datasets:
46
  - wikiann
47
+ metrics:
48
+ - f1
49
+ - precision
50
+ - accuracy
51
+ - recall
52
  ---
53
 
54
  # RoBERTa for Multilingual Named Entity Recognition
 
61
  ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
62
  ```
63
 
 
 
 
 
64
  ## Training data
65
 
66
+ This model was fine-tuned on a portion of the [wikiann](https://huggingface.co/datasets/wikiann) dataset corresponding to the following languages:
67
 
68
  ```python
69
+ ["en", "de", "fr",
70
+ "zh", "it", "es",
71
+ "hi", "bn", "ar",
72
+ "ru", "uk", "pt",
73
+ "ur", "id", "ja",
74
+ "ne", "nl", "tr",
75
+ "ca", "bg", "zh-yue"]
76
  ```
77
 
78
+ The model was fine-tuned on 375,100 training sentences, with a validation set of 173,100 examples. The reported performance metrics are based on a further 173,100 test examples. The complete WikiANN dataset includes training examples for 282 languages and was constructed from Wikipedia. Training examples are extracted automatically by exploiting entity mentions in Wikipedia articles, which are often formatted as hyperlinks to the source article. The provided NER tags are in the IOB2 format. Named entities are classified as location (LOC), person (PER), or organization (ORG).
79
 
80
+ ## Evaluation results
81
 
82
+ This model achieves the following results (measured on the test split of the [wikiann](https://huggingface.co/datasets/wikiann) dataset):
83
 
84
+ ```python
85
+ {'LOC': {'f1': 0.9310524680196053,
86
+ 'number': 545516,
87
+ 'precision': 0.9230957726278464,
88
+ 'recall': 0.9391475227124411},
89
+ 'ORG': {'f1': 0.884603763901478,
90
+ 'number': 363324,
91
+ 'precision': 0.8868243944134171,
92
+ 'recall': 0.8823942266406843},
93
+ 'PER': {'f1': 0.939167449173159,
94
+ 'number': 367750,
95
+ 'precision': 0.934642687866253,
96
+ 'recall': 0.9437362338545208},
97
+ 'overall_accuracy': 0.9588396024156357,
98
+ 'overall_f1': 0.9202625613733114,
99
+ 'overall_precision': 0.9162434124141294,
100
+ 'overall_recall': 0.9243171260937341}
101
 
102
+ ```
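As a quick sanity check on the figures above, F1 is the harmonic mean of precision and recall, so the reported `overall_f1` can be recomputed from `overall_precision` and `overall_recall`. The helper name `f1_score` below is only illustrative and is not part of the model card or of any evaluation library; this is a minimal sketch, not the evaluation script that produced the results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall, copied from the evaluation results above.
overall_precision = 0.9162434124141294
overall_recall = 0.9243171260937341

# Recompute F1 and compare it against the reported overall_f1.
overall_f1 = f1_score(overall_precision, overall_recall)
reported_f1 = 0.9202625613733114
print(f"recomputed={overall_f1:.6f} reported={reported_f1:.6f}")
```

The same check can be applied to the per-entity entries (`LOC`, `ORG`, `PER`) by substituting their precision and recall values.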
103
 
104
  ## Usage
105
 
106
+ You can load this model using the `AutoTokenizer` and `AutoModelForTokenClassification` classes:
107
 
108
  ```python
109
  from transformers import AutoTokenizer, AutoModelForTokenClassification
110
 
111
+ tokenizer = AutoTokenizer.from_pretrained("julian-schelb/roberta-ner-multilingual-wikiann", add_prefix_space=True)
112
+ model = AutoModelForTokenClassification.from_pretrained("julian-schelb/roberta-ner-multilingual-wikiann")
113
 
114
  text = "In December 1903 in France the Royal Swedish Academy of Sciences awarded Pierre Curie, Marie Curie, and Henri Becquerel the Nobel Prize in Physics."
115
 
 
131
  predicted_tokens_classes
132
  ```
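The predicted classes are per-token IOB tags, so a typical next step is to merge them into entity spans. The sketch below is one possible way to do that and is not taken from the README: `group_entities` is a hypothetical helper, and it assumes word-level tokens that are already aligned one-to-one with the tags (the model itself predicts on subword tokens, which would need to be merged first).

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush the previous one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" or an I- tag without a matching open entity closes the span.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical word-level example using the label set listed above.
example_tokens = ["Marie", "Curie", "visited", "France"]
example_tags = ["B-PER", "I-PER", "O", "B-LOC"]
example_entities = group_entities(example_tokens, example_tags)
print(example_entities)
```

Stray `I-` tags with no matching open entity are dropped here; a more lenient decoder could instead treat them as the start of a new entity.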
133
 
 
134
 
135
+ ## About RoBERTa
136
+
137
+ This model is a fine-tuned version of [XLM-RoBERTa](https://huggingface.co/xlm-roberta-large). The original model was pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
138
+
139
+ RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
140
+
141
+ More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
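To make the masking step concrete, here is a simplified sketch of how MLM inputs and labels could be generated from a tokenized sentence. It is an illustration under stated assumptions, not the actual XLM-RoBERTa pretraining code: the function name `mask_tokens`, the whitespace tokenization, and the choice to always substitute the `<mask>` string are assumptions made for this example.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "<mask>"  # assumed placeholder string for this sketch


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask roughly `mask_prob` of the tokens and return (inputs, labels).

    Labels hold the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)

    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


sentence = (
    "In December 1903 the Royal Swedish Academy of Sciences awarded "
    "Pierre Curie Marie Curie and Henri Becquerel the Nobel Prize"
).split()
masked_sentence, mlm_labels = mask_tokens(sentence)
print(masked_sentence)
```

Because the model sees the unmasked tokens on both sides of each `<mask>`, predicting the hidden token requires using left and right context together, which is the bidirectional property described above.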
142
+
143
+ This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.
144
+
145
+ #### Limitations and bias
146
+
147
+ This model is limited by its training dataset, which consists of automatically annotated Wikipedia text. It may not generalize well to use cases in other domains.
148
+
149
+ ## Related Papers
150
+
151
+ * Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., & Ji, H. (2017). Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1946–1958). Association for Computational Linguistics.
152
+ * Rahimi, A., Li, Y., & Cohn, T. (2019). Massively Multilingual Transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 151–164). Association for Computational Linguistics.
153
+ * Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
154
+