tiedeman committed
Commit 68a9048 · 1 Parent(s): a39f34c

Initial commit

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.spm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,361 @@
+ ---
+ library_name: transformers
+ language:
+ - aai
+ - ace
+ - agn
+ - aia
+ - alj
+ - alp
+ - amk
+ - aoz
+ - apr
+ - atq
+ - aui
+ - ban
+ - bcl
+ - bep
+ - bhz
+ - bku
+ - blz
+ - bmk
+ - bnp
+ - bpr
+ - bps
+ - btd
+ - bth
+ - bto
+ - bts
+ - btx
+ - bug
+ - buk
+ - bzh
+ - ceb
+ - cgc
+ - ch
+ - dad
+ - dob
+ - dtp
+ - dww
+ - emi
+ - es
+ - far
+ - fil
+ - fj
+ - fr
+ - frd
+ - gfk
+ - gil
+ - gor
+ - haw
+ - hil
+ - hla
+ - hnn
+ - hot
+ - hvn
+ - iba
+ - id
+ - ifa
+ - ifb
+ - ifk
+ - ifu
+ - ify
+ - ilo
+ - iry
+ - it
+ - itv
+ - jv
+ - jvn
+ - kbm
+ - khz
+ - kje
+ - kne
+ - kpg
+ - kqe
+ - kqf
+ - kqw
+ - krj
+ - kud
+ - kwf
+ - kzf
+ - law
+ - lcm
+ - leu
+ - lew
+ - lex
+ - lid
+ - ljp
+ - lnd
+ - mad
+ - mak
+ - mbb
+ - mbf
+ - mbt
+ - mee
+ - mek
+ - mg
+ - mh
+ - mhy
+ - mi
+ - mmo
+ - mmx
+ - mna
+ - mnb
+ - mog
+ - mox
+ - mpx
+ - mqj
+ - mrw
+ - ms
+ - msm
+ - mta
+ - mva
+ - mvp
+ - mwc
+ - mwv
+ - myw
+ - mzz
+ - na
+ - nak
+ - nia
+ - nij
+ - npy
+ - nsn
+ - nss
+ - nwi
+ - obo
+ - pag
+ - pam
+ - pau
+ - plw
+ - pmf
+ - pne
+ - ppk
+ - prf
+ - pt
+ - ptp
+ - ptu
+ - pwg
+ - rai
+ - rej
+ - rro
+ - rug
+ - sas
+ - sbl
+ - sda
+ - sgb
+ - sgz
+ - sm
+ - smk
+ - sml
+ - snc
+ - sps
+ - stn
+ - su
+ - swp
+ - sxn
+ - tbc
+ - tbl
+ - tbo
+ - tet
+ - tgo
+ - tgp
+ - tl
+ - tlx
+ - to
+ - tpa
+ - tpz
+ - tte
+ - tuc
+ - twb
+ - twu
+ - txa
+ - ty
+ - ubr
+ - uvl
+ - viv
+ - war
+ - wed
+ - wuv
+ - xsb
+ - xsi
+ - yml
+
+ tags:
+ - translation
+ - opus-mt-tc-bible
+
+ license: apache-2.0
+ model-index:
+ - name: opus-mt-tc-bible-big-poz-fra_ita_por_spa
+   results:
+   - task:
+       name: Translation multi-multi
+       type: translation
+       args: multi-multi
+     dataset:
+       name: tatoeba-test-v2020-07-28-v2023-09-26
+       type: tatoeba_mt
+       args: multi-multi
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 35.8
+     - name: chr-F
+       type: chrf
+       value: 0.56040
+ ---
+ # opus-mt-tc-bible-big-poz-fra_ita_por_spa
+
+ ## Table of Contents
+ - [Model Details](#model-details)
+ - [Uses](#uses)
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
+ - [How to Get Started With the Model](#how-to-get-started-with-the-model)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Citation Information](#citation-information)
+ - [Acknowledgements](#acknowledgements)
+
+ ## Model Details
+
+ Neural machine translation model for translating from Malayo-Polynesian languages (poz) to French, Italian, Portuguese and Spanish (fra+ita+por+spa).
+
+ This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages in the world. All models were originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and then converted to PyTorch using the Hugging Face transformers library. Training data is taken from [OPUS](https://opus.nlpl.eu/) and training pipelines use the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).
+ **Model Description:**
+ - **Developed by:** Language Technology Research Group at the University of Helsinki
+ - **Model Type:** Translation (transformer-big)
+ - **Release**: 2024-08-17
+ - **License:** Apache-2.0
+ - **Language(s):**
+   - Source Language(s): aai ace agn aia alj alp amk aoz apr atq aui ban bcl bep bhz bku blz bmk bnp bpr bps btd bth bto bts btx bug buk bzh ceb cgc cha dad dob dtp dww emi far fij fil frd gfk gil gor haw hil hla hnn hot hvn iba ifa ifb ifk ifu ify ilo ind iry itv jak jav jvn kbm khz kje kne kpg kqe kqf kqw krj kud kwf kzf law lcm leu lew lex lid ljp lnd mad mah mak mbb mbf mbt mee mek mhy mlg mmo mmx mna mnb mog mox mpx mqj mri mrw msa msm mta mva mvp mwc mwv myw mzz nak nau nia nij npy nsn nss nwi obo pag pam pau plt plw pmf pne ppk prf ptp ptu pwg rai rej rro rug sas sbl sda sgb sgz smk sml smo snc sps stn sun swp sxn tah tbc tbl tbo tet tgl tgo tgp tlx ton tpa tpz tte tuc twb twu txa ubr uvl viv war wed wuv xsb xsi yml zsm
+   - Target Language(s): fra ita por spa
+   - Valid Target Language Labels: >>fra<< >>ita<< >>por<< >>spa<< >>xxx<<
+ - **Original Model**: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/poz-fra+ita+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
+ - **Resources for more information:**
+   - [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/poz-fra%2Bita%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
+   - [OPUS-MT-train GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
+   - [More information about MarianNMT models in the transformers library](https://huggingface.co/docs/transformers/model_doc/marian)
+   - [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)
+   - [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1)
+   - [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/)
+
+ This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form of `>>id<<` (id = valid target language ID), e.g. `>>fra<<`.
+
+ ## Uses
+
+ This model can be used for translation and text-to-text generation.
+
+ ## Risks, Limitations and Biases
+
+ **CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.**
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
+
+ ## How to Get Started With the Model
+
+ A short code example:
+
+ ```python
+ from transformers import MarianMTModel, MarianTokenizer
+
+ # Each source sentence starts with the target-language token.
+ src_text = [
+     ">>fra<< Bag-ong iroy akong gusto.",
+     ">>fra<< Usá, duhá, tuló, upat, limá, unom, pitó, waló, siyam, napúlò."
+ ]
+
+ model_name = "Helsinki-NLP/opus-mt-tc-bible-big-poz-fra_ita_por_spa"
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
+ model = MarianMTModel.from_pretrained(model_name)
+ translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
+
+ for t in translated:
+     print(tokenizer.decode(t, skip_special_tokens=True))
+
+ # expected output:
+ #     Frappé comme la plante, mon cœur s'est flétri ; car j'oublie de manger mon pain.
+ #     Lahmek, lahmeh, lahmeh, lahmeh, lahmeh, lahmeh, lahmeh.
+ ```
+
+ You can also use OPUS-MT models with the transformers pipelines, for example:
+
+ ```python
+ from transformers import pipeline
+ pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-poz-fra_ita_por_spa")
+ print(pipe(">>fra<< Bag-ong iroy akong gusto."))
+
+ # expected output: Frappé comme la plante, mon cœur s'est flétri ; car j'oublie de manger mon pain.
+ ```
+
+ ## Training
+
+ - **Data**: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
+ - **Pre-processing**: SentencePiece (spm32k,spm32k)
+ - **Model Type:** transformer-big
+ - **Original MarianNMT Model**: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/poz-fra+ita+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
+ - **Training Scripts**: [GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
+
+ ## Evaluation
+
+ * [Model scores at the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/poz-fra%2Bita%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
+ * test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/poz-fra+ita+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt)
+ * test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/poz-fra+ita+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt)
+ * benchmark results: [benchmark_results.txt](benchmark_results.txt)
+ * benchmark output: [benchmark_translations.zip](benchmark_translations.zip)
+
+ | langpair | testset | chr-F | BLEU | #sent | #words |
+ |----------|---------|-------|-------|-------|--------|
+ | multi-multi | tatoeba-test-v2020-07-28-v2023-09-26 | 0.56040 | 35.8 | 2097 | 15222 |
+
+ ## Citation Information
+
+ * Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/). (Please cite them if you use this model.)
+
+ ```bibtex
+ @article{tiedemann2023democratizing,
+   title={Democratizing neural machine translation with {OPUS-MT}},
+   author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
+   journal={Language Resources and Evaluation},
+   number={58},
+   pages={713--755},
+   year={2023},
+   publisher={Springer Nature},
+   issn={1574-0218},
+   doi={10.1007/s10579-023-09704-w}
+ }
+
+ @inproceedings{tiedemann-thottingal-2020-opus,
+   title = "{OPUS}-{MT} {--} Building open translation services for the World",
+   author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
+   booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
+   month = nov,
+   year = "2020",
+   address = "Lisboa, Portugal",
+   publisher = "European Association for Machine Translation",
+   url = "https://aclanthology.org/2020.eamt-1.61",
+   pages = "479--480",
+ }
+
+ @inproceedings{tiedemann-2020-tatoeba,
+   title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
+   author = {Tiedemann, J{\"o}rg},
+   booktitle = "Proceedings of the Fifth Conference on Machine Translation",
+   month = nov,
+   year = "2020",
+   address = "Online",
+   publisher = "Association for Computational Linguistics",
+   url = "https://aclanthology.org/2020.wmt-1.139",
+   pages = "1174--1182",
+ }
+ ```
+
+ ## Acknowledgements
+
+ The work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by [CSC -- IT Center for Science](https://www.csc.fi/), Finland, and the [EuroHPC supercomputer LUMI](https://www.lumi-supercomputer.eu/).
+
+ ## Model conversion info
+
+ * transformers version: 4.45.1
+ * OPUS-MT git hash: 0882077
+ * port time: Tue Oct 8 13:11:27 EEST 2024
+ * port machine: LM0-400-22516.local
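The README added above states that every source sentence must begin with a `>>id<<` target-language token. A minimal sketch of how that prefix could be added before tokenization follows; the helper name `add_target_token` and the `VALID_TARGET_LABELS` set are illustrative assumptions, not part of the OPUS-MT or transformers API, and the generic `>>xxx<<` label listed in the card is deliberately left out here.

```python
# Target labels taken from the model card's "Target Language(s)" entry.
# This set and the helper below are hypothetical conveniences.
VALID_TARGET_LABELS = {"fra", "ita", "por", "spa"}


def add_target_token(sentence: str, target: str) -> str:
    """Prefix a source sentence with the >>id<< token described in the card."""
    if target not in VALID_TARGET_LABELS:
        raise ValueError(f"unsupported target language: {target!r}")
    return f">>{target}<< {sentence.strip()}"


sentences = ["Bag-ong iroy akong gusto."]
batch = [add_target_token(s, "fra") for s in sentences]
print(batch)  # ['>>fra<< Bag-ong iroy akong gusto.']
```

The resulting list can be passed to the tokenizer in the same way as `src_text` in the README example.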
benchmark_results.txt ADDED
@@ -0,0 +1 @@
+ multi-multi tatoeba-test-v2020-07-28-v2023-09-26 0.56040 35.8 2097 15222
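The single line in `benchmark_results.txt` carries the same six fields as the evaluation table in the README (langpair, testset, chr-F, BLEU, #sent, #words). A small sketch of reading such a line is shown below; it assumes whitespace-separated fields in that column order, and `BenchmarkRow` and `parse_benchmark_line` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class BenchmarkRow(NamedTuple):
    langpair: str
    testset: str
    chrf: float
    bleu: float
    n_sentences: int
    n_words: int


def parse_benchmark_line(line: str) -> BenchmarkRow:
    """Split one whitespace-separated benchmark line into typed fields."""
    langpair, testset, chrf, bleu, n_sent, n_words = line.split()
    return BenchmarkRow(langpair, testset, float(chrf), float(bleu),
                        int(n_sent), int(n_words))


row = parse_benchmark_line(
    "multi-multi tatoeba-test-v2020-07-28-v2023-09-26 0.56040 35.8 2097 15222"
)
print(f"{row.testset}: chrF={row.chrf}, BLEU={row.bleu}, "
      f"{row.n_words / row.n_sentences:.1f} words per sentence")
```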
benchmark_translations.zip ADDED
File without changes
config.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "_name_or_path": "pytorch-models/opus-mt-tc-bible-big-poz-fra_ita_por_spa",
+   "activation_dropout": 0.0,
+   "activation_function": "relu",
+   "architectures": [
+     "MarianMTModel"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "classifier_dropout": 0.0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 6,
+   "decoder_start_token_id": 60867,
+   "decoder_vocab_size": 60868,
+   "dropout": 0.1,
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 6,
+   "eos_token_id": 195,
+   "forced_eos_token_id": null,
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "max_length": null,
+   "max_position_embeddings": 1024,
+   "model_type": "marian",
+   "normalize_embedding": false,
+   "num_beams": null,
+   "num_hidden_layers": 6,
+   "pad_token_id": 60867,
+   "scale_embedding": true,
+   "share_encoder_decoder_embeddings": true,
+   "static_position_embeddings": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.45.1",
+   "use_cache": true,
+   "vocab_size": 60868
+ }
generation_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "_from_model_config": true,
+   "bad_words_ids": [
+     [
+       60867
+     ]
+   ],
+   "bos_token_id": 0,
+   "decoder_start_token_id": 60867,
+   "eos_token_id": 195,
+   "forced_eos_token_id": 195,
+   "max_length": 512,
+   "num_beams": 4,
+   "pad_token_id": 60867,
+   "transformers_version": "4.45.1"
+ }
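`config.json` and `generation_config.json` repeat several special-token ids, so a quick consistency check can catch conversion mistakes. The sketch below copies the relevant values from the two files in this commit and compares them; `check_generation_config` is a hypothetical helper, and the rules it applies (matching ids, ids inside the vocabulary, padding id banned from output) are assumptions about what a sane Marian export looks like rather than a documented validation procedure.

```python
# Values copied from config.json and generation_config.json in this commit.
config = {
    "bos_token_id": 0,
    "decoder_start_token_id": 60867,
    "eos_token_id": 195,
    "pad_token_id": 60867,
    "vocab_size": 60868,
}
generation_config = {
    "bad_words_ids": [[60867]],
    "bos_token_id": 0,
    "decoder_start_token_id": 60867,
    "eos_token_id": 195,
    "forced_eos_token_id": 195,
    "max_length": 512,
    "num_beams": 4,
    "pad_token_id": 60867,
}


def check_generation_config(config: dict, generation_config: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    shared = ("bos_token_id", "decoder_start_token_id", "eos_token_id", "pad_token_id")
    for key in shared:
        if config[key] != generation_config[key]:
            problems.append(f"{key} differs: {config[key]} vs {generation_config[key]}")
        if not 0 <= generation_config[key] < config["vocab_size"]:
            problems.append(f"{key} is outside the vocabulary")
    # The padding token should never be generated as regular output.
    if [generation_config["pad_token_id"]] not in generation_config["bad_words_ids"]:
        problems.append("pad_token_id is not listed in bad_words_ids")
    return problems


print(check_generation_config(config, generation_config))  # []
```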
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c95d9670d94a3a469cb8b2d18c8c7aecd3f5787a2ae7a51e8f5256a9dac52690
+ size 955017920
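The hyperparameters in `config.json` can be related to the `model.safetensors` size recorded in the LFS pointer above. The sketch below is a rough back-of-the-envelope estimate, not the converter's own accounting. It assumes a standard Transformer layout (attention and feed-forward projections with biases, layer norms with scale and bias), one embedding matrix shared by encoder, decoder and output layer, a final logits bias, and sinusoidal position embeddings that are not stored. The function name is hypothetical.

```python
# Hyperparameters copied from config.json in this commit.
cfg = {
    "d_model": 1024,
    "encoder_layers": 6,
    "decoder_layers": 6,
    "encoder_ffn_dim": 4096,
    "decoder_ffn_dim": 4096,
    "vocab_size": 60868,
}
SAFETENSORS_BYTES = 955017920  # "size" field of the model.safetensors LFS pointer
BYTES_PER_FLOAT32 = 4          # torch_dtype is float32


def estimate_parameters(cfg: dict) -> int:
    """Estimate the parameter count under the layout assumptions stated above."""
    d = cfg["d_model"]
    attention = 4 * (d * d + d)      # query, key, value and output projections
    layer_norm = 2 * d               # scale and bias

    def feed_forward(ffn: int) -> int:
        return (d * ffn + ffn) + (ffn * d + d)

    encoder_layer = attention + feed_forward(cfg["encoder_ffn_dim"]) + 2 * layer_norm
    decoder_layer = 2 * attention + feed_forward(cfg["decoder_ffn_dim"]) + 3 * layer_norm
    embeddings = cfg["vocab_size"] * d   # assumed shared across encoder/decoder/output
    logits_bias = cfg["vocab_size"]
    return (embeddings + logits_bias
            + cfg["encoder_layers"] * encoder_layer
            + cfg["decoder_layers"] * decoder_layer)


params = estimate_parameters(cfg)
print(f"estimated parameters: {params:,}")
print(f"estimated float32 bytes: {params * BYTES_PER_FLOAT32:,}")
print(f"remaining bytes in file: {SAFETENSORS_BYTES - params * BYTES_PER_FLOAT32:,}")
```

Under these assumptions the estimate comes to roughly 239 million parameters, and the few remaining kilobytes would be consistent with file metadata, though that interpretation is itself an assumption.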
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0317cb0649e46a031781d9315766840b8e6d3c758235c8fd23af0c621911db36
+ size 955069189
source.spm ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:137c8812d64e538a1102a4c8bf8a8ec7317887c34233ed69e32697589059a732
+ size 772108
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
target.spm ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a7f78d035ae4999fe85b42cd4c05cf2584aff235772a58dbdb2bf0769a80d93
+ size 825455
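The binary files in this commit (`model.safetensors`, `pytorch_model.bin`, `source.spm`, `target.spm`) appear in the diff only as three-line Git LFS pointers with `version`, `oid` and `size` keys. A minimal sketch of reading such a pointer is given below, using the `target.spm` pointer as input; `parse_lfs_pointer` is a hypothetical helper and handles only the three keys seen here.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0a7f78d035ae4999fe85b42cd4c05cf2584aff235772a58dbdb2bf0769a80d93
size 825455
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size"])  # sha256 825455
```

After downloading the real file, its SHA-256 digest and byte length can be compared against these values to confirm the download is intact.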
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"source_lang": "poz", "target_lang": "fra+ita+por+spa", "unk_token": "<unk>", "eos_token": "</s>", "pad_token": "<pad>", "model_max_length": 512, "sp_model_kwargs": {}, "separate_vocabs": false, "special_tokens_map_file": null, "name_or_path": "marian-models/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17/poz-fra+ita+por+spa", "tokenizer_class": "MarianTokenizer"}
vocab.json ADDED
The diff for this file is too large to render.