ZhiyuanChen
committed on
Update README.md
Browse files
README.md
CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
+- example_title: "HIV-1"
+  text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
+  output:
+  - label: "CUC"
+    score: 0.40745577216148376
+  - label: "CAC"
+    score: 0.40001827478408813
+  - label: "CCC"
+    score: 0.14566268026828766
+  - label: "CGC"
+    score: 0.04422207176685333
+  - label: "CAU"
+    score: 0.0008025980787351727
 - example_title: "microRNA-21"
   text: "UAGC<mask><mask><mask>UCAGACUGAUGUUGA"
   output:
@@ -101,7 +114,7 @@ The OFFICIAL repository of 3UTRBERT is at [yangyn533/3UTRBERT](https://github.co
 - **Paper**: [Deciphering 3’ UTR mediated gene regulation using interpretable deep representation learning](https://doi.org/10.1101/2023.09.08.556883)
 - **Developed by**: Yuning Yang, Gen Li, Kuan Pang, Wuxinhao Cao, Xiangtao Li, Zhaolei Zhang
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
-- **Original Repository**: [
+- **Original Repository**: [yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT)
 
 ## Usage
 
@@ -120,29 +133,29 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> import multimolecule  # you must import multimolecule to register models
 >>> from transformers import pipeline
->>> unmasker = pipeline(
->>> unmasker("
-
-[{'score': 0.
-  'token':
-  'token_str': '
-  'sequence': '<cls>
- {'score': 0.
-  'token': 72,
-  'token_str': 'GUC',
-  'sequence': '<cls> UAG <mask> GUC <mask> CAG AGA GAC ACU CUG UGA GAU AUG UGU GUU UUG UGA <eos>'},
- {'score': 0.06567499041557312,
+>>> unmasker = pipeline("fill-mask", model="multimolecule/utrbert-3mer")
+>>> unmasker("gguc<mask><mask><mask>cugguuagaccagaucugagccu")[1]
+
+[{'score': 0.40745577216148376,
+  'token': 47,
+  'token_str': 'CUC',
+  'sequence': '<cls> GGU GUC <mask> CUC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
+ {'score': 0.40001827478408813,
   'token': 32,
   'token_str': 'CAC',
-  'sequence': '<cls>
- {'score': 0.
-  'token':
-  'token_str': '
-  'sequence': '<cls>
- {'score': 0.
-  'token':
-  'token_str': '
-  'sequence': '<cls>
+  'sequence': '<cls> GGU GUC <mask> CAC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
+ {'score': 0.14566268026828766,
+  'token': 37,
+  'token_str': 'CCC',
+  'sequence': '<cls> GGU GUC <mask> CCC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
+ {'score': 0.04422207176685333,
+  'token': 42,
+  'token_str': 'CGC',
+  'sequence': '<cls> GGU GUC <mask> CGC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
+ {'score': 0.0008025980787351727,
+  'token': 34,
+  'token_str': 'CAU',
+  'sequence': '<cls> GGU GUC <mask> CAU <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'}]
 ```
 
 ### Downstream Use
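With several `<mask>` tokens in the input, the fill-mask pipeline returns one candidate list per mask, which is why the updated example indexes `[1]` to show predictions for the second mask. A minimal sketch of iterating over all masks (assuming the `multimolecule` package and the `multimolecule/utrbert-3mer` weights are installed and reachable):

```python
import multimolecule  # noqa: F401 -- registers MultiMolecule models with transformers
from transformers import pipeline

unmasker = pipeline("fill-mask", model="multimolecule/utrbert-3mer")
results = unmasker("gguc<mask><mask><mask>cugguuagaccagaucugagccu")

# With multiple masks, the pipeline returns one list of candidates per mask,
# each sorted by score in descending order.
for i, candidates in enumerate(results):
    best = candidates[0]
    print(f"mask {i}: {best['token_str']} (score={best['score']:.4f})")
```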
@@ -155,11 +168,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
 from multimolecule import RnaTokenizer, UtrBertModel
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = UtrBertModel.from_pretrained(
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
+model = UtrBertModel.from_pretrained("multimolecule/utrbert-3mer")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 
 output = model(**input)
 ```
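For feature extraction it helps to know what this call returns. A minimal sketch, assuming `UtrBertModel` follows the standard transformers convention of exposing `last_hidden_state` on its output object (an assumption; the diff does not show the output type):

```python
from multimolecule import RnaTokenizer, UtrBertModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
model = UtrBertModel.from_pretrained("multimolecule/utrbert-3mer")

input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
output = model(**input)

# Assumption: transformers-style output with per-token embeddings of shape
# (batch_size, sequence_length, hidden_size), where sequence_length covers
# the overlapping 3-mer tokens plus the special <cls>/<eos> tokens.
print(output.last_hidden_state.shape)
```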
@@ -175,17 +188,17 @@ import torch
 from multimolecule import RnaTokenizer, UtrBertForSequencePrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = UtrBertForSequencePrediction.from_pretrained(
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
+model = UtrBertForSequencePrediction.from_pretrained("multimolecule/utrbert-3mer")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 label = torch.tensor([1])
 
 output = model(**input, labels=label)
 ```
 
-####
+#### Token Classification / Regression
 
 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
 
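Since the card stresses that the model must be fine-tuned before use, a single training step makes the `labels` pattern concrete. A sketch only, assuming the head populates a transformers-style `loss` when `labels` is passed:

```python
import torch
from multimolecule import RnaTokenizer, UtrBertForSequencePrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
model = UtrBertForSequencePrediction.from_pretrained("multimolecule/utrbert-3mer")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
label = torch.tensor([1])  # dummy sequence-level label

# One gradient step: the loss is assumed to be returned when labels are given.
output = model(**input, labels=label)
output.loss.backward()
optimizer.step()
optimizer.zero_grad()
```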
@@ -193,14 +206,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
 
 ```python
 import torch
-from multimolecule import RnaTokenizer,
+from multimolecule import RnaTokenizer, UtrBertForTokenPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model =
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
+model = UtrBertForTokenPrediction.from_pretrained("multimolecule/utrbert-3mer")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), ))
 
 output = model(**input, labels=label)
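Note that the dummy labels in this hunk are drawn per nucleotide of the raw string (`len(text)`), not per overlapping 3-mer token. A sketch of inspecting the resulting loss and per-position logits, again assuming a transformers-style output object (an assumption, not shown in this diff):

```python
import torch
from multimolecule import RnaTokenizer, UtrBertForTokenPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
model = UtrBertForTokenPrediction.from_pretrained("multimolecule/utrbert-3mer")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))  # one dummy binary label per nucleotide

output = model(**input, labels=label)
# Assumption: loss and per-position logits follow transformers conventions.
print(output.loss, output.logits.shape)
```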
@@ -217,11 +230,11 @@ import torch
 from multimolecule import RnaTokenizer, UtrBertForContactPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = UtrBertForContactPrediction.from_pretrained(
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
+model = UtrBertForContactPrediction.from_pretrained("multimolecule/utrbert-3mer")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), len(text)))
 
 output = model(**input, labels=label)
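Contact prediction labels form a `len(text) × len(text)` matrix, one entry per nucleotide pair. A sketch of recovering a binary contact map from the output, assuming pairwise logits with a trailing class dimension (an assumption; the diff does not show the output shape):

```python
import torch
from multimolecule import RnaTokenizer, UtrBertForContactPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
model = UtrBertForContactPrediction.from_pretrained("multimolecule/utrbert-3mer")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))  # dummy pairwise contact labels

output = model(**input, labels=label)
# Assumption: logits carry a trailing class dimension, so argmax yields a
# binary contact map aligned with the label matrix.
contact_map = output.logits.argmax(dim=-1)
print(output.loss, contact_map.shape)
```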