ayethuzar committed on
Commit 936b764 · unverified · 1 Parent(s): 0f6d279

upate milestone4 doc

Files changed (1)
  1. milestone4Documentation.md +74 -138
milestone4Documentation.md CHANGED
@@ -1,138 +1,74 @@
- {
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "provenance": []
- },
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3"
- },
- "language_info": {
- "name": "python"
- }
- },
- "cells": [
- {
- "cell_type": "markdown",
- "source": [
- "# CS 670 Project - Finetuning Language Models"
- ],
- "metadata": {
- "id": "plgYaqGbr0LM"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "************************\n",
- "\n",
- "Deliverables\n",
- "\n",
- "************************\n",
- "\n",
- "Milestone-3 notebook: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_milestone_3_AyeThuzar.ipynb\n",
- "\n",
- "Hugging Face App: https://huggingface.co/spaces/ayethuzar/can-i-patent-this\n",
- "\n",
- "Landing Page for the App: https://sites.google.com/view/cs670-finetuning-language-mode/home\n",
- "\n",
- "App Demonstration Video: [https://youtu.be/UEWUe-8fDOw](https://youtu.be/IXMJDoUqXK4)\n",
- "\n",
- "The tuned model shared to the Hugging Face Hub: https://huggingface.co/ayethuzar/tuned-for-patentability/tree/main\n",
- "\n",
- "************************\n"
- ],
- "metadata": {
- "id": "GIL5rFb4r5dc"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Dataset: https://github.com/suzgunmirac/hupd"
- ],
- "metadata": {
- "id": "oAdWeGdcr8_T"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "**Data Preprocessing**\n",
- "\n",
- " I used the load_dataset function to load all the patent applications that were filed to the USPTO in January 2016. We specify the date ranges of the training and validation sets as January 1-21, 2016 and January 22-31, 2016, respectively. This is a smaller dataset.\n",
- "\n",
- " There are two datasets: train and validation. Here are the steps I did:\n",
- "\n",
- " - Label-to-index mapping for the decision status field\n",
- " - map the 'abstract' and 'claims' sections and tokenize them using pretrained('distilbert-base-uncased') tokenizer\n",
- " - format them\n",
- " - use DataLoader with batch_size = 16"
- ],
- "metadata": {
- "id": "DwKVDJSWr_Tc"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "**milestone 3:**\n",
- "\n",
- "The following notebook has the tuned model. There are 6 classes in the Harvard USPTO patent dataset and I decided to encode them as follow:\n",
- "\n",
- "decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 1, 'CONT-REJECTED': 0, 'CONT-ACCEPTED': 1, 'CONT-PENDING': 1}\n",
- "\n",
- "so that I can get a patentability score between 0 and 1.\n",
- "\n",
- "I use the pertained-model 'distilbert-base-uncased' from the Hugging face hub and tune it with the smaller dataset.\n",
- "\n",
- "My tuned model's performance is not good but I ran out of time. =(\n",
- "\n",
- "milestone3 notebook: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_milestone_3_AyeThuzar.ipynb\n",
- "\n",
- "The tuned model shared to the Hugging Face Hub: https://huggingface.co/ayethuzar/tuned-for-patentability/tree/main\n",
- "\n",
- "I tested my shared model here: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_Examples.ipynb"
- ],
- "metadata": {
- "id": "TCLsgp79sBnG"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "**milestone 4**\n",
- "\n",
- "This is the landing page for milestone 4 : https://sites.google.com/view/cs670-finetuning-language-mode/home\n",
- "\n",
- "The documentation for milestone 4: https://github.com/aye-thuzar/CS670Project/blob/main/milestone4Documentation.md\n",
- "\n",
- "I did not get a chance to fix my video, so it only has the model before I tuned it. After my tuned it, my model is only showing a patentabiilty score no matter which texts, I put for abstract and claims. =("
- ],
- "metadata": {
- "id": "O9Y9HKhZ5-09"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "**************\n",
- "\n",
- "References:\n",
- "\n",
- "1. https://colab.research.google.com/drive/1_ZsI7WFTsEO0iu_0g3BLTkIkOUqPzCET?usp=sharing#scrollTo=B5wxZNhXdUK6\n",
- "2. https://huggingface.co/AI-Growth-Lab/PatentSBERTa\n",
- "3. https://huggingface.co/anferico/bert-for-patents\n",
- "4. https://huggingface.co/transformers/v3.2.0/custom_datasets.html\n",
- "5. https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing\n",
- "6. https://huggingface.co/docs/transformers/model_sharing\n",
- "7. https://docs.streamlit.io/library/api-reference/widgets/st.file_uploader"
- ],
- "metadata": {
- "id": "VXhpu-LosEKk"
- }
- }
- ]
- }
 
+ # CS 670 Project - Finetuning Language Models
+
+ ************************
+
+ Deliverables
+
+ ************************
+
+ Milestone-3 notebook: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_milestone_3_AyeThuzar.ipynb
+
+ Hugging Face App: https://huggingface.co/spaces/ayethuzar/can-i-patent-this
+
+ Landing Page for the App: https://sites.google.com/view/cs670-finetuning-language-mode/home
+
+ App Demonstration Video: [https://youtu.be/UEWUe-8fDOw](https://youtu.be/IXMJDoUqXK4)
+
+ The tuned model shared to the Hugging Face Hub: https://huggingface.co/ayethuzar/tuned-for-patentability/tree/main
+
+ ************************
+
+ Dataset: https://github.com/suzgunmirac/hupd
+
+
+
+
+ **Data Preprocessing**
+
+ I used the load_dataset function to load all the patent applications filed with the USPTO in January 2016. I specified the date ranges of the training and validation sets as January 1-21, 2016 and January 22-31, 2016, respectively. This is a smaller subset of the full dataset.
+
+ There are two datasets: train and validation. Here are the steps I followed:
+
+ - Map each label in the decision status field to an index
+ - Map the 'abstract' and 'claims' sections and tokenize them with the pretrained 'distilbert-base-uncased' tokenizer
+ - Format them
+ - Use a DataLoader with batch_size = 16
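
Editor's sketch (not part of the commit): the preprocessing steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not the notebook's actual code. The column name `decision`, the helper names `encode_decision` and `build_loader`, and the `max_length` of 512 are assumptions made for illustration; the label mapping, the section names, the tokenizer checkpoint, and the batch size come from the document.

```python
# Label-to-index mapping taken from the document's six decision classes.
DECISION_TO_LABEL = {
    "REJECTED": 0, "ACCEPTED": 1, "PENDING": 1,
    "CONT-REJECTED": 0, "CONT-ACCEPTED": 1, "CONT-PENDING": 1,
}


def encode_decision(example):
    """Map one example's decision string to its integer label.

    Assumes the decision status is stored in a column named 'decision'.
    """
    return {"label": DECISION_TO_LABEL[example["decision"]]}


def build_loader(dataset, section="abstract", batch_size=16, max_length=512):
    """Tokenize one text section of a Hugging Face `datasets.Dataset`
    and wrap it in a PyTorch DataLoader.

    Hypothetical helper: heavy dependencies are imported here so the
    label mapping above can be used without them installed.
    """
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Step 1: label-to-index mapping for the decision status field.
    dataset = dataset.map(encode_decision)

    # Step 2: tokenize the chosen section ('abstract' or 'claims').
    # Padding to a fixed length lets the default collate function stack rows.
    dataset = dataset.map(
        lambda batch: tokenizer(
            batch[section],
            truncation=True,
            padding="max_length",
            max_length=max_length,
        ),
        batched=True,
    )

    # Step 3: expose only the model inputs and labels as torch tensors.
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

    # Step 4: batch with the batch size named in the document.
    return DataLoader(dataset, batch_size=batch_size)
```

Under these assumptions, `build_loader` would be called once per split and once per section, for example on the train split with `section="claims"`.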
+
+ **Milestone 3**
+
+ The following notebook has the tuned model. There are 6 classes in the Harvard USPTO patent dataset, and I decided to encode them as follows:
+
+ decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 1, 'CONT-REJECTED': 0, 'CONT-ACCEPTED': 1, 'CONT-PENDING': 1}
+
+ so that I can get a patentability score between 0 and 1.
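
Editor's sketch (not part of the commit): with the six classes collapsed to 0 and 1, one plausible way to obtain a score between 0 and 1 is to take the softmax probability the classifier assigns to class 1. The document does not say how the app computes its score, so the function name `patentability_score` and the softmax reading are assumptions.

```python
import math


def patentability_score(logits):
    """Return the softmax probability of class 1 given [logit_0, logit_1].

    Assumption: the score is the model's probability for label 1.
    """
    # Subtract the maximum logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)
```

Equal logits give a score of 0.5, and the score moves toward 1 as the class-1 logit grows relative to the class-0 logit.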
+
+ I used the pretrained model 'distilbert-base-uncased' from the Hugging Face Hub and fine-tuned it on the smaller dataset.
+
+ My tuned model's performance is not good, but I ran out of time. =(
+
+ Milestone 3 notebook: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_milestone_3_AyeThuzar.ipynb
+
+ The tuned model shared to the Hugging Face Hub: https://huggingface.co/ayethuzar/tuned-for-patentability/tree/main
+
+ I tested my shared model here: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_Examples.ipynb
+
+
+ **Milestone 4**
+
+ This is the landing page for milestone 4: https://sites.google.com/view/cs670-finetuning-language-mode/home
+
+ The documentation for milestone 4: https://github.com/aye-thuzar/CS670Project/blob/main/milestone4Documentation.md
+
+ I did not get a chance to fix my video, so it only shows the model before I tuned it. After I tuned it, my model shows the same patentability score no matter which text I enter for the abstract and claims. =(
+
+ **************
+
+ References:
+
+ 1. https://colab.research.google.com/drive/1_ZsI7WFTsEO0iu_0g3BLTkIkOUqPzCET?usp=sharing#scrollTo=B5wxZNhXdUK6
+ 2. https://huggingface.co/AI-Growth-Lab/PatentSBERTa
+ 3. https://huggingface.co/anferico/bert-for-patents
+ 4. https://huggingface.co/transformers/v3.2.0/custom_datasets.html
+ 5. https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing
+ 6. https://huggingface.co/docs/transformers/model_sharing
+ 7. https://docs.streamlit.io/library/api-reference/widgets/st.file_uploader