alea-institute committed on
Commit
0772425
·
verified ·
1 Parent(s): 9d81011

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,177 @@
1
+ ---
2
+ language:
3
+ - en
4
+ library_name: transformers
5
+ license: cc-by-4.0
6
+ tags:
7
+ - kl3m
8
+ - kl3m-002
9
+ - patent
10
+ - all the patents
11
+ - slm
12
+ date: '2024-03-12T00:00:00.000Z'
13
+ pipeline_tag: text-generation
14
+ widget:
15
+ - text: "# Title\n"
16
+ - temperature: 0.3
17
+ - do_sample: True
18
+ ---
19
+
20
+ # All the Patents 170m Model
21
+
22
+ `kl3m-002-170m-patent` is a (very) small language model (SLM) fine-tuned from `kl3m-002-170m` to
23
+ generate "realistic" patent text. For more information about the base model,
24
+ please see [its model page](https://huggingface.co/alea-institute/kl3m-002-170m).
25
+
26
+ # All the Patents
27
+
28
+ ## Why?
29
+
30
+ #### If a GPT2-sized model can generate a valid set of claims, should anyone be able to monopolize the invention?
31
+
32
+ At their heart, patents are a temporary, sanctioned monopoly on an invention through a license to sue. This monopoly
33
+ is justified by the public good created by encouraging innovation and the long-term impact of that innovation being
34
+ shared in the public domain.
35
+
36
+ Unfortunately, this worthy policy goal has been lost in the chaos and misuse of the patent system.
37
+
38
+ One of the most common sources of frustration is the granting of "obvious" patents. While some inventions are clearly novel
39
+ and non-obvious, many are not - but still slip through the examination process. These obvious but granted patents then
40
+ loom large over the market, creating a "thicket" that discourages use or subsequent invention in the area of the granted
41
+ patent. "Undoing" the grant of a patent is a costly and time-consuming process with possible negative consequences, and
42
+ so many of these patents simply sit as prior art on the books, even if the patent holder knows they could never enforce them.
43
+
44
+ Congress and various stakeholders have discussed and proposed changes over time, including most recently the
45
+ America Invents Act (AIA), but the problem of obvious patents persists.
46
+
47
+ But what if someone were to generate all the obvious inventions and make them public?
48
+
49
+ What if we shared the means of producing these obvious inventions so that everyone could help generate them on a normal CPU or consumer GPU?
50
+
51
+ And what if we could then make those obvious inventions easily searchable for anyone, including PTO examiners themselves, to use?
52
+
53
+ ## How it Works
54
+
55
+ We start with a small, GPT2-sized language model - [kl3m-170](https://273ventures.com/kl3m-the-first-legal-large-language-model/) - which was trained on a clean, copyright-free dataset.
56
+ This helps us ensure that generations do not include copyrighted text, which would allow third parties to interfere with the project
57
+ via DMCA takedown requests.
58
+
59
+ Next, we fine-tune this model on two simultaneous tasks:
60
+
61
+ 1. **Top-down drafting**: We start from the most abstract parts of the patent - the title and abstract - and then generate the detailed claims. This is a traditional next-token prediction order.
62
+
63
+ ```text
64
+ # Patent
65
+
66
+ ## Title
67
+ {title}
68
+
69
+ ## Abstract
70
+ {abstract}
71
+
72
+ ## Claims
73
+
74
+ 1. {claim 1}
75
+
76
+ 2. {claim 2}
77
+
78
+ ...
79
+ ```
80
+
81
+ 2. **Bottom-up drafting**: We start from the most detailed part of the patent - the claims - and then generate the abstract and title. This reversed order can be thought of as similar to traditional extractive/abstractive summarization tasks.
82
+
83
+ ```text
84
+ # Patent
85
+
86
+ ## Claims
87
+
88
+ 1. {claim 1}
89
+
90
+ 2. {claim 2}
91
+
92
+ ...
93
+
94
+ ## Abstract
95
+ {abstract}
96
+
97
+ ## Title
98
+ {title}
99
+ ```
100
+
101
+ Once this fine-tuning is complete, we can then generate new patents using either technique by prompting the model as follows:
102
+
103
+ 1. **Top-down prompt**: `"# Patent\n\n## Title"`
104
+
105
+ 2. **Bottom-up prompt**: `"# Patent\n\n## Claims"`
106
+
107
+ It's critical that generation occurs with sufficient randomness and diversity to ensure that the generated patents are not
108
+ simply reproductions of the training data. This is a key area of ongoing research and development.
109
+
110
+ **Much like the real process of invention, most of the "ideas" generated by this process will be either nonsense or
111
+ otherwise unpatentable. Our goal is to estimate the "hit rate" of the model and continue to improve the efficiency and
112
+ accessibility of the generation process so that the "cost per obvious invention" is as low as possible.**
113
+
114
+ ## Current Status
115
+
116
+ This project is still in its infancy. We're doing R&D to develop prototype tools to demonstrate the possibility and
117
+ cost of generating and sharing these obvious inventions. This R&D is currently focused on data collection,
118
+ data curation, model training, and model evaluation.
119
+
120
+
121
+ ## Generation
122
+
123
+ You can generate your own examples as follows:
124
+
125
+ ```python
126
+ import json
127
+ from transformers import pipeline
128
+
129
+ # Load the model and tokenizer on CPU
130
+ p = pipeline('text-generation', 'alea-institute/kl3m-002-170m-patent', device='cpu')
131
+
132
+ # Example usage on CPU
133
+ text = "# Title\n"
134
+ print(
135
+     json.dumps(
136
+         [
137
+             r.get("generated_text")
138
+             for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=1024)
139
+         ],
140
+         indent=2
141
+     )
142
+ )
143
+ ```
144
+
145
+ ### Related Material
146
+
147
+ * https://www.federalregister.gov/documents/2024/02/27/2024-03967/updated-guidance-for-making-a-proper-determination-of-obviousness
148
+
149
+ ## License
150
+
151
+ This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.
152
+
153
+ The model weights are released under the CC-BY 4.0 License.
154
+
155
+ ## Contact
156
+
157
+ The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:
158
+
159
+ - GitHub: https://github.com/alea-institute/kl3m-model-research
160
+ - Email: [email protected]
161
+ - Website: https://aleainstitute.ai
162
+
163
+ ## Acknowledgments
164
+
165
+ Special thanks to 273 Ventures for developing and donating this model to the open-source community through the ALEA Institute.
166
+
167
+
168
+ ## Citation
169
+
170
+ Tokenizer, dataset, and model publications are pending.
171
+
172
+ ## Contact
173
+
174
+ For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
175
+ create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).
176
+
177
+ ![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
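
The two layouts described under "How it Works" in the README above can be sketched as plain string templates. This is only an illustration: the helper names (`format_claims`, `top_down`, `bottom_up`) are hypothetical, and the exact whitespace used in the real training data is an assumption based on the templates shown in the README.

```python
from typing import List

# Prompts quoted from the README's "How it Works" section.
TOP_DOWN_PROMPT = "# Patent\n\n## Title"
BOTTOM_UP_PROMPT = "# Patent\n\n## Claims"


def format_claims(claims: List[str]) -> str:
    """Number the claims from 1 and separate them with blank lines."""
    return "\n\n".join(f"{i}. {claim}" for i, claim in enumerate(claims, start=1))


def top_down(title: str, abstract: str, claims: List[str]) -> str:
    """Title, then abstract, then claims (assumed layout, per the README template)."""
    return (
        "# Patent\n\n"
        f"## Title\n{title}\n\n"
        f"## Abstract\n{abstract}\n\n"
        f"## Claims\n\n{format_claims(claims)}\n"
    )


def bottom_up(title: str, abstract: str, claims: List[str]) -> str:
    """Claims, then abstract, then title (assumed layout, per the README template)."""
    return (
        "# Patent\n\n"
        f"## Claims\n\n{format_claims(claims)}\n\n"
        f"## Abstract\n{abstract}\n\n"
        f"## Title\n{title}\n"
    )


if __name__ == "__main__":
    # Placeholder values only; not taken from any real patent.
    print(top_down("Example title", "Example abstract.", ["A first claim.", "A second claim."]))
```

Each template begins with the matching prompt string, which is why prompting with only the header lets the model continue in the corresponding order.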
config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "_name_or_path": "kl3m-002-170m-patent",
3
+ "architectures": [
4
+ "GPTNeoXForCausalLM"
5
+ ],
6
+ "attention_bias": true,
7
+ "attention_dropout": 0.0,
8
+ "bos_token_id": 0,
9
+ "classifier_dropout": 0.1,
10
+ "eos_token_id": 1,
11
+ "hidden_act": "gelu",
12
+ "hidden_dropout": 0.0,
13
+ "hidden_size": 1024,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 1024,
16
+ "layer_norm_eps": 1e-05,
17
+ "max_position_embeddings": 4096,
18
+ "model_type": "gpt_neox",
19
+ "num_attention_heads": 16,
20
+ "num_hidden_layers": 16,
21
+ "num_key_value_heads": 8,
22
+ "pad_token_id": 2,
23
+ "rms_norm_eps": 1e-06,
24
+ "rope_scaling": null,
25
+ "rope_theta": 10000,
26
+ "rotary_emb_base": 10000,
27
+ "rotary_pct": 0.25,
28
+ "tie_word_embeddings": false,
29
+ "torch_dtype": "float32",
30
+ "transformers_version": "4.38.0",
31
+ "use_cache": false,
32
+ "use_parallel_residual": true,
33
+ "vocab_size": 32768
34
+ }
generation_config.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 0,
4
+ "eos_token_id": 1,
5
+ "pad_token_id": 2,
6
+ "transformers_version": "4.38.0",
7
+ "use_cache": false
8
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9d4b0f72f7f1d6c9dbe753cf4b20f6f3e922771e121acef7902875071904584f
3
+ size 671774872
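
As a rough sanity check, the values in `config.json` can be compared against the `model.safetensors` size above. The sketch below assumes a standard GPT-NeoX layout (fused query/key/value projection with bias, biased dense layers, layer norms with weight and bias, untied input and output embeddings) and assumes that the remaining config keys do not change tensor shapes. The function name is hypothetical and the result is an estimate, not a statement about how the checkpoint was actually produced.

```python
def estimate_gpt_neox_params(hidden_size: int, intermediate_size: int,
                             num_layers: int, vocab_size: int) -> int:
    """Rough parameter count for a GPT-NeoX-style decoder with untied embeddings."""
    embed_in = vocab_size * hidden_size
    embed_out = vocab_size * hidden_size  # untied output projection, no bias
    layer_norm = 2 * hidden_size  # weight + bias
    qkv = hidden_size * 3 * hidden_size + 3 * hidden_size  # fused QKV with bias
    attn_out = hidden_size * hidden_size + hidden_size
    mlp_up = hidden_size * intermediate_size + intermediate_size
    mlp_down = intermediate_size * hidden_size + hidden_size
    # Two layer norms per block: before attention and before the MLP.
    per_layer = 2 * layer_norm + qkv + attn_out + mlp_up + mlp_down
    # One final layer norm after the last block.
    return embed_in + num_layers * per_layer + layer_norm + embed_out


# Values copied from config.json in this commit.
params = estimate_gpt_neox_params(
    hidden_size=1024, intermediate_size=1024, num_layers=16, vocab_size=32768
)
weight_bytes = params * 4      # "torch_dtype": "float32" means 4 bytes per parameter
file_bytes = 671_774_872       # size from the model.safetensors LFS pointer

print(params)                     # 167938048
print(file_bytes - weight_bytes)  # 22680
```

Under these assumptions the estimate is about 168M parameters, in line with the "170m" in the model name. The roughly 22 KB left over would be consistent with a safetensors metadata header, though that is an inference rather than something the commit states.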
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<|start|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|end|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "mask_token": {
17
+ "content": "<|mask|>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "pad_token": "<|end|>",
24
+ "unk_token": {
25
+ "content": "<|unk|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff