Commit 20cdc2b by jradchenko and itay-levy · 0 Parent(s)

Duplicate from Deci/DeciCoder-1b

Co-authored-by: Itay Levy <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,180 @@
+ ---
+ pipeline_tag: text-generation
+ license: apache-2.0
+ tags:
+ - text generation
+ - Deci AI
+ - DeciCoder
+ programming_language:
+ - Java
+ - JavaScript
+ - Python
+ metrics:
+ - code_eval
+ inference: true
+ widget:
+ - text: 'def print_hello_world():'
+   example_title: Hello world
+   group: Python
+ model-index:
+ - name: DeciCoder-1b
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       type: nuprl/MultiPL-E
+       name: MultiPL-HumanEval (Python)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.191
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: nuprl/MultiPL-E
+       name: MultiPL-HumanEval (JavaScript)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.184
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: nuprl/MultiPL-E
+       name: MultiPL-HumanEval (Java)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.166
+       verified: false
+ datasets:
+ - bigcode/starcoderdata
+ ---
+
+ # Model Card for DeciCoder 1B
+
+ DeciCoder 1B is a 1 billion parameter decoder-only code completion model
+ trained on the Python, Java, and JavaScript subsets of the [StarCoder Training Dataset](https://huggingface.co/datasets/bigcode/starcoderdata).
+ The model uses Grouped Query Attention and has a context window of 2048
+ tokens. It was trained using a Fill-in-the-Middle training objective. The model's
+ architecture was generated by Deci's proprietary Neural Architecture
+ Search-based technology, AutoNAC.
+
+ ## Model Details
+
+ - **Developed by:** Deci
+ - **Model type:** DeciCoder is an auto-regressive language model based on the transformer decoder architecture, using Grouped Query Attention.
+ - **Language(s):** Python, Java, JavaScript
+ - **License:** Model checkpoints are licensed under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.
+
+ ## Model Architecture
+
+ | Parameters | Layers | Heads | Sequence Length | GQA num_key_value_heads | Hidden Size |
+ |:----------|:----------|:----------|:----------|:----------|:----------|
+ | 1.1B | 20 | 32 | 2048 | 4 | 2048 |
+
+
+ - **Decoder layer:** Grouped Query Attention [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245)
+ - **Position Embeddings:** Rotary Position Embeddings [Su et al., 2021](https://arxiv.org/abs/2104.09864)
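Reviewer sketch (not part of the commit): the architecture table above implies a per-head width and a key/value cache footprint that can be worked out directly. The snippet below is a minimal illustration using the table's numbers; the 2-byte cache entry (bfloat16) and the helper names are assumptions introduced here, not something the model card states.

```python
# Values copied from the Model Architecture table.
HIDDEN_SIZE = 2048
NUM_LAYERS = 20
NUM_HEADS = 32
NUM_KV_HEADS = 4
BYTES_PER_VALUE = 2  # assumption: cache entries are stored in bfloat16


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width; the hidden size must split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, dim: int, bytes_per_value: int) -> int:
    """Bytes needed to cache one token's keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * dim * bytes_per_value


dim = head_dim(HIDDEN_SIZE, NUM_HEADS)
queries_per_kv_head = NUM_HEADS // NUM_KV_HEADS
gqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, dim, BYTES_PER_VALUE)
mha_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_HEADS, dim, BYTES_PER_VALUE)

print(f"head_dim={dim}, query heads per KV head={queries_per_kv_head}")
print(f"KV cache per token: GQA={gqa_bytes} bytes, full multi-head={mha_bytes} bytes")
```

Under these assumptions, sharing each key/value head among 8 query heads shrinks the per-token cache by the same factor of 8 compared with one key/value head per query head.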
+
+ ## Uses
+
+ The model is intended for single-line and multi-line code completion from a
+ context window of up to 2048 tokens. It is *not* an instruction model,
+ and commands like "Write a function that computes the absolute value of
+ an integer" won't yield the desired results. A more effective approach
+ is to frame instructions in the style of source code comments
+ (e.g. `# this function calculates the absolute value of an integer`) or to present
+ a function signature and docstring, enabling the model to complete the
+ function's body.
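Reviewer sketch (not part of the commit): the two prompting styles described above can be illustrated with plain string helpers. The function names `comment_prompt` and `signature_prompt` are hypothetical and exist only for this example; the resulting strings would be passed to the tokenizer as the completion prefix.

```python
def comment_prompt(instruction: str) -> str:
    """Turn a natural-language instruction into a Python comment line."""
    return f"# {instruction.strip()}\n"


def signature_prompt(signature: str, docstring: str) -> str:
    """Build a function header plus docstring, leaving the body for the model."""
    return f'{signature.rstrip()}\n    """{docstring.strip()}"""\n'


print(comment_prompt("this function calculates the absolute value of an integer"))
print(signature_prompt("def absolute_value(x: int) -> int:", "Return the absolute value of x."))
```

Both helpers end the prompt with a newline so the model continues on a fresh line, which is a stylistic choice rather than a documented requirement.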
+
+ ### How to Use
+
+ ```python
+ # pip install -q transformers
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "Deci/DeciCoder-1b"
+ device = "cuda"  # for GPU usage or "cpu" for CPU usage
+
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)
+
+ inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+ outputs = model.generate(inputs, max_new_tokens=100)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ### Attribution
+
+ DeciCoder was trained on the StarCoder Training Dataset, filtered for
+ Python, Java, and JavaScript code. For additional information, please
+ refer to [https://huggingface.co/datasets/bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata).
+
+ ### Limitations
+
+ The model was trained on source code in Python, Java, and
+ JavaScript. The predominant natural language in the source is English,
+ although other languages are also present. The model can produce code snippets
+ given some context, but there is no assurance that the generated
+ code will work as intended. It may be inefficient or contain bugs or
+ exploits.
+
+ ## Training Details
+
+ ### Training Data
+
+ DeciCoder was trained on the Python, Java, and JavaScript subsets of the [StarCoder Training Dataset](https://huggingface.co/datasets/bigcode/starcoderdata).
+
+
+ ### Training Procedure
+
+ - **Warm-Up Steps**: 9000
+ - **Total Training Steps**: 284k
+ - **Total Tokens**: 446B
+ - **Global Batch Size**: 768
+ - **Optimizer**: AdamW
+ - **Optimizer Parameters**: beta1=0.9, beta2=0.95
+ - **Weight Decay**: 0.1
+ - **Learning Rate**: 4e-4
+ - **Learning Rate Schedule**: cosine
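Reviewer sketch (not part of the commit): the hyperparameters listed above can be cross-checked and turned into a schedule. The snippet assumes that the global batch size counts sequences of 2048 tokens, that warm-up is linear, and that the cosine decays to zero; none of these details is stated in the model card, so treat the output as illustrative only.

```python
import math

WARMUP_STEPS = 9_000
TOTAL_STEPS = 284_000
PEAK_LR = 4e-4
GLOBAL_BATCH_SIZE = 768
SEQUENCE_LENGTH = 2048  # assumption: every sample fills the 2048-token context window


def learning_rate(step: int, min_lr: float = 0.0) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay to min_lr (both assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


# Tokens processed if every step consumes a full batch of full-length sequences.
tokens_seen = TOTAL_STEPS * GLOBAL_BATCH_SIZE * SEQUENCE_LENGTH
print(f"{tokens_seen / 1e9:.1f}B tokens")

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, learning_rate(step))
```

Under these assumptions the token count comes out at roughly 446.7B, which is consistent with the 446B total reported in the list.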
+
+ ## Evaluation
+
+ Below are DeciCoder's pass@1 scores on MultiPL-HumanEval:
+
+ | Python | JavaScript | Java |
+ |:----------|:----------|:----------|
+ | 19.1% | 18.4% | 16.6% |
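Reviewer sketch (not part of the commit): the model card does not say how the pass@1 scores were computed. A common choice is the unbiased pass@k estimator, which takes `n` generated samples per problem, counts the `c` that pass the unit tests, and estimates the chance that at least one of `k` drawn samples passes. The per-problem counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for three problems: (samples generated, samples passing).
hypothetical_results = [(10, 2), (10, 0), (10, 10)]
print(benchmark_pass_at_k(hypothetical_results, k=1))
```

For `k=1` the estimator reduces to the fraction of passing samples per problem, averaged over problems.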
+
+
+ ### Runtime Benchmarks
+
+ | Inference Tool/Hardware | A10 (tokens/sec) | A100 (tokens/sec) |
+ |:----------|:----------|:----------|
+ | PyTorch | 1,364.2 | 3,244.4 |
+ | Infery LLM | 3,889.3 | 11,676.8 |
+
+ - Throughput (tokens/sec) was measured with the optimal batch size for each hardware: batch size 128 on A10 and batch size 512 on A100.
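Reviewer sketch (not part of the commit): the relative speedup implied by the benchmark table can be derived from its four numbers. The dictionary layout and the `speedup` helper are introduced here for illustration only.

```python
# Throughput in tokens/sec, copied from the Runtime Benchmarks table.
throughput = {
    "PyTorch": {"A10": 1364.2, "A100": 3244.4},
    "Infery LLM": {"A10": 3889.3, "A100": 11676.8},
}


def speedup(hardware: str) -> float:
    """Ratio of Infery LLM throughput to PyTorch throughput on one device."""
    return throughput["Infery LLM"][hardware] / throughput["PyTorch"][hardware]


for hardware in ("A10", "A100"):
    print(f"{hardware}: {speedup(hardware):.2f}x")
```

Because each device was measured at its own batch size, the ratios compare tools on the same hardware and should not be read as a comparison between the A10 and the A100.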
+
+ ## Documentation
+
+ - [Notebook](https://colab.research.google.com/drive/1JCxvBsWCZKHfIcHSMVf7GZCs3ClMQPjs)
+ - Blog post: [Introducing DeciCoder: The New Gold Standard in Efficient and Accurate Code Generation](https://deci.ai/blog/decicoder-efficient-and-accurate-code-generation-llm/)
+ - Questions: Feel free to contact us via our [Discord Community!](https://discord.com/invite/p9ecgRhDR8/)
+
+ ## How to Cite
+
+ Please cite this model using this format.
+
+ ```bibtex
+ @misc{DeciFoundationModels,
+   title = {DeciCoder},
+   author = {DeciAI Research Team},
+   year = {2023},
+   url = {https://huggingface.co/deci/decicoder-1b},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "architectures": [
+     "DeciCoderForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_decicoder.DeciCoderConfig",
+     "AutoModelForCausalLM": "modeling_decicoder.DeciCoderForCausalLM"
+   },
+   "bos_token_id": 0,
+   "eos_token_id": 0,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 5888,
+   "max_position_embeddings": 2048,
+   "num_attention_heads": 32,
+   "num_hidden_layers": 20,
+   "num_key_value_heads": 4,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "use_bfloat16": true,
+   "transformers_version": "4.31.0.dev0",
+   "use_cache": true,
+   "vocab_size": 49152
+ }
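Reviewer sketch (not part of the commit): the values in `config.json` are enough to estimate the parameter count. The snippet assumes LLaMA-style layers, meaning bias-free attention projections, a gated MLP with three weight matrices, and RMSNorm layers that hold only a scale vector. The `count_parameters` helper is hypothetical.

```python
# Subset of config.json needed for the estimate.
config = {
    "hidden_size": 2048,
    "intermediate_size": 5888,
    "num_attention_heads": 32,
    "num_hidden_layers": 20,
    "num_key_value_heads": 4,
    "vocab_size": 49152,
    "tie_word_embeddings": False,
}


def count_parameters(cfg: dict) -> int:
    """Estimate the weight count, assuming LLaMA-style bias-free layers."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_width = cfg["num_key_value_heads"] * head_dim
    attention = 2 * h * h + 2 * h * kv_width  # q_proj and o_proj, then k_proj and v_proj
    mlp = 3 * h * cfg["intermediate_size"]    # gate, up, and down projections (assumed)
    norms = 2 * h                             # two RMSNorm scale vectors per layer
    per_layer = attention + mlp + norms
    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * h
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


total = count_parameters(config)
print(f"{total:,} parameters, {total * 2:,} bytes in bfloat16")
```

The estimate lands at about 1.11B parameters, and at 2 bytes per weight it comes within a few kilobytes of the 2,227,364,400-byte `model.safetensors` pointer in this commit; the small remainder is presumably file metadata, which is an assumption.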
configuration_decicoder.py ADDED
@@ -0,0 +1,50 @@
+ from packaging import version
+ import transformers
+ if version.parse(transformers.__version__) < version.parse("4.31.0"):
+     raise ImportError(
+         f"You are using transformers=={transformers.__version__}, but transformers>=4.31.0 is required to use DeciCoder. Please upgrade transformers."
+     )
+ from transformers.models.llama.configuration_llama import LlamaConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+ LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+
+
+ class DeciCoderConfig(LlamaConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`DeciCoderModel`]. It extends [`LlamaConfig`]
+     with flags that select the attention implementation used during prefill and decode; all other arguments are
+     forwarded to [`LlamaConfig`].
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         naive_attention_prefill (`bool`, *optional*, defaults to False):
+             If True, use naive matmul attention during prefill; otherwise use scaled dot product attention.
+         naive_attention_decode_batched (`bool`, *optional*, defaults to True):
+             If True, use naive matmul attention during decode for batch_size > 1; otherwise use scaled dot
+             product attention.
+         naive_attention_decode_single (`bool`, *optional*, defaults to False):
+             If True, use naive matmul attention during decode for batch_size == 1; otherwise use scaled dot
+             product attention.
+     """
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         naive_attention_prefill: bool = False,
+         naive_attention_decode_batched: bool = True,
+         naive_attention_decode_single: bool = False,
+         **kwargs,
+     ):
+         self.naive_attention_prefill = naive_attention_prefill
+         self.naive_attention_decode_batched = naive_attention_decode_batched
+         self.naive_attention_decode_single = naive_attention_decode_single
+
+         super().__init__(**kwargs)
+
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:510256faa3d388cad1dcbc30c39d32f9289410a399f4c0435bec27ec135c6f0f
+ size 2227364400
modeling_decicoder.py ADDED
@@ -0,0 +1,253 @@
+ # coding=utf-8
+ # Copyright and license here
+ """ PyTorch DeciCoder model."""
+ import math
+ from typing import Optional, Tuple
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+ from packaging import version
+ import transformers
+ if version.parse(transformers.__version__) < version.parse("4.31.0"):
+     raise ImportError(
+         f"You are using transformers=={transformers.__version__}, but transformers>=4.31.0 is required to use DeciCoder. Please upgrade transformers."
+     )
+ from transformers.models.llama.modeling_llama import LlamaMLP, LlamaRMSNorm, LlamaAttention, apply_rotary_pos_emb, \
+     repeat_kv, LlamaPreTrainedModel, LLAMA_START_DOCSTRING, LlamaDecoderLayer, LlamaForCausalLM, LlamaModel
+ from transformers.utils import add_start_docstrings
+
+ from .configuration_decicoder import DeciCoderConfig
+
+ _CONFIG_FOR_DOC = "DeciCoderConfig"
+
+
+ class DeciCoderAttention(LlamaAttention):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: DeciCoderConfig):
+         nn.Module.__init__(self)
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.head_dim = self.hidden_size // self.num_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+         self.pretraining_tp = config.pretraining_tp
+         self.max_position_embeddings = config.max_position_embeddings
+         self.rope_theta = getattr(config, 'rope_theta', None)
+
+         if (self.head_dim * self.num_heads) != self.hidden_size:
+             raise ValueError(
+                 f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                 f" and `num_heads`: {self.num_heads})."
+             )
+         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+         self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+         self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+         self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+
+         self.naive_attention_prefill = config.naive_attention_prefill
+         self.naive_attention_decode_batched = config.naive_attention_decode_batched
+         self.naive_attention_decode_single = config.naive_attention_decode_single
+         self._init_rope()
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         bsz, q_len, _ = hidden_states.size()
+         # A populated KV cache means this is a decode step; otherwise it is the prefill pass.
+         is_decode = past_key_value is not None
+         # The scaled_dot_product_attention paths never materialize attention weights.
+         attn_weights = None
+         if self.pretraining_tp > 1:
+             key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.pretraining_tp
+             query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.pretraining_tp, dim=0)
+             key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
+             value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
+
+             query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
+             query_states = torch.cat(query_states, dim=-1)
+
+             key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)]
+             key_states = torch.cat(key_states, dim=-1)
+
+             value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)]
+             value_states = torch.cat(value_states, dim=-1)
+
+         else:
+             query_states = self.q_proj(hidden_states)
+             key_states = self.k_proj(hidden_states)
+             value_states = self.v_proj(hidden_states)
+
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+         kv_seq_len = key_states.shape[-2]
+         if past_key_value is not None:
+             kv_seq_len += past_key_value[0].shape[-2]
+         cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+         if past_key_value is not None:
+             # reuse k, v, self_attention
+             key_states = torch.cat([past_key_value[0], key_states], dim=2)
+             value_states = torch.cat([past_key_value[1], value_states], dim=2)
+
+         past_key_value = (key_states, value_states) if use_cache else None
+
+         if is_decode:
+             # Decode: group the query heads by their shared key/value head instead of repeating k/v.
+             query_states = query_states.view(bsz, self.num_key_value_heads, self.num_key_value_groups, self.head_dim)
+             if (self.naive_attention_decode_batched and bsz > 1) or (self.naive_attention_decode_single and bsz == 1):
+                 attn_weights = (query_states @ key_states.transpose(-2, -1)) / math.sqrt(key_states.size(-1))
+                 if attention_mask is not None:
+                     if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                         raise ValueError(
+                             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                         )
+                     # The additive mask must be applied to the raw scores, before the softmax.
+                     attn_weights = attn_weights + attention_mask
+
+                 # upcast attention to fp32
+                 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+                 attn_output = torch.matmul(attn_weights, value_states)
+             else:
+                 attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, is_causal=False,
+                                                              dropout_p=0.0)
+             attn_output = attn_output.contiguous().view(bsz, q_len, self.hidden_size)
+
+         else:
+             # Prefill: repeat k/v heads if n_kv_heads < n_heads
+             key_states = repeat_kv(key_states, self.num_key_value_groups)
+             value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+             if not self.naive_attention_prefill:
+                 attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, is_causal=True,
+                                                              dropout_p=0.0)
+             else:
+                 attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+                 # attn_weights = (query_states @ key_states.transpose(-2, -1)) / math.sqrt(key_states.size(-1))
+                 if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+                     raise ValueError(
+                         f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                         f" {attn_weights.size()}"
+                     )
+
+                 if attention_mask is not None:
+                     if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                         raise ValueError(
+                             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                         )
+                     attn_weights = attn_weights + attention_mask
+
+                 # upcast attention to fp32
+                 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+                 attn_output = torch.matmul(attn_weights, value_states)
+
+             if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+                 raise ValueError(
+                     f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                     f" {attn_output.size()}"
+                 )
+
+             attn_output = attn_output.transpose(1, 2).contiguous().view(bsz, q_len, self.hidden_size)
+             # attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+         if self.pretraining_tp > 1:
+             attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
+             o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
+             attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
+         else:
+             attn_output = self.o_proj(attn_output)
+
+         if not output_attentions:
+             attn_weights = None
+
+         return attn_output, attn_weights, past_key_value
+
+
+ class DeciCoderDecoderLayer(LlamaDecoderLayer):
+     def __init__(self, config: DeciCoderConfig):
+         nn.Module.__init__(self)
+         self.hidden_size = config.hidden_size
+         self.self_attn = DeciCoderAttention(config=config)
+         self.mlp = LlamaMLP(config)
+         self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+
+ @add_start_docstrings(
+     "The bare DeciCoder Model outputting raw hidden-states without any specific head on top.",
+     LLAMA_START_DOCSTRING,
+ )
+ class DeciCoderPreTrainedModel(LlamaPreTrainedModel):
+     config_class = DeciCoderConfig
+     _no_split_modules = ["DeciCoderDecoderLayer"]
+     _keys_to_ignore_on_load_missing = ["self_attn.rotary_emb.inv_freq"]
+
+
+ @add_start_docstrings(
+     "The bare DeciCoder Model outputting raw hidden-states without any specific head on top.",
+     LLAMA_START_DOCSTRING,
+ )
+ class DeciCoderModel(LlamaModel, DeciCoderPreTrainedModel):
+     """
+     Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`DeciCoderDecoderLayer`]
+
+     Args:
+         config: DeciCoderConfig
+     """
+
+     def __init__(self, config: DeciCoderConfig):
+         DeciCoderPreTrainedModel.__init__(self, config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList([DeciCoderDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+         self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+         self.gradient_checkpointing = False
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+         self._validate_config_supports_attention_mask(attention_mask, input_shape, past_key_values_length)
+         return LlamaModel._prepare_decoder_attention_mask(
+             self, attention_mask, input_shape, inputs_embeds, past_key_values_length)
+
+     def _validate_config_supports_attention_mask(self, attention_mask, input_shape, past_key_values_length):
+         # Only the naive matmul paths honor a custom (non all-ones) attention mask.
+         is_decode = past_key_values_length > 0
+         if not torch.all(torch.eq(attention_mask, 1)).item():
+             if is_decode:
+                 if input_shape[0] == 1 and not self.config.naive_attention_decode_single:
+                     raise ValueError(
+                         "For support of custom attention masks please set naive_attention_decode_single to True in the "
+                         "config")
+                 elif input_shape[0] > 1 and not self.config.naive_attention_decode_batched:
+                     raise ValueError(
+                         "For support of custom attention masks please set naive_attention_decode_batched to True in the "
+                         "config")
+             else:
+                 if not self.config.naive_attention_prefill:
+                     raise ValueError("For support of custom attention masks please set naive_attention_prefill to "
+                                      "True in the config")
+
+
+ class DeciCoderForCausalLM(LlamaForCausalLM, DeciCoderPreTrainedModel):
+     def __init__(self, config):
+         DeciCoderPreTrainedModel.__init__(self, config)
+         self.model = DeciCoderModel(config)
+         self.pretraining_tp = config.pretraining_tp
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
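Reviewer sketch (not part of the commit): the naive attention paths in `DeciCoderAttention.forward` combine raw scores, an additive attention mask, and a softmax. The pure-Python snippet below illustrates the usual ordering, in which the mask is added to the scores before normalizing so that masked positions receive zero weight. It uses negative infinity as the mask value for clarity, whereas real implementations typically use a large finite negative number; the helper names are hypothetical.

```python
import math

MASKED = float("-inf")  # additive mask value for positions that must be ignored


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, mask):
    """Add the mask to the raw scores first, then normalize."""
    return softmax([s + m for s, m in zip(scores, mask)])


scores = [2.0, 1.0, 0.5]
mask = [0.0, 0.0, MASKED]  # hide the last key position
weights = masked_attention_weights(scores, mask)
print(weights)
```

Adding the mask after the softmax instead would leave the visible weights summing to less than one and push the masked entry to a large negative value, which is why the order matters. The sketch does not handle the degenerate case where every position is masked.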
special_tokens_map.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "additional_special_tokens": [
+     "<|endoftext|>",
+     "<fim_prefix>",
+     "<fim_middle>",
+     "<fim_suffix>",
+     "<fim_pad>",
+     "<filename>",
+     "<gh_stars>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<empty_output>",
+     "<commit_before>",
+     "<commit_msg>",
+     "<commit_after>",
+     "<reponame>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
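Reviewer sketch (not part of the commit): the special tokens above include `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`, which matches the Fill-in-the-Middle objective mentioned in the model card. The commit does not document the prompt layout, so the prefix, suffix, middle ordering below is an assumption borrowed from the StarCoder convention, and `build_fim_prompt` is a hypothetical helper.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble an infilling prompt; the model is expected to generate the middle."""
    # Assumed layout: code before the gap, code after the gap, then the middle marker.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def absolute_value(x):\n    ",
    suffix="\n    return result\n",
)
print(prompt)
```

The resulting string would be tokenized like any other prompt, with generation stopping at `<|endoftext|>`; whether this layout gives good infilling results with this checkpoint should be verified empirically.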
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "add_prefix_space": false,
+   "additional_special_tokens": [
+     "<|endoftext|>",
+     "<fim_prefix>",
+     "<fim_middle>",
+     "<fim_suffix>",
+     "<fim_pad>",
+     "<filename>",
+     "<gh_stars>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<empty_output>",
+     "<commit_before>",
+     "<commit_msg>",
+     "<commit_after>",
+     "<reponame>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "model_max_length": 1000000000000000019884624838656,
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>",
+   "vocab_size": 49152
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff