UNIST-Eunchan committed on
Commit
7f93a8e
·
1 Parent(s): 4490d89

Update README.md

Files changed (1)
  1. README.md +101 -0
README.md CHANGED
@@ -258,6 +258,90 @@ widget:
258
  can benefit model development itself (section 8).
259
  Question, Answer:
260
  example_title: NLG-Eval (2202.06935)
261
 
262
 
263
  datasets:
@@ -411,6 +495,23 @@ output= [' What was the size of each untrained model?[SEP] The size of the model
411
 
412
  ```
413
 
414
 
415
  ## Training and evaluation data
416
  - Used Dataset: [UNIST-Eunchan/NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset.
 
258
  can benefit model development itself (section 8).
259
  Question, Answer:
260
  example_title: NLG-Eval (2202.06935)
261
+ - text: >-
262
+ Generate Question, Answer pair correspond to the following research paper.
263
+ [Abstract] Humans have harbored a longstanding desire to acquire additional abilities through
264
+ absorption. Super Mario serves as an embodiment of this human dream, which
265
+ can collect items to gain extra skills such as throwing fireballs and being temporarily
266
+ invincible. In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of
267
+ homologous models without the need for retraining or GPUs. Typically, new
268
+ abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in
269
+ the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters).
270
+ We initially observe that by introducing a novel operation called DARE (Drop And
271
+ REscale), most of the delta parameters can be directly set to zeros without affecting
272
+ the capabilities of SFT LMs and larger models can tolerate a higher proportion
273
+ of discarded parameters. Based on this observation, we further sparsify delta
274
+ parameters of multiple SFT homologous models with DARE and subsequently
275
+ merge them into a single model by parameter averaging. We conduct experiments
276
+ on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also
277
+ merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental
278
+ results show that: (1) The delta parameter value ranges for SFT models are typically
279
+ small, often within 0.005, and DARE can eliminate 99% of them effortlessly.
280
+ However, once the models are continuously pre-trained, the value ranges can grow
281
+ to around 0.03, making DARE impractical. We have also tried to remove fine-tuned
282
+ instead of delta parameters and find that a 10% reduction can lead to drastically
283
+ decreased performance (even to 0.0). This highlights that SFT merely stimulates
284
+ the abilities via delta parameters rather than injecting new abilities into LMs; (2)
285
+ DARE can merge multiple task-specific LMs into one LM with diverse abilities.
286
+ For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following
287
+ ability while surpassing WizardMath’s original 64.2 performance. All resources
288
+ are available at https://github.com/yule-BUAA/MergeLM.
289
+ [Introduction] Human beings have always expressed their ambition to acquire additional abilities through various
290
+ ways such as movies and games. For example, in X-Men’s Apocalypse, the character can absorb the
291
+ powers of other mutants to strengthen himself. Likewise, the protagonist in the Super Mario games
292
+ can gain superpowers like throwing fireballs by absorbing in-game items. Large Language Models
293
+ (LLMs), such as GPT-4 [45], can reasonably be considered as early iterations of artificial general
294
+ intelligence systems, given their performance is remarkably close to human-level capabilities. In this paper, we astonishingly find that LMs, similar to Apocalypse and Super Mario, can enhance their
295
+ capabilities by absorbing other models without the need for training or GPUs.
296
+ Formally, Supervised Fine-Tuning (SFT) is the most widely adopted strategy for assigning task-specific capabilities to LMs by optimizing their parameters [13, 67]. The effectiveness of SFT is
297
+ fully evident in the alteration of the model parameters before and after SFT, referred to as delta
298
+ parameters [12]. We initially demonstrate that SFT LM (either encoder- or decoder-based) always
299
+ tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which
300
+ randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the
301
+ remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE,
302
+ when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with
303
+ minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the
304
+ larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank
305
+ structures akin to LoRA [25]. Thus, even when most of these structures are removed, resulting in a
306
+ low-rank and extremely sparse delta parameter set, the LM can still retain its capabilities.
307
+ Based on this observation, we can confidently merge multiple homologous SFT LMs (pre-trained
308
+ from the same backbone) without significant concerns about the decrease in their capabilities. As
309
+ long as a small portion of the delta parameters remains unaffected in the merging process, the abilities
310
+ of LMs unlocked by SFT can still be preserved. We first employ DARE to eliminate redundant
311
+ delta parameters in each model before merging, which can potentially mitigate the interference of
312
+ parameters among multiple models [62]. Then, we apply established model merging techniques
313
+ [59, 26, 44, 27, 62] to the parameters with reduced redundancy to create a single model with diverse
314
+ capabilities. We conduct extensive experiments on encoder-based LMs on eight datasets from the
315
+ GLUE benchmark, and decoder-based Llama 2 with three distinct abilities: instruction-following,
316
+ mathematical reasoning, and code-generating. We observe that:
317
+ (1) SFT LMs exhibit a substantial number of redundant delta parameters whether they are based on
318
+ BERT, RoBERTa, or Llama 2. DARE allows the removal of approximately 90% or even 99% delta
319
+ parameters without significantly affecting the performance of downstream tasks. The rescale operation
320
+ in DARE is a crucial component to guarantee effective ablations of delta parameters. Without
321
+ rescaling, removing only 10% delta parameters would noticeably affect performance. We attribute
322
+ this phenomenon to the fact that rescaling helps preserve the connectivity of model parameters [46].
323
+ (2) DARE is able to enhance the performance of most existing model merging methods when merging
324
+ encoder-based LMs on the eight datasets from GLUE. When it comes to larger LMs based on Llama
325
+ 2, the simple parameter averaging method can already produce surprisingly good results. As shown
326
+ in Figure 1(b), we merge WizardLM and WizardMath by combining DARE and parameter averaging,
327
+ leading to a significant improvement of WizardLM’s mathematical reasoning ability from 2.2 to 64.2
328
+ accuracy on GSM8K, while also modestly enhancing its instruction-following ability with win rate
329
+ from 67.2 to 67.5 on AlpacaEval. It is worth noticing that all these benefits are achieved by solely
330
+ using CPUs without further training. Similar improvements can also be observed when merging
331
+ code-generating models. (3) DARE is applicable to SFT delta parameters whose value ranges are relatively small. Different
332
+ from the observations of delta parameters, dropping only 10% fine-tuned parameters would lead to a
333
+ catastrophic decrease in performance, even approaching zero. We also find that the delta parameters
334
+ of SFT LMs usually stay within a range of 0.005 or less, indicating minimal modifications to the
335
+ pre-trained LM. However, once we continue pre-training, the delta parameters can rapidly reach
336
+ around 0.03, making DARE infeasible. This further confirms that SFT primarily unlocks the abilities
337
+ of the pre-trained LM, rather than introducing additional abilities.
338
+ Last but not least, we have implemented an open-sourced codebase at https://github.com/
339
+ yule-BUAA/MergeLM, which integrates existing popular model merging methods and supports both
340
+ encoder- and decoder-based language models. We hope this work can advance the understanding of
341
+ how alignment works from the perspective of parameters.
342
+
343
+ Question, Answer:
344
+ example_title: LM-SuperMario (2311.03099)
345
 
346
 
347
  datasets:
 
495
 
496
  ```
497
 
498
+ ## Inference Examples
499
+ ```
500
+ If the Inference API produces poor output, you can call model.generate() in your own code for better results.
501
+ ```
502
+
503
+ - (1) Attention is All You Need
504
+ - (https://arxiv.org/abs/1706.03762)
505
+ - (2) The Power of Scale for Parameter-Efficient Prompt Tuning
506
+ - (https://arxiv.org/abs/2104.08691)
507
+ - (3) (LK-99 paper, not an NLP paper) The First Room-Temperature Ambient-Pressure Superconductor
508
+ - (https://arxiv.org/abs/2307.12008)
509
+ - (4) Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
510
+ - (https://arxiv.org/abs/2202.06935)
511
+ - (5) Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
512
+ - (https://arxiv.org/abs/2311.03099)
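
As a rough illustration of the `model.generate()` route mentioned above, the sketch below builds a prompt in the same layout as the widget examples and splits a decoded output on the `[SEP]` marker that appears in the sample output. Several things here are assumptions rather than facts from this README: the helper names (`build_prompt`, `parse_qa`, `generate_qa`) are hypothetical, the model is assumed to load as a sequence-to-sequence checkpoint via `AutoModelForSeq2SeqLM`, the generation settings are arbitrary, and `model_id` is a placeholder you would replace with this repository's id.

```python
# Prompt pieces copied from the widget examples in this model card.
PROMPT_PREFIX = "Generate Question, Answer pair correspond to the following research paper."
PROMPT_SUFFIX = "Question, Answer:"


def build_prompt(abstract: str, introduction: str) -> str:
    """Assemble one input string in the widget layout.

    Assumption: the parts are joined by single spaces, because the widget
    text is a YAML folded scalar (newlines fold into spaces).
    """
    return " ".join([
        PROMPT_PREFIX,
        f"[Abstract] {abstract.strip()}",
        f"[Introduction] {introduction.strip()}",
        PROMPT_SUFFIX,
    ])


def parse_qa(decoded: str) -> tuple[str, str]:
    """Split a decoded output of the form 'question[SEP] answer'."""
    question, sep, answer = decoded.partition("[SEP]")
    if not sep:
        raise ValueError("no [SEP] marker found in model output")
    return question.strip(), answer.strip()


def generate_qa(model_id: str, abstract: str, introduction: str,
                max_new_tokens: int = 128) -> list[tuple[str, str]]:
    """Hypothetical helper: run generation locally and parse each output."""
    # Imported lazily so the pure-Python helpers above work without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Assumption: the checkpoint is an encoder-decoder (seq2seq) model.
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(build_prompt(abstract, introduction),
                       return_tensors="pt", truncation=True)
    # Beam search with 4 beams is an arbitrary choice, not a documented setting.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    # Assumption: "[SEP]" survives decoding as plain text, as in the sample output.
    decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [parse_qa(text) for text in decoded]
```

Calling `generate_qa("<this-model-id>", abstract, introduction)` would return a list of `(question, answer)` tuples. If `[SEP]` turns out to be registered as a special token in the tokenizer, decoding with `skip_special_tokens=False` may be needed for `parse_qa` to find it.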
513
+
514
+
515
 
516
  ## Training and evaluation data
517
  - Used Dataset: [UNIST-Eunchan/NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset.