Research with Silma Model: Challenges and Request for Community Support
Hello!
I am currently conducting a research project involving the Silma model, exploring its potential for fine-tuning on a bilingual dataset (Arabic/English). My dataset consists of 10,000 question-answer pairs, focusing on mixed content that requires context understanding in both languages. The ultimate goal is to build a robust model for answering diverse and domain-specific queries, with a specific emphasis on structured and conversational question-answering tasks.
While working on this project, I’ve faced a few challenges and would greatly appreciate insights from the community:
Challenges Faced
Fine-Tuning with QLoRA: I have fine-tuned Silma using the QLoRA method (8-bit quantization), but I noticed the model doesn’t seem to reflect the fine-tuning data effectively during inference. Despite following standard protocols (ensuring proper formatting, templates, etc.), the model’s responses remain too generic and fail to utilize the dataset’s specifics.
Merging Adapter with Base Model: I’ve attempted to merge LoRA adapters into the base model using merge_and_unload for efficient inference on my local machine (limited by a GTX 1050). However, the merged model doesn’t seem to incorporate fine-tuning knowledge. For example, questions directly related to the training data often produce irrelevant or empty responses.
Insights from a Tweet on Merging: Recently, I came across a tweet suggesting an approach for better merged models:
Quantize the base model → Merge adapters → Convert back to FP16/BF16.
However, this method is mainly designed for LLaMA-based models. I’m wondering if this could be adapted to the Silma model.
If anyone has tried something similar, I’d love to hear about your results and process.
Efficient Inference: Given the limitations of my hardware, I need to run the model locally. While I’ve quantized the base model (e.g., 4-bit), I still face challenges ensuring quality responses post-merging. I’ve tested inference pipelines, but they don't seem to work seamlessly with Silma.
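To put the hardware constraint in numbers, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions. The parameter count below is an assumption for illustration only (the thread does not state it), and the estimate ignores activations, the KV cache, and quantization overhead.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


# Hypothetical parameter count, not taken from the thread.
ASSUMED_PARAMS = 9e9

for bits in (16, 8, 4):
    gib = weight_memory_gib(ASSUMED_PARAMS, bits)
    print(f"{bits}-bit weights: about {gib:.1f} GiB")
```

Comparing these figures with the VRAM of the target card gives a first idea of whether a given precision can fit at all, or whether CPU offloading or a GGUF runtime is the more realistic route.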
Request for the Community
I would appreciate your guidance and contributions on the following:
QLoRA Fine-Tuning Snippets:
Any suggestions or code snippets for optimizing fine-tuning using QLoRA for mixed Arabic-English datasets?
Efficient Merging Techniques:
Have you successfully merged adapters with the base model for Silma or similar models? How did you ensure the fine-tuned knowledge was retained?
Adapting the Tweet's Method:
Do you think the method outlined in the tweet (quantization → merging → dequantization) could work with Silma? Has anyone tried this with non-LLaMA models?
General Advice on Inference:
Any tips for improving inference quality, especially on low-resource hardware, without compromising the fine-tuning results?
Why This Matters
The Silma model holds great promise for bilingual and contextually rich tasks. I believe that a collaborative effort to refine fine-tuning and merging techniques can significantly enhance its usability for research and real-world applications.
Looking forward to your feedback, suggestions, and possibly sharing your experiments and insights! Let’s make Silma even better together 😊.
Hello Ahmed, it could be a fine-tuning issue ... could you please share the code you are using + the fine-tuning parameters + sample of the data?
Thank you for your response. The following is the fine-tuning code, along with the 10k dataset used during training:
code: https://www.kaggle.com/code/justtestingsde/aou-arabic-english-silma-training
dataset: https://huggingface.co/datasets/astroa7m/AOU_DATA
I have looked into the code and made some changes to get the right results. Please download the new notebook from the link below (it will be removed later).
In general, I think the main problem was around merging the adapter. In addition, please note that SILMA is based on Gemma, which does not support system messages. Other than that, you could tune several parameters to get better results, such as increasing the batch size and gradient accumulation.
One more tip: it is always better to use high-level open-source libraries such as autotrain, llama factory or unsloth so you don't run into such problems.
https://drive.google.com/file/d/1Q5XHGXrdbnDJCty25vw7_XGbLgbwE7YT/view?usp=sharing
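For readers who cannot access the notebook, here is a minimal sketch of one common way to merge a LoRA adapter: load the base model unquantized in half precision, attach the adapter, then call merge_and_unload. This is not the code from the notebook above. The function name, its parameters, and the default dtype are assumptions for illustration, and it assumes the tokenizer was not modified during fine-tuning.

```python
def merge_adapter_half_precision(base_id, adapter_dir, out_dir, dtype_name="bfloat16"):
    """Merge a LoRA adapter into an unquantized copy of the base model.

    base_id, adapter_dir and out_dir are placeholders supplied by the caller.
    """
    # Imports are local so the sketch can be read without the full GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base weights WITHOUT load_in_8bit / load_in_4bit, so the merged
    # checkpoint contains plain half-precision tensors.
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=getattr(torch, dtype_name)
    )

    # Attach the trained adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()

    # Save the merged model together with the (assumed unchanged) tokenizer.
    merged.save_pretrained(out_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
    return out_dir
```

Merging on the CPU is slow but avoids GPU memory limits. If quantization is needed afterwards, it can be applied when loading the merged checkpoint rather than before merging.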
I appreciate the time you personally spent on resolving the issue, Karim. I am currently running the code from the notebook you provided.
Sorry to bother you again, but I have a couple of questions:
- The code merges the original model with the fine-tuned adapter, so can I use the model directly on my local machine without any quantization?
- If quantization is required, wouldn't that significantly degrade the performance of the model?
- I will keep autotrain and llama factory in mind. I just did not know I could use the unsloth library with Silma.
- I see the special tokens have changed after commenting out the line
model, tokenizer = setup_chat_format(model, tokenizer)
Is this how we use the model's default special tokens instead of the predefined ones in the setup_chat_format function?
- Besides the given tip, could you kindly share some resources that you find helpful for fine-tuning efficiently, especially with models based on Gemma like Silma?
All of that is only a first step to get the cycle going; my organization is currently planning to purchase a workstation that meets the recommended specifications for Silma, or possibly for a full fine-tuning approach.
I once again appreciate your kindness and cooperation.
1- I think you might need to quantize while loading, if you need to do so at all
2- Slightly ... the main issue you had was related to merging and fine-tuning params
4- setup_chat_format is not a good fit for Gemma since it adds "<|im_start|>" instead of "<start_of_turn>", so it does not produce the correct format for Gemma. Here are some links:
https://huggingface.co/google/gemma-2-2b-it
https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-gemma-0444d46d821c
https://github.com/vwxyzjn/trl/blob/main/trl/models/utils.py#L40
5- Here you are
https://github.com/google-gemini/gemma-cookbook/tree/main/Gemma
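To make the two Gemma points from this thread concrete (no system role, and turn markers of the form <start_of_turn> / <end_of_turn> rather than ChatML's <|im_start|>), here is a small pure-Python sketch that renders a conversation in Gemma's turn format. The helper names are hypothetical, and folding a system message into the first user turn is one possible workaround, not an official rule. In practice the tokenizer's own apply_chat_template is the safer source of truth.

```python
def fold_system_message(messages):
    """Merge a leading system message into the first user turn."""
    if messages and messages[0]["role"] == "system":
        system_text = messages[0]["content"]
        rest = messages[1:]
        if rest and rest[0]["role"] == "user":
            merged = {
                "role": "user",
                "content": f"{system_text}\n\n{rest[0]['content']}",
            }
            return [merged] + rest[1:]
        # No user turn follows: present the system text as a user turn.
        return [{"role": "user", "content": system_text}] + rest
    return list(messages)


def render_gemma_prompt(messages, add_generation_prompt=True):
    """Render messages using Gemma-style turn markers."""
    parts = ["<bos>"]
    for message in fold_system_message(messages):
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(
            f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
        )
    if add_generation_prompt:
        # Open a model turn so generation continues as the answer.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "What is QLoRA?"},
]
print(render_gemma_prompt(example))
```

Training samples formatted this way end each answer with <end_of_turn>, which gives the model an explicit signal for where a reply stops.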
Thank you very much for your comprehensive response and kind help.
With your assistance, I have successfully fine-tuned Silma based on the code you provided, and the model has been merged and pushed to the hub.
However, I faced two issues:
- I tried to convert the fine-tuned model to GGUF format based on this notebook https://www.kaggle.com/code/justtestingsde/silam-to-gguf which I used before for Llama3, and I encountered the following error traceback:
Traceback (most recent call last):
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 4579, in <module>
    main()
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 4573, in main
    model_instance.write()
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 434, in write
    self.prepare_tensors()
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 298, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 2882, in modify_tensors
    return [(self.map_tensor_name(name), data_torch)]
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 214, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.SCB'
I can't really tell what the problem is here. Maybe it is a problem with the training or merging process? I tried converting the original Silma model using the same method and it worked, and I have also seen it used in gemma-notebook without issues.
My goal is to fine-tune, quantize, convert to GGUF, and run the model locally with ollama.
- During inference, once the model has answered the question and has not yet reached the maximum sequence length, it keeps generating new questions and answering them by itself until it reaches that length. How can I control this? How can I make the model stop once the answer is provided? I have read that generation should stop when a special stop token is encountered, but in my case it doesn't. Is there something I should look into?
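One hedged way to handle the runaway generation described above is to cut the decoded text at the first end-of-turn marker. The sketch below is a post-processing workaround, not a fix for the root cause. The marker strings are assumptions based on Gemma's chat format, and the function name is hypothetical.

```python
def truncate_at_stop(text, stop_markers=("<end_of_turn>", "<eos>")):
    """Return the text up to the earliest stop marker, if any is present."""
    cut = len(text)
    for marker in stop_markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    # Drop trailing whitespace left in front of the marker.
    return text[:cut].rstrip()


raw = "Admissions open in September.<end_of_turn>\n<start_of_turn>user\nAnd fees?"
print(truncate_at_stop(raw))
```

With the transformers generate method, the eos_token_id argument accepts a list of ids, so the id of the end-of-turn token (for example obtained with tokenizer.convert_tokens_to_ids) can be passed alongside the regular EOS id to stop generation earlier. If the training samples never ended with such a token, the model may not have learned to emit it, in which case the data formatting is worth checking first.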
I am perplexed. I have been looking into this for the past week, and I think maybe I should retrain or remerge? I honestly don't know.
I executed the same code without any changes. Could it be a quantization problem?
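On the quantization question: the tensor name in the traceback ends in ".SCB", which is the kind of extra tensor that bitsandbytes 8-bit layers store next to their weights. That suggests, as an assumption rather than a confirmed diagnosis, that the merged checkpoint was saved from a model loaded in 8-bit. A quick heuristic check over the tensor names of a checkpoint could look like this. The suffix list and the function name are illustrative, not exhaustive.

```python
# Suffixes assumed to indicate bitsandbytes 8-bit leftovers (heuristic list).
QUANT_SUFFIXES = (".SCB", ".weight_format")


def find_quantization_leftovers(tensor_names):
    """Return tensor names that look like quantization side tensors."""
    return [name for name in tensor_names if name.endswith(QUANT_SUFFIXES)]


names = [
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.SCB",
]
print(find_quantization_leftovers(names))
```

For a sharded Hugging Face checkpoint, the tensor names are the keys of the "weight_map" entry in model.safetensors.index.json. If the check reports leftovers, merging the adapter into a half-precision copy of the base model and converting that checkpoint to GGUF is a reasonable next thing to try.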
I will play around locally with GGUF of Silma until I find a solution for this.
I would like to thank you once again for your invaluable guidance and unwavering support.
Thanks
My tip to you again is to use high-level training libraries instead of custom code, so you can focus on the data and experiments instead of code mistakes and issues.
I will focus on that as you said, hopefully I get the desired result eventually. Appreciate it, Karim.