
QuantFactory/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF

This is a quantized version of INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0, created using llama.cpp.

Original Model Card

INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0


INSAIT introduces BgGPT-Gemma-2-2.6B-IT-v1.0, a state-of-the-art Bulgarian language model based on google/gemma-2-2b and google/gemma-2-2b-it. BgGPT-Gemma-2-2.6B-IT-v1.0 is free to use and distributed under the Gemma Terms of Use. This model was created by INSAIT, part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

Model description

The model was built on top of Google’s Gemma 2 2B open models. It was continually pre-trained on around 100 billion tokens (85 billion of them in Bulgarian) using the Branch-and-Merge strategy that INSAIT presented at EMNLP’24, which allowed the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance. During pre-training, we used various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by INSAIT, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset built from real-world conversations. For more information, see our blog post.
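Roughly, Branch-and-Merge repeatedly copies the current model into several branches, trains each branch on a different subset of the data, and merges the branches back into one model that seeds the next round. The toy sketch below only illustrates that loop; it is not INSAIT's training code. Each "model" is a plain dict of float parameters, `toy_train` is a hypothetical stand-in for real training, and the merge step assumes plain linear averaging of parameters, which is one common merging rule (the exact merge used is described in the EMNLP'24 paper and may differ).

```python
from typing import Callable, Dict, List

Params = Dict[str, float]


def linear_merge(branches: List[Params]) -> Params:
    """Average each parameter across the branch models (assumed merge rule)."""
    return {name: sum(b[name] for b in branches) / len(branches) for name in branches[0]}


def branch_and_merge(
    base: Params,
    data_slices: List[List[float]],
    n_branches: int,
    train: Callable[[Params, List[float]], Params],
) -> Params:
    """For each data slice: branch the current model, train branches, merge."""
    model = dict(base)
    for data in data_slices:
        # Split this round's data into n_branches disjoint subsets.
        subsets = [data[i::n_branches] for i in range(n_branches)]
        # Every branch starts from a copy of the current merged model.
        branches = [train(dict(model), subset) for subset in subsets]
        model = linear_merge(branches)
    return model


def toy_train(params: Params, subset: List[float], lr: float = 0.5) -> Params:
    """Hypothetical stand-in for training: nudge 'w' toward the subset mean."""
    target = sum(subset) / len(subset)
    params["w"] += lr * (target - params["w"])
    return params


final = branch_and_merge({"w": 0.0}, [[1.0, 3.0], [2.0, 2.0]], n_branches=2, train=toy_train)
print(final)  # → {'w': 1.5}
```

Training the branches independently lets them run in parallel, and merging after each round keeps any single branch from drifting too far from the shared starting point.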

Benchmarks and Results

[Figures: benchmark result graphs]

We evaluate our models on a set of standard English benchmarks, their Bulgarian translations, and Bulgarian-specific benchmarks that we collected:

  • Winogrande challenge: testing world knowledge and understanding
  • HellaSwag: testing sentence completion
  • ARC Easy/Challenge: testing logical reasoning
  • TriviaQA: testing trivia knowledge
  • GSM8K: solving grade-school math word problems
  • Exams: solving high-school problems from the natural and social sciences
  • MON: exams across various subjects for grades 4 to 12

These benchmarks test logical reasoning, mathematics, knowledge, language understanding and other skills, and are available at https://github.com/insait-institute/lm-evaluation-harness-bg. The graphs above compare BgGPT 2.6B with other small open language models such as Microsoft's Phi 3.5 and Alibaba's Qwen 2.5 3B. BgGPT not only outperforms them but also retains the English performance inherited from the original Google Gemma 2 models on which it is based.

Use in 🤗 Transformers

First, install the latest version of the transformers library:

pip install -U 'transformers[torch]'

Then load the model in transformers:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # eager attention, as recommended for Gemma 2 models
    device_map="auto",
)

Recommended Parameters

For best results, we recommend the following text-generation parameters, with which we have tested the model extensively:

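from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,     # maximum number of generated tokens; adjust as needed
    do_sample=True,          # sampling must be enabled for temperature/top_k/top_p to apply
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],   # Gemma's <eos> (1) and <end_of_turn> (107)
)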
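To make the roles of these parameters concrete, here is a simplified, dependency-free sketch of how one next-token choice could be filtered when sampling is enabled. It roughly follows the order in which transformers applies these steps (repetition penalty, then temperature, then top-k, then top-p), but it is an illustration rather than the library's implementation, and the token ids and logits in the usage lines are made up.

```python
import math
import random
from typing import Dict, List


def next_token_distribution(
    logits: Dict[int, float],
    generated: List[int],
    temperature: float = 0.1,
    top_k: int = 25,
    top_p: float = 1.0,
    repetition_penalty: float = 1.1,
) -> Dict[int, float]:
    """Return the filtered probability distribution over candidate token ids."""
    scores = dict(logits)
    # Repetition penalty (CTRL-style): tokens already in the sequence become less likely.
    for tok in set(generated):
        if tok in scores:
            s = scores[tok]
            scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty
    # Temperature: values below 1 sharpen the distribution toward the top token.
    scores = {t: s / temperature for t, s in scores.items()}
    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    best = kept[0][1]
    exps = [(t, math.exp(s - best)) for t, s in kept]
    total = sum(e for _, e in exps)
    probs = [(t, e / total) for t, e in exps]
    # Top-p (nucleus): keep the smallest prefix whose probability mass reaches top_p.
    nucleus, mass = [], 0.0
    for t, p in probs:
        nucleus.append((t, p))
        mass += p
        if mass >= top_p:
            break
    norm = sum(p for _, p in nucleus)
    return {t: p / norm for t, p in nucleus}


def sample(dist: Dict[int, float], rng: random.Random) -> int:
    """Draw one token id from a filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


dist = next_token_distribution({10: 2.0, 11: 1.9, 12: -1.0}, generated=[10])
print(max(dist, key=dist.get))  # → 11: the penalty on token 10 flips the ranking
```

With top_p=1 the nucleus step keeps every token that survived top-k, so in the recommended setup the low temperature and top_k=25 do most of the filtering.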

Higher temperatures should, in principle, also work adequately.

Instruction format

To benefit from the instruction fine-tuning, your prompt should begin with the beginning-of-sequence token <bos> and follow the Gemma 2 chat template. <bos> should appear only once, as the first token of a chat sequence.

For example:

<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model
 
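The same layout can be produced with a small string builder, for example when sending raw prompts to a runtime that does not apply the chat template for you. This is a sketch based only on the format shown above; the function name is hypothetical, and it assumes Gemma 2's convention that the assistant role is called "model" (an "assistant" role is mapped to it, and other roles are rejected).

```python
from typing import Dict, List


def build_gemma2_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Format a conversation in the Gemma 2 chat layout (hypothetical helper)."""
    parts = ["<bos>"]  # <bos> appears exactly once, at the very start
    for msg in messages:
        role = "model" if msg["role"] == "assistant" else msg["role"]
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {msg['role']!r}")
        parts.append(f"<start_of_turn>{role}\n{msg['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so that generation continues with the answer.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(build_gemma2_prompt([{"role": "user", "content": "Кога е основан Софийският университет?"}]))
```

If you tokenize such a string with a Hugging Face tokenizer, pass add_special_tokens=False so that a second <bos> is not prepended.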

This format is also available as a chat template via the apply_chat_template() method:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    use_default_system_prompt=False,
)

messages = [
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]

# With return_dict=True this returns both input_ids and attention_mask
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    generation_config=generation_params,
)
print(tokenizer.decode(outputs[0]))
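tokenizer.decode(outputs[0]) returns the prompt together with the answer and keeps special tokens, so you may want to extract only the model's reply. A minimal sketch, assuming the decoded text follows the Gemma 2 layout shown earlier (the helper name is hypothetical):

```python
def extract_last_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded Gemma 2 sequence."""
    marker = "<start_of_turn>model\n"
    start = decoded.rfind(marker)
    if start == -1:
        raise ValueError("no model turn found in the decoded text")
    reply = decoded[start + len(marker):]
    # Generation stops at <end_of_turn> (id 107) or <eos> (id 1); cut at either.
    for stop in ("<end_of_turn>", "<eos>"):
        cut = reply.find(stop)
        if cut != -1:
            reply = reply[:cut]
    return reply.strip()
```

Alternatively, slice off the prompt tokens from the generated sequence before decoding and pass skip_special_tokens=True to tokenizer.decode.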

Important note: models based on Gemma 2, such as BgGPT-Gemma-2-2.6B-IT-v1.0, do not support flash attention; using it results in degraded performance. Load the model with attn_implementation="eager", as in the loading example above.

Use with GGML / llama.cpp

The model in GGUF format, along with usage instructions, is available at INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF.
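Since this repository hosts GGUF files, one way to run them from Python is the llama-cpp-python bindings. The following is a sketch, not an official recipe. It assumes llama-cpp-python and huggingface_hub are installed, that a file matching the pattern *Q4_K_M.gguf exists in QuantFactory/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF (check the repository's file list for the actual names), and that the chat template stored in the GGUF metadata is used. The sampling values mirror the recommended transformers parameters, but llama.cpp applies its repeat penalty over a window of recent tokens, so behaviour will be close rather than identical.

```python
from typing import Any, Dict

# Recommended values from this card, in transformers naming.
RECOMMENDED = {
    "temperature": 0.1,
    "top_k": 25,
    "top_p": 1.0,
    "repetition_penalty": 1.1,
    "max_new_tokens": 2048,
}


def to_llama_cpp_kwargs(params: Dict[str, Any]) -> Dict[str, Any]:
    """Rename transformers-style settings to create_chat_completion() keywords."""
    return {
        "temperature": params["temperature"],
        "top_k": params["top_k"],
        "top_p": params["top_p"],
        "repeat_penalty": params["repetition_penalty"],
        "max_tokens": params["max_new_tokens"],
    }


def ask(question: str) -> str:
    """Download a quantized file (assumed name pattern) and answer one question."""
    from llama_cpp import Llama  # imported lazily; only needed when actually running

    llm = Llama.from_pretrained(
        repo_id="QuantFactory/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF",
        filename="*Q4_K_M.gguf",  # assumption: choose the quantization you need
        n_ctx=4096,
        verbose=False,
    )
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        **to_llama_cpp_kwargs(RECOMMENDED),
    )
    return result["choices"][0]["message"]["content"]
```

Calling ask("Кога е основан Софийският университет?") downloads the matching file on first use and returns the model's answer as a string.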

Community Feedback

We welcome feedback from the community to help improve BgGPT. If you have suggestions, encounter any issues, or have ideas for improvements, please:

  • Share your experience using the model through Hugging Face's community discussion feature or
  • Contact us at [email protected]

Your real-world usage and insights are valuable in helping us optimize the model's performance and behaviour for various use cases.

Summary

Format: GGUF
Model size: 2.61B params
Architecture: gemma2
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit


Model tree for QuantFactory/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF

Base model: google/gemma-2-2b (this model is one of its quantized derivatives)