Llama-3.2 1B 4-bit Quantized Model

Model Overview

  • Base Model: meta-llama/Llama-3.2-1B
  • Model Name: rautaditya/llama-3.2-1b-4bit-gptq
  • Quantization: 4-bit GPTQ (post-training quantization)

Model Description

This is a 4-bit quantized version of the Llama-3.2 1B model, designed to reduce model size and inference latency while maintaining reasonable output quality. The quantization makes the model easier to deploy in resource-constrained environments.

Key Features

  • Reduced model size
  • Faster inference times
  • Compatible with Hugging Face Transformers
  • GPTQ quantization for optimal compression

Quantization Details

  • Quantization Method: GPTQ, a post-training quantization technique for generative pre-trained transformers
  • Bit Depth: 4-bit
  • Base Model: Llama-3.2 1B
  • Quantization Library: AutoGPTQ (see the sketch below)
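
For reference, producing a checkpoint like this with AutoGPTQ looks roughly like the sketch below. The calibration text and GPTQ hyperparameters (group size, activation ordering) are illustrative assumptions, not the exact settings used for this model.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Illustrative 4-bit GPTQ settings; the group size actually used for this
# checkpoint is not documented here.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Load the full-precision base model and quantize it on a small calibration set.
# Real runs use a few hundred representative samples, not a single sentence.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized("llama-3.2-1b-4bit-gptq")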

Installation Requirements

pip install transformers accelerate optimum auto-gptq torch

Usage

Transformers Pipeline

from transformers import AutoTokenizer, pipeline

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" places the model on GPU when one is available.
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

prompt = "What is the meaning of life?"
# max_length counts the prompt tokens as well as the generated ones.
outputs = pipe(prompt, max_length=100)
print(outputs[0]["generated_text"])
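
Generation can be tuned through the standard transformers generation arguments. The values below are illustrative, not settings recommended specifically for this checkpoint.

# Sampling-based generation with an explicit cap on newly generated tokens.
outputs = pipe(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(outputs[0]["generated_text"])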

Direct Model Loading

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# from_quantized loads an already-quantized GPTQ checkpoint;
# from_pretrained expects an unquantized model plus a quantize_config.
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map="auto",
)
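
Once loaded, the model is used through the usual generate API. A minimal sketch, assuming a GPU is available and reusing the tokenizer and model from above:

import torch

prompt = "What is the meaning of life?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding capped at 100 new tokens; enable sampling for more varied output.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))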

Performance Considerations

  • Memory Efficiency: Significantly reduced memory footprint compared to the full-precision model (see the measurement sketch after this list)
  • Inference Speed: Faster inference due to reduced compute and memory-bandwidth requirements
  • Potential Accuracy Trade-off: Minor quality degradation compared to the full-precision model
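
To check the savings on your own hardware, you can compare the reported footprints of the quantized and full-precision models. A rough sketch, assuming the transformers GPTQ integration (optimum plus auto-gptq) is installed and the gated base model is accessible:

from transformers import AutoModelForCausalLM

# Quantized checkpoint (4-bit GPTQ weights).
quantized = AutoModelForCausalLM.from_pretrained(
    "rautaditya/llama-3.2-1b-4bit-gptq", device_map="auto"
)
print(f"Quantized: {quantized.get_memory_footprint() / 1e9:.2f} GB")

# Full-precision base model for comparison.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", device_map="auto"
)
print(f"Full precision: {base.get_memory_footprint() / 1e9:.2f} GB")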

Limitations

  • May show slight differences in output quality compared to the original model
  • Performance can vary based on specific use case and inference environment

Recommended Use Cases

  • Low-resource environments
  • Edge computing
  • Mobile applications
  • Embedded systems
  • Rapid prototyping

License

This model inherits the license of the base model. Please refer to the Meta Llama 3.2 Community License for usage restrictions and permissions.

Citation

If you use this model, please cite:

@misc{llama3.2_4bit_quantized,
  title={Llama-3.2 1B 4-bit Quantized Model},
  author={Raut, Aditya},
  year={2024},
  publisher={Hugging Face}
}

Contributions and Feedback

  • Open to suggestions and improvements
  • Please file issues on the GitHub repository for any bugs or performance concerns

Acknowledgments

  • Meta AI for the base Llama-3.2 model
  • Hugging Face Transformers team
  • AutoGPTQ library contributors