File size: 11,216 Bytes
752d436 bece9cd 399c9b8 752d436 16184aa 752d436 16184aa 752d436 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 |
---
language:
- en
license_name: intel-research-use-license
license_link: LICENSE.md
base_model: google/gemma-7b-it
tags:
- LLM
- Intel
model-index:
- name: llava-gemma-7b
results:
- task:
type: Large Language Model
name: Large Language Model
metrics:
- type: GQA
name: GQA
value: 0.472
- type: MME Cog.
name: MME Cog.
value: 254
- type: MME Per.
name: MME Per.
value: 895
- type: MM-Vet
name: MM-Vet
value: 18.2
- type: POPE Acc.
name: POPE Acc.
value: 0.848
- type: POPE F1
name: POPE F1
value: 0.829
- type: VQAv2
name: VQAv2
value: 68.7
- type: MMVP
name: MMVP
value: 0.327
- type: ScienceQA Image
name: ScienceQA Image
value: 0.625
library_name: transformers
pipeline_tag: image-text-to-text
---
## Model Details: LLaVA-Gemma-7b
`llava-gemma-7b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744) with the 7-billion parameter [google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) model as language backbone and the CLIP-based vision encoder.
**_NOTE:_** As of 06/03/2024, we have not yet converted the weights of this model to the HuggingFace LLaVA format. This model card will be updated when we do.
| Model Details | Description |
| ----------- | ----------- |
| Authors | Intel: [Musashi Hinck](https://huggingface.co/musashihinck), [Matthew Olson](https://huggingface.co/matthewlyleolson), [David Cobbley](https://huggingface.co/djcobble), [Shao-Yen Tseng](https://huggingface.co/shaoyent), [Vasudev Lal](https://huggingface.co/vasudevlal) |
| Date | March 2024 |
| Version | 1 |
| Type | Large multimodal model (LMM) |
| Paper or Other Resources | [LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model](https://arxiv.org/abs/2404.01331) |
| License | [Gemma](https://ai.google.dev/gemma/terms) |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/llava-gemma-7b/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ)|
This model card was created by [Benjamin Consolvo](https://huggingface.co/bconsolvo) and the authors listed above.
## Intended Use
| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot. |
| Primary intended users | Anyone using or evaluating multimodal models. |
| Out-of-scope uses | This model is not intended for uses that require high levels of factuality, high stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, any use that could lead to the violation of a human right under the UN Declaration of Human Rights. |
### How to use
Currently, using `llava-gemma` requires a [modified preprocessor](./processing_llavagemma.py). _We are currently working on modifying the `LlavaProcessor` class to streamline usage (see [PR #30030](https://github.com/huggingface/transformers/pull/30030)). Expect updates soon._
For current usage, see [`usage.py`](./usage.py) or the following code block:
```python
import requests
from PIL import Image
from transformers import (
LlavaForConditionalGeneration,
AutoTokenizer,
CLIPImageProcessor
)
from processing_llavagemma import LlavaGemmaProcessor # This is in this repo
checkpoint = "Intel/llava-gemma-7b"
# Load model
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
tokenizer=AutoTokenizer.from_pretrained(checkpoint),
image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
)
# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
[{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
tokenize=False,
add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
For straightforward use as a chatbot (without images), you can modify the last portion of code to the following:
```python
# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
[{'role': 'user', 'content': "Summarize the following paragraph? In this paper, we introduced LLaVA-Gemma, a compact vision-language model leveraging the Gemma Large Language Model in two variants, Gemma-2B and Gemma-7B. Our work provides a unique opportunity for researchers to explore the trade-offs between computational efficiency and multimodal understanding in small-scale models. The availability of both variants allows for a comparative analysis that sheds light on how model size impacts performance in various tasks. Our evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential as a benchmark for future research in small-scale vision-language models. With these models, future practitioners can optimize the performance of small-scale multimodal models more directly."}],
tokenize=False,
add_generation_prompt=True
)
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=None, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_length=300)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
## Factors
| Factors | Description |
| ----------- | ----------- |
| Groups | - |
| Instrumentation | - |
| Environment | Trained for 4 hours on 8 Intel Gaudi 2 AI accelerators. |
| Card Prompts | Model training and deployment on alternate hardware and software will change model performance |
## Metrics
| Metrics | Description |
| ----------- | ----------- |
| Model performance measures | We evaluate the LlaVA-Gemma models on a similar collection of benchmarks to other LMM works: GQA; MME; MM-Vet; POPE (accuracy and F1); VQAv2; MMVP; the image subset of ScienceQA. Our experiments provide insights into the efficacy of various design choices within the LLaVA framework. |
| Decision thresholds | - |
| Approaches to uncertainty and variability | - |
## Training Data
The model was trained using the LLaVA-v1.5 data mixture. This is listed as follows:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
## Quantitative Analyses
Performance of LLaVA-Gemma models across seven benchmarks. Highlighted box indicates strongest performance amongst LLaVA-Gemma models. Bottom two rows show self-reported performance of Llava Phi-2 and LLaVA-v1.5 respectively. The bolded **gemma-7b-it** is the current model used here in this model card.
| LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | ScienceQA Image | MMVP |
| ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | --------------- | ----- |
| gemma-2b-it | CLIP | Yes | 0.531 | 236 | 1130 | 17.7 | 0.850 |<mark>0.839</mark>| 70.65 | 0.564 | 0.287 |
| gemma-2b-it | CLIP | No | 0.481 | 248 | 935 | 13.1 | 0.784 | 0.762 | 61.74 | 0.549 | 0.180 |
| gemma-2b-it | DinoV2 | Yes |<mark>0.587</mark>| 307| <mark>1133</mark> |<mark>19.1</mark>| <mark>0.853</mark> | 0.838 |<mark>71.37</mark>| 0.555 | 0.227 |
| gemma-2b-it | DinoV2 | No | 0.501 | <mark>309</mark>| 959 | 14.5 | 0.793 | 0.772 | 61.65 | 0.568 | 0.180 |
| | | | | | | | | | | | |
| **gemma-7b-it** | CLIP | Yes | 0.472 | 253 | 895 | 18.2 | 0.848 | 0.829 | 68.7 | 0.625 | <mark>0.327</mark> |
| gemma-7b-it | CLIP | No | 0.472 | 278 | 857 | 19.1 | 0.782 | 0.734 | 65.1 | <mark>0.636</mark> | 0.240 |
| gemma-7b-it | DinoV2 | Yes | 0.519 | 257 | 1021 | 14.3 | 0.794 | 0.762 | 65.2 | 0.628 | <mark>0.327</mark> |
| gemma-7b-it | DinoV2 | No | 0.459 | 226 | 771 | 12.2 | 0.693 | 0.567 | 57.4 | 0.598 | 0.267 |
| | | | | | | | | | | | |
| Phi-2b | CLIP | Yes | - | - | 1335 | 28.9 | - | 0.850 | 71.4 | 0.684 | - |
| Llama-2-7b | CLIP | Yes | 0.620 | 348 | 1511 | 30.6 | 0.850 | 0.859 | 78.5 | 0.704 | 46.1 |
## Ethical Considerations
Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
| Ethical Considerations | Description |
| ----------- | ----------- |
| Data | The model was trained using the LLaVA-v1.5 data mixture as described above. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
| Use cases | - |
## Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
## Citation details
```bibtex
@misc{hinck2024llavagemma,
title={LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model},
author={Musashi Hinck and Matthew L. Olson and David Cobbley and Shao-Yen Tseng and Vasudev Lal},
year={2024},
eprint={2404.01331},
url={https://arxiv.org/abs/2404.01331},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` |