---
license: gpl-3.0
pipeline_tag: image-text-to-text
---
# LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

[![arXiv](https://img.shields.io/badge/arXiv-2501.03895-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.03895)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20huggingface%20-llava--mini--llama--3.1--8b-orange.svg)](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)

> **[Shaolei Zhang](https://zhangshaolei1998.github.io/), [Qingkai Fang](https://fangqingkai.github.io/), [Zhe Yang](https://nlp.ict.ac.cn/yjdw/xs/ssyjs/202210/t20221020_52708.html), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**


LLaVA-Mini is a unified large multimodal model that supports the understanding of images, high-resolution images, and videos in an efficient manner. Guided by an interpretability analysis of how LMMs process visual input, LLaVA-Mini significantly improves efficiency while preserving vision capabilities. [Code](https://github.com/ictnlp/LLaVA-Mini), [model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b) and [demo](https://github.com/ictnlp/LLaVA-Mini#-demo) of LLaVA-Mini are available now!

Refer to our [GitHub repo](https://github.com/ictnlp/LLaVA-Mini) for details of LLaVA-Mini!

> [!Note]
> LLaVA-Mini requires only **1 token** to represent each image, which improves the efficiency of image and video understanding:
> - **Computation**: 77% FLOPs reduction
> - **Response latency**: reduced from 100 milliseconds to 40 milliseconds
> - **VRAM usage**: reduced from 360 MB/image to 0.6 MB/image, enabling 3-hour video processing


<p align="center" width="100%">
<img src="./assets/performance.png" alt="performance" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

💡**Highlights**:
1. **Good Performance**: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (a compression rate of 0.17%).
2. **High Efficiency**: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and can process over 10,000 frames of video on GPU hardware with 24 GB of memory (see the quick check after this list).
3. **Insights**: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conducted a preliminary analysis of how large multimodal models (LMMs) process visual tokens. Please refer to our [paper](https://arxiv.org/pdf/2501.03895) for the detailed analysis and our conclusions.
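
These figures are easy to sanity-check. The snippet below is illustrative only; the ~1 fps video sampling rate is our assumption, not a number stated on this card:

```python
# Back-of-envelope check of the efficiency figures above (illustrative only).
per_image_mb = 0.6               # reported VRAM per image with 1 vision token
frames_3h = 3 * 60 * 60          # 10,800 frames, assuming ~1 fps sampling
print(frames_3h * per_image_mb / 1024)  # ~6.3 GB -> a 3-hour video fits in 24 GB
print(10_000 * per_image_mb / 1024)     # ~5.9 GB -> >10,000 frames also fit
print(360 / 576)                        # ~0.625 MB per token in the 576-token baseline
```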

## 🖥 Demo
<p align="center" width="100%">
<img src="./assets/llava_mini.gif" alt="llava_mini" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

- Download LLaVA-Mini model from [here](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b).

- Run the following scripts, then interact with LLaVA-Mini in your browser (served at `http://localhost:7860` with the ports used below):

  ```bash
  # Launch a controller
  python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &

  # Build the API of LLaVA-Mini
  CUDA_VISIBLE_DEVICES=0  python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &

  # Start the interactive interface
  python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload  --port 7860
  ```

## 🔥 Quick Start
### Requirements
- Install packages:

  ```bash
  # Clone the repo first: `pip install -e .` is run from the repo root
  git clone https://github.com/ictnlp/LLaVA-Mini.git
  cd LLaVA-Mini

  conda create -n llavamini python=3.10 -y
  conda activate llavamini
  pip install -e .
  pip install -e ".[train]"
  pip install flash-attn --no-build-isolation
  ```

### Command Interaction
- Image understanding, using `--image-file`:

  ```bash
  # Image Understanding
  CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
      --model-path  ICTNLP/llava-mini-llama-3.1-8b \
      --image-file llavamini/serve/examples/baby_cake.png \
      --conv-mode llava_llama_3_1 --model-name "llava-mini" \
      --query "What's the text on the cake?"
  ```

- Video understanding, using `--video-file` (a scripting helper covering both commands is sketched after this example):

  ```bash
  # Video Understanding
  CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
      --model-path  ICTNLP/llava-mini-llama-3.1-8b \
      --video-file llavamini/serve/examples/fifa.mp4 \
      --conv-mode llava_llama_3_1 --model-name "llava-mini" \
      --query "What happened in this video?"
  ```
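
Both commands share the same structure, so a thin wrapper can be handy for scripting. The sketch below is a hypothetical helper (the function name `ask_llava_mini` is ours); it only assembles the documented flags shown above:

```python
# Hypothetical helper that shells out to the documented CLI; it introduces
# nothing beyond the flags shown in the two examples above.
import subprocess

def ask_llava_mini(media_path: str, query: str, is_video: bool = False) -> None:
    flag = "--video-file" if is_video else "--image-file"
    subprocess.run(
        [
            "python", "llavamini/eval/run_llava_mini.py",
            "--model-path", "ICTNLP/llava-mini-llama-3.1-8b",
            flag, media_path,
            "--conv-mode", "llava_llama_3_1",
            "--model-name", "llava-mini",
            "--query", query,
        ],
        check=True,  # raise if the underlying script fails
    )

ask_llava_mini("llavamini/serve/examples/baby_cake.png", "What's the text on the cake?")
ask_llava_mini("llavamini/serve/examples/fifa.mp4", "What happened in this video?", is_video=True)
```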

### Reproduction and Evaluation

- Refer to [Evaluation.md](docs/Evaluation.md) for the evaluation of LLaVA-Mini on image/video benchmarks.

### Cases
- LLaVA-Mini achieves high-quality image and video understanding.

<p align="center" width="100%">
<img src="./assets/case1.png" alt="case1" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

<details>
<summary>More cases</summary>
<p align="center" width="100%">
<img src="./assets/case2.png" alt="case2" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

<p align="center" width="100%">
<img src="./assets/case3.png" alt="case3" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

<p align="center" width="100%">
<img src="./assets/case4.png" alt="case4" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

</details>

- LLaVA-Mini dynamically compresses each image to capture the important visual information (brighter areas are weighted more heavily during compression); a conceptual sketch of this weighting follows the figure below.

<p align="center" width="100%">
<img src="./assets/compression.png" alt="compression" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>
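
Conceptually, this kind of weighting can be realized with learned cross-attention pooling. The sketch below is a minimal illustration of the idea, not LLaVA-Mini's actual module (the class name, dimensions, and head count are all assumptions):

```python
# Minimal, illustrative sketch of attention-weighted token compression.
# All names and sizes here are assumptions, not LLaVA-Mini's actual code.
import torch
import torch.nn as nn

class VisionTokenCompressor(nn.Module):
    """Pool N patch tokens into a single vision token via learned cross-attention."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one learnable query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, 576, dim) from the vision encoder
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        # The attention weights act like the "brightness" map in the figure:
        # heavily weighted patches contribute more to the compressed token.
        compressed, _ = self.attn(q, patch_tokens, patch_tokens)
        return compressed                                   # (batch, 1, dim)

tokens = torch.randn(2, 576, 1024)                          # dummy patch tokens
print(VisionTokenCompressor()(tokens).shape)                # torch.Size([2, 1, 1024])
```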



## 🖋Citation

If you find this repository useful, please cite it as:

```bibtex
@misc{llavamini,
      title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token}, 
      author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
      year={2025},
      eprint={2501.03895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03895}, 
}
```

If you have any questions, please feel free to submit an issue or contact `[email protected]`.