---
license: apache-2.0
datasets:
- wchai/AuroraCap-trainset
base_model:
- lmsys/vicuna-7b-v1.5-16k
tags:
- caption
model-index:
- name: AuroraCap-7B
  results:
  - task:
      type: video detailed caption
    dataset:
      type: VDC
      name: VDC
    metrics:
    - type: Acc
      value: 38.21
      name: VDCScore
    - type: Acc
      value: 48.33
      name: VDD
    - type: cider
      value: 9.51
    - type: bleu
      value: 30.9
      name: bleu@1
    - type: bleu
      value: 4.06
      name: bleu@4
    - type: meteor
      value: 19.09
    - type: rouge
      value: 21.58
      name: rouge-l
  - task:
      type: video caption
    dataset:
      type: MSR-VTT
      name: MSR-VTT
    metrics:
    - type: cider
      value: 33.1
    - type: bleu
      value: 58.6
      name: bleu@1
    - type: bleu
      value: 21
      name: bleu@4
    - type: meteor
      value: 23.9
    - type: rouge
      value: 49.5
      name: rouge-l
  - task:
      type: video caption
    dataset:
      type: VATEX
      name: VATEX
    metrics:
    - type: cider
      value: 33.8
    - type: bleu
      value: 57.1
      name: bleu@1
    - type: bleu
      value: 18.4
      name: bleu@4
    - type: meteor
      value: 19
    - type: rouge
      value: 40.8
      name: rouge-l
  - task:
      type: video question answering
    dataset:
      type: ActivityNet
      name: ActivityNet
    metrics:
    - type: Acc
      value: 61.8
  - task:
      type: video question answering
    dataset:
      type: MSVD
      name: MSVD
    metrics:
    - type: Acc
      value: 62.6
  - task:
      type: video question answering
    dataset:
      type: MSR-VTT
      name: MSR-VTT
    metrics:
    - type: Acc
      value: 43.5
  - task:
      type: video question answering
    dataset:
      type: iVQA
      name: iVQA
    metrics:
    - type: Acc
      value: 55.2
pipeline_tag: video-text-to-text
---

<img src="assets/teaser.png" align="center">

## Resources

- [Website](https://rese1f.github.io/aurora-web/)
- [arXiv: Paper](https://arxiv.org/abs/2410.03051)
- [GitHub: Code](https://github.com/rese1f/aurora)
- [Huggingface: AuroraCap Model](https://huggingface.co/collections/Reself/auroracap-66d117ffe13bedda96702013)
- [Huggingface: VDC Benchmark](https://huggingface.co/datasets/Reself/Video-Detailed-Caption)
- [Huggingface: Trainset](https://huggingface.co/datasets/Reself/AuroraCap-trainset)
  
## Features

<img src="assets/assets_vdc_baseline.png" align="center">

AuroraCap is a multimodal large language model for image and video captioning. 

## Quick Start
See [Docs](https://github.com/rese1f/aurora/blob/main/docs/auroracap/README.md).

## FAQ

Q: Can token merging be used only during inference?

A: No. Our experiments show that token merging also accelerates training while maintaining similar performance. Token merging is not specific to AuroraCap either: it can be applied to other LLaVA-like models.
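
To make the idea concrete, here is a minimal sketch of one common formulation of token merging: bipartite matching on cosine similarity, followed by averaging the matched tokens. It is an illustration under stated assumptions, not AuroraCap's actual implementation. The function names `cosine` and `merge_tokens`, the alternating split into two sets, and the plain averaging are all hypothetical choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, r):
    """Merge up to r tokens into their most similar partner (illustrative only).

    tokens: list of feature vectors (lists of floats).
    r: maximum number of tokens to remove by merging.
    """
    # Split tokens into two alternating sets: A (even indices), B (odd indices).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]

    # For each token in A, find its most similar token in B.
    edges = []
    for i in a_idx:
        sim, j = max((cosine(tokens[i], tokens[j]), j) for j in b_idx)
        edges.append((sim, i, j))

    # Keep only the r highest-similarity edges.
    edges.sort(reverse=True)
    edges = edges[:r]

    # Accumulate merged A tokens into their B partners.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    merged = set()
    for _, i, j in edges:
        sums[j] = [x + y for x, y in zip(sums[j], tokens[i])]
        counts[j] += 1
        merged.add(i)

    # Rebuild the sequence in original order, dropping merged A tokens
    # and replacing each B token with the average of its group.
    out = []
    for k, tok in enumerate(tokens):
        if k in merged:
            continue
        if k in sums:
            out.append([x / counts[k] for x in sums[k]])
        else:
            out.append(list(tok))
    return out
```

Calling `merge_tokens(tokens, r)` on a sequence of visual tokens returns a shorter sequence, which is why the same operation can reduce cost in both training and inference.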

Q: Why do we provide both official LLaVA-format and Xtuner format weights for AuroraCap?

A: Xtuner can save checkpoints in multiple formats, but it currently supports continued training only from the Xtuner format. We therefore currently provide the model in the Xtuner format for both continued training and inference. In the future, we will also provide the model in the official LLaVA format for both training and inference, enabling quicker SGLang deployment and integration with the `transformers` library.

## Citation

```
@article{chai2024auroracap,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jenq-Neng Hwang and Saining Xie and Christopher D. Manning},
  journal={arXiv preprint arXiv:2410.03051},
  year={2024}
}
```