---
license: mit
inference: false
pipeline_tag: image-to-text
tags:
- image-captioning
---
# FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions
A framework designed to generate semantically rich image captions.
## Resources
- **Project Page**: For more details, visit the official [project page](https://rotsteinnoam.github.io/FuseCap/).
- **Read the Paper**: You can find the paper [here](https://arxiv.org/abs/2305.17718).
- **Demo**: Try out our BLIP-based model [demo](https://huggingface.co/spaces/noamrot/FuseCap) trained using FuseCap.
- **Code Repository**: The code for FuseCap can be found in the [GitHub repository](https://github.com/RotsteinNoam/FuseCap).
- **Datasets**: The fused captions datasets can be accessed from [here](https://github.com/RotsteinNoam/FuseCap#datasets).
## Running the model
Our BLIP-based model can be run using the following code:
```python
import requests
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Use a GPU when one is available; otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the processor and the FuseCap weights from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Download an example image and convert it to RGB.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Condition generation on a text prompt and decode with beam search.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)
out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
```
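Because generation is conditioned on the prompt `"a picture of "`, the decoded caption typically begins with that prompt. If only the descriptive part is needed, a small post-processing helper can remove it. The sketch below is an illustration, not part of the FuseCap codebase: `strip_prompt` is a hypothetical helper name, and it assumes the decoded text starts with the prompt, compared case-insensitively.

```python
def strip_prompt(caption: str, prompt: str = "a picture of ") -> str:
    """Return the caption without the leading prompt, if it is present.

    Hypothetical helper (not part of FuseCap). Assumes the decoded caption
    starts with the prompt text; the comparison ignores case and
    surrounding whitespace.
    """
    normalized_prompt = prompt.strip().lower()
    text = caption.strip()
    if normalized_prompt and text.lower().startswith(normalized_prompt):
        # Drop the prompt and any whitespace that followed it.
        text = text[len(normalized_prompt):].lstrip()
    return text


# Example with a made-up caption string (not actual model output).
print(strip_prompt("a picture of a red bike leaning on a wall"))
```

Captions that do not start with the prompt are returned unchanged apart from whitespace trimming, so the helper is safe to apply to every output.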
## Upcoming Updates
The official codebase, datasets and trained models for this project will be released soon.
## BibTeX
```bibtex
@inproceedings{rotstein2024fusecap,
title={Fusecap: Leveraging large language models for enriched fused image captions},
author={Rotstein, Noam and Bensa{\"\i}d, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={5689--5700},
year={2024}
}
```