|
---
license: mit
inference: false
pipeline_tag: image-to-text
tags:
- image-captioning
---
|
# FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions |
|
|
|
FuseCap is a framework designed to generate semantically rich image captions.
|
|
|
## Resources |
|
|
|
- 💻 **Project Page**: For more details, visit the official [project page](https://rotsteinnoam.github.io/FuseCap/).
- 📝 **Read the Paper**: You can find the paper [here](https://arxiv.org/abs/2305.17718).
- 🚀 **Demo**: Try out a [demo](https://huggingface.co/spaces/noamrot/FuseCap) of our BLIP-based model trained using FuseCap.
- 📂 **Code Repository**: The code for FuseCap can be found in the [GitHub repository](https://github.com/RotsteinNoam/FuseCap).
- 🗃️ **Datasets**: The fused caption datasets can be accessed from [here](https://github.com/RotsteinNoam/FuseCap#datasets).
|
|
|
## Running the Model
|
|
|
Our BLIP-based model can be run using the following code:
|
|
|
```python
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the FuseCap-trained BLIP captioning model and its processor.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Fetch an example image.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Condition generation on a short prefix and decode with beam search.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
```
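The snippet above captions a single image. As a minimal sketch (not from the official repository), the same `processor` and `model` can caption several images at once by passing lists and padding the batch; the URL list below simply repeats the example image as a placeholder.

```python
# Hedged sketch of batched captioning, reusing `processor`, `model`, and
# `device` from the snippet above. Replace the URLs with your own images.
import requests
from PIL import Image

urls = [
    "https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg",
    "https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw).convert("RGB") for u in urls]

# Use the same conditioning prefix for every image; padding aligns prompt lengths.
prompts = ["a picture of "] * len(images)
inputs = processor(images=images, text=prompts, return_tensors="pt", padding=True).to(device)

out = model.generate(**inputs, num_beams=3)
for caption in processor.batch_decode(out, skip_special_tokens=True):
    print(caption)
```

If GPU memory is tight, `from_pretrained` also accepts `torch_dtype=torch.float16` to load the weights in half precision.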
|
|
|
## Upcoming Updates |
|
|
|
The official codebase, datasets, and trained models for this project will be released soon.
|
|
|
## BibTeX |
|
|
|
```bibtex
@inproceedings{rotstein2024fusecap,
  title={Fusecap: Leveraging large language models for enriched fused image captions},
  author={Rotstein, Noam and Bensa{\"\i}d, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5689--5700},
  year={2024}
}
```