Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Paper • 2411.16863 • Published Nov 25, 2024
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization Paper • 2408.14547 • Published Aug 26, 2024
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models Paper • 2311.16254 • Published Nov 27, 2023
aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning Image-Text-to-Text • Updated Aug 16, 2024 • 17 • 2
aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-finetuning Image-Text-to-Text • Updated Aug 16, 2024 • 17 • 1
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities Paper • 2407.20337 • Published Jul 29, 2024
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs Paper • 2404.15406 • Published Apr 23, 2024