Collections including paper arxiv:2311.05437

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181 upvotes
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 15 upvotes
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 48 upvotes
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 41 upvotes

- Attention Is All You Need
  Paper • 1706.03762 • Published • 50 upvotes
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 16 upvotes
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  Paper • 1907.11692 • Published • 7 upvotes
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  Paper • 1910.01108 • Published • 14 upvotes

- Extending Context Window of Large Language Models via Semantic Compression
  Paper • 2312.09571 • Published • 12 upvotes
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 48 upvotes
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
  Paper • 2312.02949 • Published • 11 upvotes
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models
  Paper • 2402.14289 • Published • 19 upvotes

- Visual In-Context Prompting
  Paper • 2311.13601 • Published • 16 upvotes
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
  Paper • 2308.08155 • Published • 3 upvotes
- LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
  Paper • 2303.02927 • Published • 3 upvotes
- The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
  Paper • 2311.07361 • Published • 12 upvotes

- LayoutPrompter: Awaken the Design Ability of Large Language Models
  Paper • 2311.06495 • Published • 10 upvotes
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
  Paper • 2311.06783 • Published • 26 upvotes
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 48 upvotes
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
  Paper • 2311.04589 • Published • 18 upvotes

- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 48 upvotes
- On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
  Paper • 2311.05332 • Published • 9 upvotes
- SoundCam: A Dataset for Finding Humans Using Room Acoustics
  Paper • 2311.03517 • Published • 10 upvotes