- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 106 upvotes
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76 upvotes
- mistralai/Pixtral-12B-2409
  Image-Text-to-Text • Updated • 567 likes
- HuggingFaceTB/SmolVLM-Instruct
  Image-Text-to-Text • Updated • 47.4k downloads • 313 likes (see the loading sketch below)
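
The two checkpoints above are Hugging Face Hub models, so they can be queried through the transformers library. Below is a minimal sketch for the SmolVLM-Instruct entry, assuming the standard AutoProcessor/AutoModelForVision2Seq chat-template flow from its model card; the image path is a placeholder, and preprocessing details can differ between checkpoints.

```python
# Minimal sketch: prompting HuggingFaceTB/SmolVLM-Instruct via transformers.
# Assumes the standard Vision2Seq chat-template flow; "example.jpg" is a
# placeholder path, not a file from this page.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]
# Render the chat template, bundle text + image, and generate a reply.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```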

Collections including paper arxiv:2409.17146

Collection 1:
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 87 upvotes
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 13 upvotes
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 37 upvotes
- PALO: A Polyglot Large Multimodal Model for 5B People
  Paper • 2402.14818 • Published • 23 upvotes

Collection 2:
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 124 upvotes
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 58 upvotes
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  Paper • 2408.16725 • Published • 53 upvotes
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
  Paper • 2408.15998 • Published • 84 upvotes

Collection 3:
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 51 upvotes
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
  Paper • 2408.08872 • Published • 98 upvotes
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 124 upvotes
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 51 upvotes

Collection 4:
- VILA²: VILA Augmented VILA
  Paper • 2407.17453 • Published • 40 upvotes
- Octopus v4: Graph of language models
  Paper • 2404.19296 • Published • 116 upvotes
- Octo-planner: On-device Language Model for Planner-Action Agents
  Paper • 2406.18082 • Published • 48 upvotes
- Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
  Paper • 2408.15518 • Published • 42 upvotes

Collection 5:
- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 66 upvotes
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 127 upvotes
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 53 upvotes
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 87 upvotes