merve's Collections
MIT Talk 31/10 Papers
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 73
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 18
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 45
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 106
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 40
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 76
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper • 2406.11251 • Published • 10
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 60
ColPali: Efficient Document Retrieval with Vision Language Models
Paper • 2407.01449 • Published • 42
Paper • 2410.07073 • Published • 63
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 124
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 68
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 6