cooleel's Collections
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages • arXiv:2410.16153 • 44 upvotes
AutoTrain: No-code training for state-of-the-art models • arXiv:2410.15735 • 59 upvotes
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio • arXiv:2410.12787 • 31 upvotes
LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks • arXiv:2410.01744 • 26 upvotes
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models • arXiv:2410.14059 • 55 upvotes
NVLM: Open Frontier-Class Multimodal LLMs • arXiv:2409.11402 • 73 upvotes
MIO: A Foundation Model on Multimodal Tokens • arXiv:2409.17692 • 53 upvotes
Emu3: Next-Token Prediction is All You Need • arXiv:2409.18869 • 94 upvotes
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models • arXiv:2409.17146 • 106 upvotes
Analyzing The Language of Visual Tokens • arXiv:2411.05001 • 23 upvotes
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination • arXiv:2411.03823 • 43 upvotes
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration • arXiv:2410.02367 • 47 upvotes
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks • arXiv:2410.05160 • 4 upvotes
arXiv:2410.07073 • 63 upvotes
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models • arXiv:2410.09732 • 54 upvotes
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction • arXiv:2410.17247 • 45 upvotes
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration • arXiv:2411.10958 • 52 upvotes
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts • arXiv:2411.10669 • 10 upvotes
Autoregressive Models in Vision: A Survey • arXiv:2411.05902 • 17 upvotes
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation • arXiv:2411.04997 • 37 upvotes
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models • arXiv:2411.04996 • 50 upvotes
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling • arXiv:2412.05271 • 123 upvotes
Progressive Multimodal Reasoning via Active Retrieval • arXiv:2412.14835 • 71 upvotes
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval • arXiv:2412.14475 • 52 upvotes
POINTS1.5: Building a Vision-Language Model towards Real World Applications • arXiv:2412.08443 • 38 upvotes
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey • arXiv:2412.18619 • 49 upvotes
FastVLM: Efficient Vision Encoding for Vision Language Models • arXiv:2412.13303 • 13 upvotes