L-Hongbin's Collections
MutiModal_Paper
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation • arXiv:2410.13861 • 53 upvotes
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation • arXiv:2411.07975 • 27 upvotes
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization • arXiv:2411.10442 • 71 upvotes
Multimodal Autoregressive Pre-training of Large Vision Encoders • arXiv:2411.14402 • 43 upvotes
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding • arXiv:2411.14347 • 13 upvotes
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models • arXiv:2411.14982 • 16 upvotes
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction • arXiv:2411.14762 • 11 upvotes
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives • arXiv:2411.02545 • 1 upvote
Hymba: A Hybrid-head Architecture for Small Language Models • arXiv:2411.13676 • 40 upvotes
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory • arXiv:2411.11922 • 18 upvotes
ShowUI: One Vision-Language-Action Model for GUI Visual Agent • arXiv:2411.17465 • 77 upvotes
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration • arXiv:2411.17686 • 19 upvotes
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting • arXiv:2411.17223 • 5 upvotes
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity • arXiv:2411.15411 • 7 upvotes
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI • arXiv:2411.14522 • 31 upvotes
Knowledge Transfer Across Modalities with Natural Language Supervision • arXiv:2411.15611 • 15 upvotes
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding • arXiv:2411.18363 • 9 upvotes
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality • arXiv:2411.15241 • 5 upvotes
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient • arXiv:2411.17787 • 11 upvotes
On Domain-Specific Post-Training for Multimodal Large Language Models • arXiv:2411.19930 • 25 upvotes
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos • arXiv:2409.19603 • 19 upvotes
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding • arXiv:2406.19389 • 52 upvotes
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning • arXiv:2412.03248 • 26 upvotes
CompCap: Improving Multimodal Large Language Models with Composite Captions • arXiv:2412.05243 • 18 upvotes
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion • arXiv:2412.04424 • 59 upvotes
POINTS1.5: Building a Vision-Language Model towards Real World Applications • arXiv:2412.08443 • 38 upvotes
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • arXiv:2412.08737 • 52 upvotes
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding • arXiv:2412.09604 • 35 upvotes
Learned Compression for Compressed Learning • arXiv:2412.09405 • 11 upvotes
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer • arXiv:2412.13871 • 18 upvotes
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities • arXiv:2412.14123 • 11 upvotes
FastVLM: Efficient Vision Encoding for Vision Language Models • arXiv:2412.13303 • 13 upvotes
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models • arXiv:2412.05939 • 13 upvotes
Grounding Descriptions in Images informs Zero-Shot Visual Recognition • arXiv:2412.04429