Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos • Paper • 2501.04001 • Published 11 days ago • 40
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper • 2501.03895 • Published 11 days ago • 48
An Empirical Study of Autoregressive Pre-training from Videos • Paper • 2501.05453 • Published 9 days ago • 36
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training • Paper • 2501.07556 • Published 5 days ago • 5