kevin1020's Collections
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper
•
2403.12596
•
Published
•
9
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper
•
2404.13013
•
Published
•
30
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper
•
2404.16994
•
Published
•
35
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Paper
•
2405.14129
•
Published
•
12
Dense Connector for MLLMs
Paper
•
2405.13800
•
Published
•
22
Merlin: Empowering Multimodal LLMs with Foresight Minds
Paper
•
2312.00589
•
Published
•
24
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Paper
•
2407.15754
•
Published
•
20
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper
•
2407.15841
•
Published
•
40
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Paper
•
2407.18121
•
Published
•
17
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper
•
2409.01071
•
Published
•
27
LongVLM: Efficient Long Video Understanding via Large Language Models
Paper
•
2404.03384
•
Published
Visual Context Window Extension: A New Perspective for Long Video Understanding
Paper
•
2409.20018
•
Published
•
10
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Paper
•
2410.10594
•
Published
•
24