2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 17 days ago • 95
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Paper • 2501.03895 • Published 11 days ago • 48
Table Transformer Collection The Table Transformer (TATR) is a series of object detection models useful for table extraction from PDF images. • 5 items • Updated 10 days ago • 20