Update README.md
README.md (CHANGED)
@@ -9,7 +9,7 @@ tags:
 - multimodal
 pipeline_tag: video-text-to-text
 model-index:
-- name: VideoChat-Flash-Qwen2_5-
+- name: VideoChat-Flash-Qwen2_5-1_5B_res448
   results:
   - task:
       type: multimodal
@@ -78,7 +78,7 @@ model-index:
 # 🦜VideoChat-Flash-Qwen2_5-2B_res448⚡
 [\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) [\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-Flash) [\[📜 Tech Report\]](https://www.arxiv.org/abs/2501.00574) [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash)

-VideoChat-Flash-2B is constructed upon UMT-L (300M) and
+VideoChat-Flash-2B is constructed upon UMT-L (300M) and Qwen2.5-1.5B, employing only **16 tokens per frame**. By leveraging Yarn to extend the context window to 128k (Qwen2's native context window is 32k), our model supports input sequences of up to approximately **10,000 frames**.

 > Note: Due to a predominantly English training corpus, the model only exhibits basic Chinese comprehension, to ensure optimal performance, using English for interaction is recommended.
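The updated card describes the architecture only in prose (UMT-L vision encoder + Qwen2.5-1.5B LLM, 16 tokens per frame, YaRN-extended 128k context), so here is a minimal loading sketch for illustration. The repo id `OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448` is inferred from the card heading rather than stated in the diff, and the exact inference API is defined by the repository's own remote code; treat this as an assumption, not the card's documented usage.

```python
# Hypothetical loading sketch; repo id is inferred from the card heading, not from the diff.
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448"  # assumed Hub repo id

# trust_remote_code=True lets transformers import the custom model and
# video-processing code shipped inside the repository (common for video LLMs).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
```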