--- license: other license_name: seallms license_link: https://huggingface.co/SeaLLMs/SeaLLM-13B-Chat/blob/main/LICENSE language: - en - zh - vi - id - th - ms - km - lo - my - tl tags: - multilingual - sea ---
> SeaLLM will be able to "see"! # *SeaLMMM-7B* - Large Multilingual Multimodal Models for Southeast Asia
Website 🤗 Tech Memo 🤗 DEMO Github Technical Report
We introduce and [showcase](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B) the first iteration of [SeaLMMM](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) -- A unified multilingual and multimodal that excel in both text-only and vision tasks in multiple languages spoken in Southeast Asia. ### SeaLMMM-7B abilities * SeaLMMM-7B is one of the strongest 7B vision-language models at **text-only tasks**, with performance similar to [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2). It is a text-first-vision-second model. * SeaLMMM-7B **is** able to handle most SEA languages, making it more multilingual than En-only LLava, Bilingual (En+Zh) Qwen-VL or Yi-VL. * Unlike LLava or specialized VLMs, which demand only one image at the begining, SeaLMMM-7B can seamlessly handle text-only conversations at the begining and visual instructions in the middle of the conversations and support topic and language switching. * SeaLMMM-7B can carry multi-image generation or in-context visual learning, in which case, [Better llava next](https://github.com/huggingface/transformers/pull/29850) should be applied to enable such feature. ### Release and DEMO - DEMO: [SeaLLMs/SeaLLM-7b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B). - Model weights: - [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1). - Explore SeaLLMs: - [SeaLLMs/SeaLLM-7B-v2.5](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v2.5). - [SeaLLMs/SeaLLM-7B-v2](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v2). - [SeaLLMs/SeaLLM-7B-v1](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v1).> **Disclaimer**: > We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety fine-tuning and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading or potentially harmful generation. > Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations. > In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos. > The logo was generated by DALL-E 3. ## Overview SeaLMMM-7B-v0.1 is a multimodal extension of [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2). It adopts the [Llava-1.6](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) (Llava-NEXT) architecture. It is trained by jointly train SeaLLM's multilingual text-only datasets along with Llava-1.5 English-only vision data, as well as in-house synthetically generated multilingual multimodal vision data and open-source data, such as [ThaiIDCardSynt](https://huggingface.co/datasets/matichon/ThaiIDCardSynt). ### English Vision QA Tasks | Multimodal Models | VQA2 | GQA | Vizwiz | SQA-IMG | TextQA | --- | --- | --- | --- | --- | --- | | Qwen-VL-Chat | 78.20 | 57.50 | 38.90 | 68.20 | 61.50 | Llava-1.5-7b | 78.50 | 62.00 | 50.00 | 66.80 | 58.20 | Llava-1.5-13b | 80.00 | 63.30 | 53.60 | 71.60 | 61.30 | [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) | 80.14 | 61.58 | 58.00 | 71.79 | 63.47 ### Multilingual Text-only World Knowledge We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for En, 3-shot [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) (M3e) for En, Zh, Vi, Id, Th. On text-only benchmarks, [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) is generally on-par with [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2) - its base LLM model. This demonstrates that our multimodal training regime does not vastly degrade text-only performance. | Model | Langs | EnTerms of Use and License: By using our released weights, codes, and demos, you agree to and comply with the terms and conditions specified in our SeaLLMs Terms Of Use.