Abstract
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
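The headline numbers in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the parameter counts and the 4-million-token figure come from the abstract, while the 128K and 200K context windows for GPT-4o and Claude-3.5-Sonnet are assumptions not stated in the text, and the helper names are hypothetical.

```python
# Figures quoted in the abstract.
TOTAL_PARAMS_B = 456.0          # total parameters, in billions
ACTIVE_PARAMS_B = 45.9          # parameters activated per token, in billions
INFERENCE_CONTEXT = 4_000_000   # tokens reachable at inference via extrapolation

# ASSUMPTION: context windows of the comparison models (not given in the abstract).
ASSUMED_CONTEXT = {"GPT-4o": 128_000, "Claude-3.5-Sonnet": 200_000}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def context_ratio(ours: int, theirs: int) -> float:
    """How many times longer one context window is than another."""
    return ours / theirs


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"activated share per token: {share:.1%}")
    for name, window in ASSUMED_CONTEXT.items():
        ratio = context_ratio(INFERENCE_CONTEXT, window)
        print(f"{name}: {ratio:.2f}x shorter context than 4M tokens")
```

Under these assumptions, roughly one tenth of the parameters are active per token, and the context ratios come out at 20 and 31.25. The upper end rounds to the quoted 32 if both windows are counted in binary units (4 × 2^20 versus 128 × 2^10 tokens), which may be how the abstract arrives at "20-32 times".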
Community
A technical report from MiniMax. The authors are listed in alphabetical order. The model is open-sourced at https://github.com/MiniMax-AI.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding (2024)
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs (2024)
- Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration (2025)
- B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens (2024)
- VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings (2024)
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models (2024)
We made a deep dive video for this paper: https://www.youtube.com/watch?v=eh7oDAxUoPg. Happy learning and stretching together!
Oh, and btw, we tried using MiniMax for this paper deep dive, but it kept hanging on us… (maybe our long text + long PDF combo was just too much? It shouldn't be, though… or maybe MiniMax just doesn't like deep diving into itself?!) That said, their PDF-on-the-side feature is super sweet for paper reading and live QA!
Thank you, great model! Here's the summary I published of it:
MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length
This work from Chinese startup MiniMax introduces a novel architecture that matches state-of-the-art performance while handling context windows of up to 4 million tokens, roughly 20-32x longer than models such as GPT-4o and Claude-3.5-Sonnet. The key was combining lightning attention and mixture of experts (MoE) in a hybrid layout that interleaves linear and softmax attention layers.
Key insights:
MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines lightning attention (linear complexity) for most layers with traditional softmax attention every 8 layers
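The hybrid layout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MiniMax implementation: `layer_types` is a hypothetical helper that marks every 8th layer as softmax attention, and `causal_linear_attention` is plain linear attention with a running key-value state (no normalization, gating, or the I/O-aware tiling that lightning attention adds). The quadratic reference is included only to show that both forms give the same output while the recurrent one costs time linear in sequence length.

```python
from typing import List

Matrix = List[List[float]]


def layer_types(num_layers: int, softmax_every: int = 8) -> List[str]:
    """Mark every `softmax_every`-th layer as softmax attention, the rest as linear."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "linear"
        for i in range(num_layers)
    ]


def causal_linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """o_t = q_t^T (sum_{s<=t} k_s v_s^T), using a running d_k x d_v state.

    Each step updates the state once, so the cost grows linearly with length.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: Matrix = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def causal_quadratic_reference(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Same result computed as a causally masked (Q K^T) V, which costs O(n^2)."""
    outputs: Matrix = []
    for t, q_t in enumerate(q):
        out = [0.0] * len(v[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q_t, k[s]))
            for j in range(len(out)):
                out[j] += score * v[s][j]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    print(layer_types(16))
    q = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
    v = [[1.0], [2.0], [3.0]]
    print(causal_linear_attention(q, k, v))
    print(causal_quadratic_reference(q, k, v))
```

Dropping the softmax is what allows the sum over past keys and values to be folded into a fixed-size state; the occasional softmax layer keeps full pairwise attention available at a fraction of the depth.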
Matches leading models across benchmarks while offering vastly longer context:
‣ Competitive with GPT-4o/Claude-3.5-Sonnet on most tasks
‣ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)
Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
Thorough training strategy:
‣ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!
Overall, not only is the model impressive, but the technical paper is also really interesting!
It has lots of insights, including a great comparison showing how an MoE with 2B activated parameters (24B total) far outperforms a 7B dense model for the same amount of FLOPs.
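One way to see why an equal-FLOPs comparison can favor the MoE is the common rule of thumb that training compute is about 6 × (activated parameters) × (training tokens). The sketch below assumes that approximation and an arbitrary example budget; neither is taken from the paper, and the function name is hypothetical. It shows only how many tokens each model could see for a fixed budget, not how well either model performs.

```python
def tokens_for_budget(train_flops: float, active_params: float) -> float:
    """Tokens affordable under the rule of thumb FLOPs ~= 6 * active_params * tokens.

    ASSUMPTION: this approximation is a general heuristic, not the paper's accounting.
    """
    return train_flops / (6.0 * active_params)


if __name__ == "__main__":
    budget = 1e22  # hypothetical training budget in FLOPs
    moe_tokens = tokens_for_budget(budget, 2e9)    # MoE: 2B activated (24B total)
    dense_tokens = tokens_for_budget(budget, 7e9)  # dense: 7B activated
    print(f"MoE tokens:   {moe_tokens:.3e}")
    print(f"dense tokens: {dense_tokens:.3e}")
    print(f"ratio:        {moe_tokens / dense_tokens:.2f}")
```

Under this heuristic the MoE trains on 3.5 times as many tokens for the same compute, while still holding 24B parameters of capacity in total.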
Model here (allows commercial use for <100M monthly users): https://huggingface.co/MiniMaxAI/MiniMax-Text-01