MiniMax-Text-01
1. Introduction
MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock its long-context capabilities, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), the training context length of MiniMax-Text-01 is extended to 1 million tokens, and the model can handle contexts of up to 4 million tokens during inference. On various academic benchmarks, MiniMax-Text-01 also demonstrates top-tier performance.
2. Model Architecture
The architecture of MiniMax-Text-01 is briefly described as follows (a schematic sketch of the layer layout follows the list):
- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers.
- Number of attention heads: 64
- Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
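To make the hybrid layout and the MoE routing concrete, here is a minimal, self-contained sketch in plain PyTorch. The names and structure are hypothetical and not the model's actual implementation; it only illustrates how one softmax attention layer can be scheduled after every 7 lightning attention layers, and how a top-2 router can pick 2 of the 32 experts per token.

```python
# Illustrative sketch only: module names and structure are hypothetical,
# not MiniMax-Text-01's actual implementation.
import torch

NUM_LAYERS = 80
HYBRID_PERIOD = 8   # 7 lightning attention layers followed by 1 softmax attention layer
NUM_EXPERTS = 32
HIDDEN_SIZE = 6144
TOP_K = 2           # top-2 routing

def attention_type(layer_idx: int) -> str:
    """Every 8th layer uses softmax attention; all others use lightning attention."""
    return "softmax" if (layer_idx + 1) % HYBRID_PERIOD == 0 else "lightning"

layer_types = [attention_type(i) for i in range(NUM_LAYERS)]
print(layer_types[:8])               # 7 x 'lightning', then 'softmax'
print(layer_types.count("softmax"))  # 10 softmax attention layers out of 80

# Top-2 MoE routing: each token is dispatched to 2 of the 32 experts.
gate = torch.nn.Linear(HIDDEN_SIZE, NUM_EXPERTS, bias=False)
tokens = torch.randn(4, HIDDEN_SIZE)              # 4 example token hidden states
routing_probs = torch.softmax(gate(tokens), -1)   # per-expert routing probabilities
top2_weights, top2_experts = routing_probs.topk(TOP_K, dim=-1)
print(top2_experts)                               # indices of the selected experts per token
```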
3. Evaluation
Core Academic Benchmarks
Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
---|---|---|---|---|---|---|---|---|
General | ||||||||
MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
Reasoning | ||||||||
GPQA* (diamond) | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
DROP* (F1) | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
Mathematics | ||||||||
GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
Coding | ||||||||
MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
* Evaluated following a 0-shot CoT setting.
Long Benchmarks
4M Needle In A Haystack Test
Ruler
Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
---|---|---|---|---|---|---|---|---|---|
GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | - | - | - |
Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910 |
LongBench v2
Model | overall | easy | hard | short | medium | long |
---|---|---|---|---|---|---|
Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
w/ CoT | ||||||
GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
DeepSeek-V3 | - | - | - | - | - | - |
Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8 |
MiniMax-Text-01 | 56.5 | 66.1 | 50.5 | 61.7 | 56.7 | 47.2 |
w/o CoT | ||||||
GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0 |
DeepSeek-V3 | 48.7 | - | - | - | - | - |
Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | 44.4 |
MiniMax-Text-01 | 52.9 | 60.9 | 47.9 | 58.9 | 52.6 | 43.5 |
MTOB
Context Type | no context | half book | full book | Δ half book | Δ full book |
---|---|---|---|---|---|
eng → kalam (ChrF) | |||||
GPT-4o (11-20) | 9.90 | 54.30 | - | 44.40 | - |
Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42 |
Gemini-1.5-Pro (002) | 16.79 | 53.68 | 57.90 | 36.89 | 41.11 |
Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10 |
Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39 |
MiniMax-Text-01 | 6.0 | 51.74 | 51.60 | 45.7 | 45.6 |
kalam → eng (BLEURT) | |||||
GPT-4o (11-20) | 33.20 | 58.30 | - | 25.10 | - |
Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88 |
Gemini-1.5-Pro (002) | 32.02 | 61.52 | 63.09 | 29.50 | 31.07 |
Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20 |
Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02 |
MiniMax-Text-01 | 33.65 | 57.10 | 58.00 | 23.45 | 24.35 |
4. Quickstart
Here we provide a simple example of loading the tokenizer and model to generate content.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig
# load hf config
hf_config = AutoConfig.from_pretrained("MiniMax-Text-01", trust_remote_code=True)
# quantization config, int8 is recommended
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)
# assume 8 GPUs
world_size = 8
layers_per_device = hf_config.num_hidden_layers // world_size
# set device map: embeddings on the first GPU, final norm and lm_head on the last GPU
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
# spread the decoder layers evenly across the GPUs
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMax-Text-01")
prompt = "Hello!"
messages = [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
{"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")
# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
"MiniMax-Text-01",
torch_dtype="bfloat16",
device_map=device_map,
quantization_config=quantization_config,
trust_remote_code=True,
offload_buffers=True,
)
# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
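As an optional extension to the quickstart above (not part of the original example), the `TextStreamer` utility from `transformers` can print tokens as they are generated. The sketch below assumes `tokenizer`, `quantized_model`, `model_inputs`, and `generation_config` are already defined as shown above.

```python
from transformers import TextStreamer

# stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
quantized_model.generate(
    **model_inputs,
    generation_config=generation_config,
    streamer=streamer,
)
```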
5. Chatbot & API
For general use and evaluation, we provide a Chatbot with online search capabilities, as well as an online API for developers.
Contact us at [email protected].