# MiniMax-Text-01

## 1. Introduction

MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock its long-context capabilities, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), the training context length of MiniMax-Text-01 extends to 1 million tokens, and the model can handle contexts of up to 4 million tokens during inference. On various academic benchmarks, MiniMax-Text-01 also demonstrates top-tier model performance.

## 2. Model Architecture

The architecture of MiniMax-Text-01 is briefly described as follows:

- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers (see the sketch after this list)
  - Number of attention heads: 64
  - Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
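
To make these numbers concrete, here is a minimal, self-contained sketch in plain Python (not the repository's modeling code) of the layer layout, the partial-RoPE split, and top-2 expert selection. The rule that the 8th layer in every block of 8 uses softmax attention, as well as the helper names, are our reading of the list above.

```python
# Hybrid layout sketch. Assumption: in every block of 8 layers, the
# 8th uses softmax attention and the other 7 use lightning attention.
NUM_LAYERS = 80
HEAD_DIM = 128
ROPE_BASE = 10_000_000
NUM_EXPERTS = 32
TOP_K = 2

def attention_type(layer_idx: int) -> str:
    # 0-indexed: layers 7, 15, ..., 79 are softmax attention
    return "softmax" if layer_idx % 8 == 7 else "lightning"

layout = [attention_type(i) for i in range(NUM_LAYERS)]
print(layout.count("lightning"), layout.count("softmax"))  # 70 10

# Partial RoPE: only half of each head's 128 channels are rotated;
# inverse frequencies follow the standard RoPE schedule with the
# base frequency given above.
rotary_dim = HEAD_DIM // 2  # 64
inv_freq = [ROPE_BASE ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]

# Top-2 routing: each token is dispatched to the 2 of the 32 experts
# with the highest router scores (placeholder scores here).
scores = [0.0] * NUM_EXPERTS
scores[3], scores[17] = 1.2, 0.9
top2 = sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]
print(top2)  # [3, 17]
```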

## 3. Evaluation

### Core Academic Benchmarks

| Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| **General** | | | | | | | | |
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
| **Reasoning** | | | | | | | | |
| GPQA* (diamond) | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* (F1) | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
| **Mathematics** | | | | | | | | |
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| **Coding** | | | | | | | | |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |

\* Evaluated following a 0-shot CoT setting.

### Long Benchmarks

#### 4M Needle In A Haystack Test

#### Ruler

| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
| Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | - | - | - |
| Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
| MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910 |

#### LongBench v2

| Model | overall | easy | hard | short | medium | long |
|---|---|---|---|---|---|---|
| Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
| **w/ CoT** | | | | | | |
| GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
| Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
| DeepSeek-V3 | - | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8 |
| MiniMax-Text-01 | 56.5 | 66.1 | 50.5 | 61.7 | 56.7 | 47.2 |
| **w/o CoT** | | | | | | |
| GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0 |
| DeepSeek-V3 | 48.7 | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | 44.4 |
| MiniMax-Text-01 | 52.9 | 60.9 | 47.9 | 58.9 | 52.6 | 43.5 |

#### MTOB

| Context Type | no context | half book | full book | Δ half book | Δ full book |
|---|---|---|---|---|---|
| **eng → kalam (ChrF)** | | | | | |
| GPT-4o (11-20) | 9.90 | 54.30 | - | 44.40 | - |
| Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42 |
| Gemini-1.5-Pro (002) | 16.79 | 53.68 | 57.90 | 36.89 | 41.11 |
| Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10 |
| Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39 |
| MiniMax-Text-01 | 6.0 | 51.74 | 51.60 | 45.7 | 45.6 |
| **kalam → eng (BLEURT)** | | | | | |
| GPT-4o (11-20) | 33.20 | 58.30 | - | 25.10 | - |
| Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88 |
| Gemini-1.5-Pro (002) | 32.02 | 61.52 | 63.09 | 29.50 | 31.07 |
| Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20 |
| Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02 |
| MiniMax-Text-01 | 33.65 | 57.10 | 58.00 | 23.45 | 24.35 |

## 4. Quickstart

Here we provide a simple example of loading the tokenizer and model to generate content.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMax-Text-01", trust_remote_code=True)

# quantization config, int8 is recommended; keep the head, embeddings,
# lightning-attention coefficients, and MoE gates unquantized
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# set device map: spread the decoder layers evenly across GPUs, keeping
# the embedding on the first GPU and the norm/head on the last
# (assume 8 GPUs; world_size must be defined before it is used below)
world_size = 8
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
layers_per_device = hf_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load tokenizer and build a chat-formatted prompt
tokenizer = AutoTokenizer.from_pretrained("MiniMax-Text-01")
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# tokenize and move to device (cuda:0, where the embedding layer lives)
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# load bfloat16 model, shard it across devices, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
# strip the prompt tokens, keeping only the newly generated ones
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
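
The example above caps generation at 20 tokens. For longer completions, only the `GenerationConfig` needs to change; a minimal variation, assuming the `quantized_model`, `tokenizer`, and `model_inputs` objects already built above:

```python
# Same call as before, with a larger generation budget. 200020 is the
# end-of-sentence token id used in the snippet above.
generation_config = GenerationConfig(
    max_new_tokens=512,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
```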

## 5. Chatbot & API

For general use and evaluation, we provide a Chatbot with online search capabilities and an online API for developers.

Contact us at [email protected].