Crash in Fine-tuning

#14
by tanliboy - opened

I consistently ran into this error while fine-tuning the Phi3 small models. I tried multiple fine-tuning recipes but got the same error. These fine-tuning recipes work fine for both Phi3 mini and medium models.

[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
[rank4]:     tr_loss_step = self.training_step(model, inputs)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
[rank4]:     self.accelerator.backward(loss)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/accelerate/accelerator.py", line 2117, in backward
[rank4]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank4]:     self.engine.backward(loss, **kwargs)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1936, in backward
[rank4]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2093, in backward
[rank4]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank4]:     scaled_loss.backward(retain_graph=retain_graph)
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[rank4]:     torch.autograd.backward(
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank4]:     _engine_run_backward(
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank4]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[rank4]:     return user_fn(self, *args)
[rank4]:   File "/home/litan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/f5527db8a43fc9a4bf17c5b754251e1efe1d4ad3/triton_flash_blocksparse_attn.py", line 904, in backward
[rank4]:     return _backward(ctx, do, *backward_layout)[:4]
[rank4]:   File "/home/litan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/f5527db8a43fc9a4bf17c5b754251e1efe1d4ad3/triton_flash_blocksparse_attn.py", line 655, in _backward
[rank4]:     q, k, v, o, l, m, layout_crow_indices, layout_col_indices = ctx.saved_tensors
[rank4]:   File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1131, in unpack_hook
[rank4]:     raise CheckpointError(
[rank4]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Unpack is being triggered for a tensor that was already unpacked once. If you are calling ctx.saved_tensors in backward, make sure to do so only once. Otherwise please open an issue with details on your use case.

Hi!
For torch activation checkpointing, can you try setting use_reentrant=True?
We've seen this error come from a weird interaction between use_reentrant=False in torch's activation checkpointing and the blocksparse kernel (a similar problem has also been observed with torch FSDP; see this issue).

Let me know if that helps!

Thank you, @bapatra! It works now after changing the use_reentrant parameter to True.
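For anyone landing here with the same traceback, here is a minimal sketch of what the change can look like when training through the Hugging Face Trainer. It assumes a transformers version whose TrainingArguments accepts gradient_checkpointing_kwargs; the dictionary name training_kwargs is only illustrative, and recipe-based setups may expose the same two keys in a YAML config instead.

```python
# Keyword arguments to merge into your existing TrainingArguments call.
# Assumption: your transformers version supports `gradient_checkpointing_kwargs`.
training_kwargs = {
    # Turn on activation (gradient) checkpointing.
    "gradient_checkpointing": True,
    # Forwarded to torch.utils.checkpoint; True selects the reentrant variant,
    # which is the setting reported to work with the blocksparse kernel here.
    "gradient_checkpointing_kwargs": {"use_reentrant": True},
}

# Hypothetical usage (output_dir and the rest of your arguments are up to you):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **training_kwargs)
print(training_kwargs)
```

The only change relative to a recipe that crashes is the value of use_reentrant; everything else in the training setup can stay as it was.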

According to the documentation, use_reentrant affects how intermediate activations are recorded and recomputed. In my case, I used DeepSpeed ZeRO-3 for distributed training. If this issue is related to inconsistent results after recomputing, will this randomness impact the fine-tuning results?

ZeRO-3 is orthogonal to activation checkpointing. ZeRO partitions (and can optionally offload) optimizer states, gradients, and parameters, while activation checkpointing avoids storing some of the activation tensors that the forward pass would normally cache for the backward pass, and recomputes them during the backward pass instead.

As I understand it, use_reentrant=True is the naive way of doing checkpointing: it does not save any intermediate activations or record the compute graph during the forward pass, and instead recomputes the entire forward function during the backward pass (so it may be inefficient, but it is not inexact; see this for a more detailed discussion).
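To make the "not inexact" point concrete, here is a small sanity check, offered as a sketch rather than a proof. It uses a toy function (block, a hypothetical stand-in for a transformer layer) on CPU and compares gradients computed with and without reentrant checkpointing; it does not involve DeepSpeed or the blocksparse kernel.

```python
import torch
from torch.utils.checkpoint import checkpoint


def block(x, w):
    # Toy stand-in for a layer: deterministic, no dropout or other randomness.
    return torch.tanh(x @ w)


torch.manual_seed(0)
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)

# Reference gradients from an ordinary forward/backward pass.
block(x, w).sum().backward()
ref_gx, ref_gw = x.grad.clone(), w.grad.clone()
x.grad, w.grad = None, None

# Same computation under reentrant checkpointing: the forward pass is rerun
# during backward instead of keeping its intermediate activations.
checkpoint(block, x, w, use_reentrant=True).sum().backward()

grads_match = torch.allclose(x.grad, ref_gx) and torch.allclose(w.grad, ref_gw)
print("gradients match:", grads_match)
```

For layers that do use randomness such as dropout, torch.utils.checkpoint preserves the RNG state by default (preserve_rng_state=True), so the recomputed forward pass sees the same random draws as the original one.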

Thanks for the explanation, @bapatra!

tanliboy changed discussion status to closed