Part 2 - Increase your training throughput with FSDP activation checkpointing
# Using Activation Checkpointing with Fully Sharded Data Parallel (FSDP) in PyTorch
## Introduction
In this article, we will explore how to use activation checkpointing with Fully Sharded Data Parallel (FSDP) in PyTorch. This technique trades a modest amount of recomputation for a large reduction in GPU memory usage, which is particularly valuable when training large models. The walkthrough follows the video by Les from Meta AI step by step, restated here for clarity and readability.
---
## Setting Up the Environment
The first step is to make sure you have a recent enough version of PyTorch. As mentioned in the video, activation checkpointing for FSDP landed in the June 18th nightlies, so any PyTorch build from that date onward will work.
You will need to import a few key libraries:
```python
import torch
from functools import partial
```
Additionally, you’ll need the activation checkpointing imports:
```python
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
    CheckpointImpl,
    apply_activation_checkpointing_wrapper,
)
```
These are the standard imports for using activation checkpointing with FSDP. (In more recent PyTorch releases, the last helper is exposed as `apply_activation_checkpointing`.)
---
## Understanding Checkpoint Wrapping
The core of the implementation involves creating a **checkpoint wrapper**. This wrapper will be used to wrap layers in your model that you want to checkpoint during training. The process involves:
1. Bringing in the `checkpoint_wrapper` function.
2. Using `CheckpointImpl`, an enum with two options: `REENTRANT` and `NO_REENTRANT`. For the best performance, we’ll use `NO_REENTRANT`.
3. Using `functools.partial` to bind that choice into a reusable wrapper function that can then be applied across your model.
The key trade-off here is between performance and memory usage. The `NO_REENTRANT` implementation provides the best performance while still allowing you to free up GPU memory during training.
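As a minimal sketch of this step (the variable name `non_reentrant_wrapper` is our own choice, not from the original walkthrough), `functools.partial` binds the implementation choice into a reusable wrapper function:
```python
# Build a reusable wrapper that applies non-reentrant activation checkpointing.
# partial binds the checkpoint_impl choice so the wrapper can later be handed to
# the apply helper and invoked once per selected submodule.
non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
```
This `non_reentrant_wrapper` is what we will pass as the `checkpoint_wrapper_fn` in the next step.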
---
## Applying Checkpoint Wrappers
To identify which layers in your model should be checkpointed, we define a check function, typically a lambda. The apply helper walks through your model and calls this function on every submodule; any submodule for which it returns `True` gets wrapped. For example:
```python
# Hypothetical example: checkpoint every transformer encoder layer.
# Substitute the block class your own model actually uses.
check_fn = lambda submodule: isinstance(submodule, torch.nn.TransformerEncoderLayer)
```
This assumes you’re working with a transformer-based model whose per-layer blocks are the units you want to checkpoint; in practice you would substitute your model’s own block class.
Once you’ve defined your check function, you can apply the checkpoint wrappers using:
```python
apply_activation_checkpointing_wrapper(
    model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=check_fn,
)
```
Here, `model` is your sharded model (already wrapped with FSDP), `checkpoint_wrapper_fn` is the wrapper we built with `partial`, and `check_fn` is the lambda that identifies which layers to wrap.
---
## Initializing FSDP and Sharding the Model
Before applying activation checkpointing, you must initialize FSDP and shard your model. This involves setting up parameters like:
- **Sharding strategy**: Decide how your model weights (and optionally gradients and optimizer state) are distributed across GPUs.
- **Mixed precision policy**: Set the desired precision for parameters, gradient communication, and buffers (e.g., bf16, fp16, or fp32).
- **Auto-wrap policy**: Define how submodules are grouped into FSDP units, typically one unit per transformer block.
Once your model is sharded, you pass it to `apply_activation_checkpointing_wrapper` along with your wrapper and check functions, as sketched below.
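Here is a sketch of the overall flow, under a few assumptions that are not from the original walkthrough: the model’s blocks are `torch.nn.TransformerEncoderLayer` (substitute your own block class), the distributed process group is already initialized, and your hardware supports bf16:
```python
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Mixed precision policy: keep parameters, gradient reduction, and buffers in bf16.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Auto-wrap policy: shard the model at the granularity of its transformer blocks.
wrap_policy = partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={torch.nn.TransformerEncoderLayer},
)

# Shard the model first...
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=bf16_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    device_id=torch.cuda.current_device(),
)

# ...then apply activation checkpointing to the already-sharded model.
apply_activation_checkpointing_wrapper(
    model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=check_fn,
)
```
The ordering matters: wrap with FSDP first, then apply the activation checkpointing wrappers to the sharded model.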
---
## Performance Considerations
There’s one main trade-off when using activation checkpointing: a slowdown in training time. You should expect roughly a 20-25% increase in per-iteration time, because the checkpointed layers discard their intermediate activations during the forward pass and recompute them during the backward pass, effectively running their forward computation twice. However, this trade-off is more than offset by the memory savings.
From experience with models like T5, activation checkpointing has been shown to free up roughly **33% to 38% of GPU memory**. This freed-up memory can be used to increase your batch size, which in turn can improve training throughput by a factor of **2-3x** or more.
This makes activation checkpointing an invaluable tool for improving both memory efficiency and overall training throughput with FSDP.
---
## Conclusion
Using activation checkpointing with FSDP is a powerful technique for reducing memory usage when training large neural networks. By following the steps outlined in this article, you can apply it effectively and balance the modest recomputation cost against the substantial memory savings and throughput gains.