Uncorrectable Error replay (Beta)

An uncorrectable error (UE) results when the data that is read from memory does not match the original data that was intended to be stored. The UE replay feature, now in beta, attempts to automatically recover from UEs and continue a training run if the run encounters a UE.

UEs during training runs

Models have a greater chance of encountering a UE if a model is large and runs for a long time. If a UE occurs, the user needs to restart the run from a checkpoint. User-initiated restarts waste time.

User-initiated restarts. There is usually a delay between when the user notices the UE occurred and when the UE stopped the run. For example, if the UE happens at 2 am and it isn’t noticed until 9 am, and the user then has to reload a checkpoint, many hours have passed.
More frequent checkpointing. Because of the risk of UEs, users might save checkpoints more often to ensure that less of the training run is lost if a UE happens. Saving checkpoints can take hours per checkpoint.

How UE replay works

UE replay attempts to catch UEs and to automatically recover from them so the run can continue without user action.

If a UE is replayed, the run log displays a Samba warning Replaying 1-th time, Replaying 2-th time, etc.
If a UE cannot be replayed, the run raises the UE exception and stops. The feature replays only 5 UEs before terminating the run on the 6th UE. User interaction is needed to recover from the 6th UE.

Turn on UE replay

UE replay is disabled by default. You can turn UE replay on with the following run flags:

Enable DRAM UE replay by adding the following flag to your run command: --enable-dram-ue-replay.
Enable scratchpad UE replay by adding the following flag to your run command: --enable-scratchpad-ue-replay.

Since --enable-dram-ue-replay and --enable-scratchpad-ue-replay are run only flags, it is important that these are not included if you are trying to compile a model since they will not be recognized by the argparser.

UE replay limitations

Because UE replay is still in beta, there are certain limitations.

Ineligibility

SambaFlow checks if a training run is ineligible for replay. If the run encounters a UE but the model or run is ineligible, the run errors out.

The model itself can be ineligible for replay. Models that first read from a tensor and then overwrite the tensor during the forward or backward pass are ineligible for UE replay.
The location in the computational graph where the UE occurs can make a run ineligible for replay.

These cases include:

UE occurs in a section other than the FWD, BCKWD, or ZEROGRAD sections
Model contains batchnorm layers
Model contains an embedding layer and uses the SGD optimizer

If a UE is replayed, the replay mechanism introduces a non-determinism into the run because different random seeds are used after the replay occurs. If a run without the UEs is compared to a run that replayed a UE, there is a difference between the two runs after the step when the UE occurred. However, the computed values from both runs are still functionally correct.

Replay of ineligible UE

Before you use the UE replay feature, consider corner cases and actions to take in case there’s a replay of an ineligible UE.

CORNER CASE: There’s a small chance that a UE is replayed when the UE is not replayable. In these instances, the model outputs will be incorrect.

ACTION: Double check the loss curves to make sure the values look reasonable. If the loss values are not expected, check the run log whether a UE was replayed around the time the loss becomes unexpected. This likely indicates that the UE was replayed even though the run should have been ineligible for replay. If that happens, resume the run from the latest good checkpoint and turn off UE replay.

CORNER CASE: The current implementation of UE replay doesn’t catch all instances of model ineligibility.

ACTION: Before using UE replay, ensure that during the forward and backward passes, the model does not first read from a tensor and then overwrite that tensor. For example, if a model has buffers that are updated using the original value, then that model is ineligible for UE replay. If your application first reads from a tensor and then overwrites that tensor, UE replay cannot recover the original value of the tensor for replay and the run produces incorrect values.

Gradient accumulation support

SambaFlow currently supports UE replay with gradient accumulation support only for NLP apps that use --module_name gpt2_pretrain --task_name clm in non-data-parallel mode or normal data-parallel mode.