Uncorrectable Error Replay (Beta)

An uncorrectable error (UE) occurs when the data read from memory does not match the data that was originally stored and the hardware cannot correct the mismatch. UEs are caused by interference from background radiation such as neutrons and cosmic rays. These errors can occur at random in the hardware and halt the current run because the affected memory location is corrupted and its data is no longer valid.

UEs during training runs

Long training runs typically hit a UE eventually, and users must restart the run from a checkpoint to continue. This wastes time in two ways:

  • User-initiated restarts. There is usually a delay between when the UE stops the run and when the user notices that it happened. For example, if the UE happens at 2 am and isn’t noticed until 9 am, and the user then has to reload a checkpoint, many hours have already been lost.

  • More frequent checkpointing. The risk of UEs causes users to save checkpoints more often so that less of the training run is lost if a UE happens. Saving a checkpoint can take hours.

The UE replay feature, now in beta, attempts to automatically recover from a UE and continue the training run. Less time is wasted because fewer UEs interrupt the run and the user doesn’t have to save checkpoints as often. With UE replay, a run requires less user interaction while remaining accurate in most cases.

How UE replay works

UE replay attempts to catch any UEs that occur and to automatically recover from them so the run can continue without user action. The feature supports recovery for DRAM UEs and scratchpad UEs. UE replay cannot recover from all UEs.

If a UE is replayed, the run log displays a Samba warning such as Replaying 1-th time, Replaying 2-th time, and so on. If a UE cannot be replayed, the run raises the UE exception and stops. The feature replays at most 5 UEs; a 6th UE terminates the run, and user interaction is needed to recover from it.
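
Conceptually, this behavior amounts to a bounded retry loop around each training step. The Python sketch below is only an illustration of that policy, not SambaFlow code; the names UncorrectableError, run_with_ue_replay, and run_training_step are hypothetical.

    # Conceptual sketch of the documented replay policy; not SambaFlow code.
    # UncorrectableError, run_with_ue_replay, and run_training_step are
    # hypothetical names used only to illustrate the behavior: up to 5 UEs
    # are replayed, and a 6th terminates the run.
    MAX_REPLAYS = 5

    class UncorrectableError(RuntimeError):
        """Stand-in for the UE exception raised by the runtime."""

    def run_with_ue_replay(run_training_step, total_steps):
        replays = 0
        step = 0
        while step < total_steps:
            try:
                run_training_step(step)  # may raise UncorrectableError
                step += 1                # step succeeded; move on
            except UncorrectableError:
                replays += 1
                if replays > MAX_REPLAYS:
                    # The 6th UE terminates the run; user action is needed.
                    raise
                print(f"Samba warning: Replaying {replays}-th time")
                # The loop falls through and retries the same step.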

Turn on UE replay

You can turn UE replay on or off with environment variables.

  • Enable DRAM UE replay with the environment variable DRAM_UE_REPLAY=1.

  • Enable scratchpad UE replay with the environment variable SCRATCHPAD_UE_REPLAY=1.
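
For example, the variables can be set from Python before the run is launched. The sketch below assumes the variables are read when the SambaFlow runtime initializes, so they must be set (or exported in the shell) before the run starts.

    import os

    # Sketch only: enable UE replay for DRAM and scratchpad UEs by setting the
    # documented environment variables. This assumes the variables are read
    # when the SambaFlow runtime initializes, so set them (or export them in
    # the shell) before the run starts.
    os.environ["DRAM_UE_REPLAY"] = "1"
    os.environ["SCRATCHPAD_UE_REPLAY"] = "1"

    # To turn UE replay off again, set the variables to "0":
    # os.environ["DRAM_UE_REPLAY"] = "0"
    # os.environ["SCRATCHPAD_UE_REPLAY"] = "0"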

UE replay restrictions

Because UE replay is still in beta, there are certain limitations.

Ineligibility

SambaFlow automatically checks for some cases where a training run is ineligible for replay. If the run encounters a UE but the model or run is ineligible, the run errors out automatically.

  • The model itself can be ineligible for replay. Models that first read from a tensor and then overwrite the tensor during the forward or backward pass are ineligible for UE replay.

  • The location in the computational graph where the UE occurs can make a run ineligible for replay. For instance, if a UE occurs when running an OPT section, the run cannot be replayed.

Cases that make a run ineligible for replay include:

  • UE occurs in a section other than the FWD, BCKWD, or ZEROGRAD sections

  • Model contains batchnorm layers

  • Model contains an embedding layer and uses the SGD optimizer
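
As a concrete illustration, a plain PyTorch model like the hypothetical one below would be ineligible on two of these counts: it contains a batchnorm layer, and it combines an embedding layer with the SGD optimizer.

    import torch
    import torch.nn as nn

    # Hypothetical model, shown only to make the conditions above easy to
    # recognize: it contains a batchnorm layer AND pairs an embedding layer
    # with the SGD optimizer.
    seq_len = 16
    model = nn.Sequential(
        nn.Embedding(num_embeddings=1000, embedding_dim=64),  # embedding layer
        nn.Flatten(),                       # (batch, seq_len * 64)
        nn.Linear(seq_len * 64, 128),
        nn.BatchNorm1d(128),                # batchnorm layer
        nn.ReLU(),
        nn.Linear(128, 10),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # SGD optimizer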

If a UE is replayed, the replay mechanism introduces non-determinism into the run because different random seeds are used after the replay. As a result, if a run without UEs is compared to a run that replayed a UE, the two runs diverge after the step at which the UE occurred. However, the computed values from both runs are still functionally correct.

Replay of ineligible UE

Before you use the UE replay feature, consider the following corner cases and the actions to take if an ineligible UE is replayed.

There is a small chance that a UE is replayed even though it is not replayable. In these instances, the model outputs will be incorrect.

ACTION: Double-check the loss curves to make sure the values look reasonable. If the loss values are unexpected, check the run log to see whether a UE was replayed around the time the loss becomes unexpected. A replay at that point likely indicates that the UE was replayed even though the run should have been ineligible for replay. If that happens, resume the run from the latest good checkpoint and turn off UE replay.
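
One way to correlate a loss anomaly with a replay is to scan the run log for the Replaying warnings. The sketch below assumes the log is a plain text file; the file name train_run.log is a placeholder for your run's actual log.

    # Sketch: scan a run log for UE replay warnings so they can be compared
    # against the step or time at which the loss becomes unexpected.
    # The log file name is a placeholder; point it at your run's actual log.
    from pathlib import Path

    log_path = Path("train_run.log")

    for lineno, line in enumerate(log_path.read_text().splitlines(), start=1):
        if "Replaying" in line:
            print(f"line {lineno}: {line.rstrip()}")

    # If a replay warning appears shortly before the loss diverges, resume from
    # the latest good checkpoint and turn off UE replay (set the variables to 0).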

The current implementation of UE replay doesn’t catch all instances of model ineligibility.

ACTION: Before using UE replay, ensure that the model does not first read from a tensor and then overwrite that tensor during the forward or backward pass. For example, if a model has buffers that are updated using their original values, that model is ineligible for UE replay (see the sketch below). If a tensor is read and then overwritten, UE replay cannot recover the tensor's original value for the replay, and the run produces incorrect values.
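
The pattern to look for resembles the following plain PyTorch sketch, in which a running-statistics buffer is read and then overwritten in place during the forward pass. The module and buffer names are hypothetical.

    import torch
    import torch.nn as nn

    class RunningMeanBlock(nn.Module):
        """Hypothetical module showing the read-then-overwrite pattern that
        makes a model ineligible for UE replay."""

        def __init__(self, features: int):
            super().__init__()
            self.linear = nn.Linear(features, features)
            # Buffer that carries running statistics across steps.
            self.register_buffer("running_mean", torch.zeros(features))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Reads the buffer's current value, then overwrites it during the
            # forward pass. If a UE on this step were replayed, the original
            # value of running_mean could not be recovered.
            with torch.no_grad():
                self.running_mean.mul_(0.9).add_(0.1 * x.mean(dim=0))
            return self.linear(x - self.running_mean)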

In both of these cases, stop using UE replay by setting the environment variables to 0. See Turn on UE replay.

Gradient accumulation support

SambaFlow currently supports UE replay only for NLP apps that use --module_name gpt2_pretrain --task_name clm in non-data-parallel mode or normal data parallel mode.