Backpropagation best practices

This doc page discusses two topics that are relevant in the context of backpropagation.

  • First you learn how to control which tokens are back-propagated on during training.

  • Then you learn about output grad initialization and how it differs between PyTorch and SambaFlow.

Control what tokens are back-propagated on during training

SambaNova has defined "token type ids", which assign a type to each token, and has modified the GPT training code so that the token type ids associated with each token control which tokens the model learns to generate and which tokens the model only attends to without learning to generate them.

In our data preparation package, available in our public GitHub repository, we always add token type id metadata for each token. This metadata tells the training code to learn to generate the tokens in the completion and to attend to the tokens in the prompt without learning to generate them.

This document explains how to customize which tokens are and are not back-propagated on by manually changing the token type ids of the data. The document uses a genetic code example, but the approach applies to other situations as well.

Data file format

When you compile or run a model on SambaNova hardware, you’re expected to pass in HDF5 files that each contain two datasets:

input_ids: sequences of token ids

token_type_ids: describes the type of each token. The default id assignments are listed below; a minimal loss-scaling sketch follows the list:

  • id=0 for tokens in the prompt. These tokens’ gradient step will be scaled by prompt_loss_weight.

    • If prompt_loss_weight is 0.0, then the tokens will not be learned.

    • If prompt_loss_weight is 0.1, then the tokens will be learned less than other tokens.

    • If prompt_loss_weight is 1.0, then the tokens will be learned just as much as other tokens.

  • id=1 for tokens in the completion. These tokens will be learned, that is, back-propagated on.

  • id=2 for <eos> tokens that serve as padding tokens. These tokens will never be back-propagated on.

  • id=3 for <eos> tokens at the end of articles that serve as separators. These tokens will always be back-propagated on.
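
The mapping below is a minimal sketch, not SambaFlow’s implementation, of the loss scale that each default token type id implies for a given prompt_loss_weight:

prompt_loss_weight = 0.1   # hypothetical value chosen for illustration

# Loss scale implied by each default token type id
loss_scale_by_token_type = {
    0: prompt_loss_weight,  # prompt tokens: scaled by prompt_loss_weight
    1: 1.0,                 # completion tokens: always learned
    2: 0.0,                 # <eos> padding tokens: never learned
    3: 1.0,                 # <eos> separator tokens: always learned
}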

Controlling backpropagation during training

If your dataset is prepared with the proper token type ids as described above, you need to include the flags in the section below to enable control over which tokens are back-propagated on during training. If you do not include any of these flags, then all tokens are learned regardless of their token_type_ids.

Training command arguments

You specify whether to use token type id information when you run your model in training mode. If you include --use_token_type_ids, then the specified prompt_loss_weight is used. A quick way to check whether a data file actually contains token_type_ids is shown after the list below.

  • --use_token_type_ids

    • Case 1: No token_type_ids in the dataset. You do not specify --use_token_type_ids during training. As a result, all tokens are backpropagated and learned.

    • Case 2: There are token_type_ids in the dataset. You do not specify --use_token_type_ids during training. As a result, all tokens are backpropagated and learned.

    • Case 3: No token_type_ids in dataset, and you specify --use_token_type_ids during training. An error results.

    • Case 4: token_type_ids in dataset and you specify --use_token_type_ids during training. Tokens are backpropagated according to --prompt_loss_weight, as discussed next.

  • --prompt_loss_weight. The weight multiplier for the loss on the prompt tokens (tokens with ID=0).

    • With prompt_loss_weight=0.0 the model does not learn to generate prompt tokens.

    • With prompt_loss_weight=0.1 the model slightly learns to generate prompt tokens.

    • With prompt_loss_weight=1.0 the model learns to generate the prompts just as much as the completion.
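
Because Case 3 produces an error, it can help to confirm that a data file actually contains token_type_ids before passing --use_token_type_ids. Below is a minimal sketch using h5py; the file name is hypothetical:

import h5py

with h5py.File("train_0.hdf5", "r") as f:   # hypothetical file name
    if "token_type_ids" in f:
        print("token_type_ids present: --use_token_type_ids can be used")
    else:
        print("token_type_ids missing: omit --use_token_type_ids to avoid an error")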

How to add token_type_ids to HDF5 files

You can add token_type_ids to HDF5 files with code like the following (a sketch of how to construct the token_type_ids array itself appears after the notes below):

import h5py

f = h5py.File(file_path, "a")  # open the existing HDF5 file in read/write mode
f.create_dataset("token_type_ids",
                 data=token_type_ids,
                 dtype='i4',
                 compression='gzip',
                 maxshape=(None, max_seq_length))
f.close()
  • token_type_ids is an array of the same shape as input_ids.

  • Each entry in token_type_ids is the token type id of the token at the same index in the input_ids array.
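
As a reference point, here is a minimal sketch of constructing input_ids and token_type_ids for one padded prompt/completion sequence, assuming the default id assignments. The token values mirror Example 2 below; this is an illustration, not the data preparation package’s code:

max_seq_length = 11
prompt_ids     = [9, 5, 6]      # attended to, not learned (token_type_id = 0)
completion_ids = [10, 7, 8]     # learned (token_type_id = 1)
sep_id, pad_id = 2, 69          # separator and padding token ids from Example 2

n_pad = max_seq_length - len(prompt_ids) - len(completion_ids) - 1

input_ids = prompt_ids + completion_ids + [sep_id] + [pad_id] * n_pad
token_type_ids = ([0] * len(prompt_ids)        # prompt
                  + [1] * len(completion_ids)  # completion
                  + [3]                        # separator: always learned
                  + [2] * n_pad)               # padding: never learned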

Examples

Let’s look at some examples.

Example 1

The first example uses the following tokenizer, input text, and tokenized text:

Tokenizer: {"SEP":2, GGC":5,"GCC":6,"ATC":7,"GAC":8,"GAA":9,"ATG":10}

Input text → GAA GGC GCC ATG ATC GAC SEP

Tokenized text “input_ids” = [9, 5, 6, 10, 7, 8, 2]

Assume you use this example with the following token type ids and prompt loss weight:

token_type_ids = [0, 0, 0, 1, 1, 1, 3]

--prompt_loss_weight = 0.0

The model will backpropagate and “learn” to generate tokens [10, 7, 8, 2]. It will not learn to generate [9, 5, 6], but these tokens will be in the context of the model when it is learning to generate [10, 7, 8, 2].

In the future, when the model sees tokens [9, 5, 6] it is more likely to generate the sequence [10, 7, 8, 2].

Assume you use this example with the following token type ids and prompt loss weight:

token_type_ids = [1, 1, 0, 0, 1, 1, 3]

--prompt_loss_weight = 0.0

The model will backpropagate and learn to generate tokens [9, 5] and [7, 8, 2]. It will not learn to generate [6, 10], but these tokens will be in the context of the model when it is learning to generate [7, 8, 2].
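
As a quick sanity check, the effective per-token loss scales for both settings in this example can be reproduced with a few lines of Python, using the default id-to-scale mapping (illustration only):

prompt_loss_weight = 0.0
scale_of = {0: prompt_loss_weight, 1: 1.0, 2: 0.0, 3: 1.0}

input_ids = [9, 5, 6, 10, 7, 8, 2]

for token_type_ids in ([0, 0, 0, 1, 1, 1, 3], [1, 1, 0, 0, 1, 1, 3]):
    scales = [scale_of[t] for t in token_type_ids]
    learned = [tok for tok, s in zip(input_ids, scales) if s > 0]
    print(token_type_ids, "-> learned tokens:", learned)
    # First setting  -> learned tokens: [10, 7, 8, 2]
    # Second setting -> learned tokens: [9, 5, 7, 8, 2]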

Example 2

The second example uses the following tokenizer, input text, and tokenized text:

Tokenizer: {"SEP":2, GGC":5,"GCC":6,"ATC":7,"GAC":8,"GAA":9,"ATG":10,"ATG":69}

Input text → GAA GGC GCC ATG ATC GAC SEP PAD PAD PAD PAD

Tokenized text “input_ids” = [9, 5, 6, 10, 7, 8, 2, 69, 69, 69, 69]

Assume you use this example with the following token type ids and prompt loss weight:

token_type_ids = [0, 0, 0, 1, 1, 1, 3, 2, 2, 2, 2]

--prompt_loss_weight = 0.0

The model will backpropagate and learn to generate tokens [10, 7, 8, 2]. It will not learn to generate [9, 5, 6], but these tokens will be in the context of the model when it is learning to generate [10, 7, 8, 2].

In the future, when the model sees tokens [9, 5, 6] it is more likely to generate the sequence [10, 7, 8, 2]. It will not back-propagate on the PAD tokens and, as a result, will not learn to generate the PAD token.

Training code

Below are the file and function in the training infrastructure that control which tokens are learned.

File: /opt/sambaflow/apps/nlp/transformers_on_rdu/tasks/utils/common_utils.py

Refer to function loss_scale_of_crossentropy.

The example code assumes that you’ve set up your data to specify how to treat different token_type_ids.
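
For orientation, the snippet below is a simplified sketch of how per-token loss scaling based on token_type_ids can be applied to a cross-entropy loss. It is an illustration of the idea, not the loss_scale_of_crossentropy implementation:

import torch
import torch.nn.functional as F

def scaled_cross_entropy(logits, labels, token_type_ids, prompt_loss_weight):
    # Per-token cross-entropy, shape (batch_size * seq_len,)
    per_token_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction='none')

    # Default mapping: 0 -> prompt_loss_weight, 1 -> 1.0, 2 -> 0.0, 3 -> 1.0
    scale_map = torch.tensor([prompt_loss_weight, 1.0, 0.0, 1.0])
    scales = scale_map[token_type_ids.view(-1)]

    # Weighted mean over the tokens that actually contribute to the loss
    return (per_token_loss * scales).sum() / scales.sum().clamp(min=1.0)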

Perform output grad initialization

When SambaFlow traces a graph for training, it traces only the forward (inference) portion of the graph and depends on the compiler to automatically construct the backpropagation graph. This is because a PyTorch model typically defines only the forward graph; PyTorch maintains the graph and computes backpropagation dynamically with its own autograd.

In PyTorch, the process usually looks like this:

model = PyTorchModel()  # Instantiate the model with trainable parameters
inputs = get_inputs()   # Create some inputs

outputs = model(inputs) # Compute the result of the forward graph
outputs[0].backward()   # Indicate that backpropagation should occur from the
                        # first output, with the default value of 1.0

The backward computation graph depends on the output that you decide to call .backward() from. You can call .backward() from one or multiple outputs, with either the default value of 1.0 or a scalar/tensor value that you provide. If you backpropagate from multiple outputs, the gradients are accumulated according to the structure of the forward graph. This makes the backward graph construction somewhat dynamic: it can differ based on how it’s called.
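
For example, in plain PyTorch you can backpropagate from two outputs of the same forward graph with different grad values, and the gradients accumulate in the leaf tensor (a small standalone illustration, not SambaFlow code):

import torch

x = torch.ones(3, requires_grad=True)
h = x * 2                                # shared intermediate
out_a = h.sum()                          # first output
out_b = (h * h).sum()                    # second output

out_a.backward(retain_graph=True)        # default grad value of 1.0
out_b.backward(torch.tensor(0.5))        # user-provided scalar grad value

print(x.grad)                            # tensor([6., 6., 6.]): 2*1.0 + 8*0.5 per element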

Because SambaFlow doesn’t know which outputs you will decide to backpropagate from, the compiler has to assume that all outputs may be backpropagated from. Therefore, the default behavior is to construct the backward graph assuming every output may be provided a grad value to backpropagate with, so every output is assigned an associated output grad tensor. An output grad tensor is essentially an input to the backward graph that may be provided a value, just like any input to the forward graph may be provided a value.

After the compiler has generated a PEF in training mode with these output grad tensors, the values are initialized to all zeros after runtime context initialization and can then be set to any value. You can do this either by setting the value directly via the SambaTensor.sn_grad API or via the init_output_grads API:

outputs = trace_graph(...)

# Using the .sn_grad API:
outputs[0].sn_grad = torch.ones(*outputs[0].shape)           # Set the first output's grad values to 1

# Using the init_output_grads API:
from sambaflow.samba.utils import init_output_grads
output_grads = [torch.randn(*out.shape) for out in outputs]  # Create gradient tensor for each output
init_output_grads(outputs, output_grads)                     # Set each output's output grad value

# Run the graph
samba.session.run(...)                                       # When the backward graph runs, it will use the values set above as inputs