Control what tokens are back-propagated on during training

SambaNova has defined "token type ids", which assign a type to each token, and has modified the GPT training code so that a token's token type id controls whether the model learns to generate that token or only attends to it without learning to generate it.

Our data preparation package, available in our public GitHub repository, always adds token type id metadata for each token. This metadata tells the training code to learn to generate the tokens in the completion and to attend to the tokens in the prompt without learning to generate them.

This document explains how to customize which tokens are back-propagated on by manually changing the token type ids in your data. The document uses a genetic sequence example, but the technique applies to other situations as well.

Data file format

When you compile or run a model on SambaNova hardware, you’re expected to pass in hdf5 files that each contain two datasets:

input_ids: sequences of token ids

token_type_ids: describes the type of each token (see the loss-scaling sketch after this list). The default id assignments are:

  • id=0 for tokens in the prompt. These tokens' gradient step will be scaled by prompt_loss_weight.

    • If prompt_loss_weight is 0.0, the tokens will not be learned.

    • If prompt_loss_weight is 0.1, the tokens will be learned less than other tokens.

    • If prompt_loss_weight is 1.0, the tokens will be learned just as much as other tokens.

  • id=1 for tokens in the completion. These tokens will be learned, that is, back-propagated on.

  • id=2 for <eos> tokens that serve as padding tokens. These tokens will never be back-propagated on.

  • id=3 for <eos> tokens at the end of articles that serve as separators. These tokens will always be back-propagated on.
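
To make these rules concrete, here is a minimal sketch, assuming a NumPy environment, that maps each token type id to the per-token loss multiplier described above. The helper name loss_multipliers is hypothetical and is not part of the SambaNova training code.

import numpy as np

# Hypothetical helper: map token type ids to the loss multipliers described above.
def loss_multipliers(token_type_ids, prompt_loss_weight):
    token_type_ids = np.asarray(token_type_ids)
    weights = np.ones_like(token_type_ids, dtype=np.float32)
    weights[token_type_ids == 0] = prompt_loss_weight  # prompt tokens: scaled by prompt_loss_weight
    weights[token_type_ids == 1] = 1.0                 # completion tokens: always learned
    weights[token_type_ids == 2] = 0.0                 # padding <eos> tokens: never learned
    weights[token_type_ids == 3] = 1.0                 # article-separator <eos> tokens: always learned
    return weights

print(loss_multipliers([0, 0, 0, 1, 1, 1, 3], prompt_loss_weight=0.0))
# [0. 0. 0. 1. 1. 1. 1.]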

Controlling backpropagation during training

If your dataset is prepared with the token type ids described above, include the flags in the section below to control which tokens are back-propagated on during training. If you do not include any of these flags, all tokens are learned regardless of their token_type_ids.

Training command arguments

You specify whether to use token type id information when you run your model in training mode. If you include --use_token_type_ids, then the specified prompt_loss_weight is used.

  • --use_token_type_ids

    • Case 1: The dataset has no token_type_ids and you do not specify --use_token_type_ids during training. As a result, all tokens are backpropagated and learned.

    • Case 2: The dataset has token_type_ids but you do not specify --use_token_type_ids during training. As a result, all tokens are backpropagated and learned.

    • Case 3: The dataset has no token_type_ids and you specify --use_token_type_ids during training. An error results.

    • Case 4: The dataset has token_type_ids and you specify --use_token_type_ids during training. Tokens are backpropagated according to --prompt_loss_weight, as discussed next.

  • --prompt_loss_weight. The weight multiplier for the loss on the prompt tokens (tokens with id=0).

    • With prompt_loss_weight=0.0 the model does not learn to generate prompt tokens.

    • With prompt_loss_weight=0.1 the model slightly learns to generate prompt tokens.

    • With prompt_loss_weight=1.0 the model learns to generate the prompts just as much as the completion.

How to add token_type_ids to hdf5 files

You can add token_type_ids to hdf5 files with code like the following:

import h5py

# Open the existing hdf5 file in append mode so a new dataset can be added.
f = h5py.File(file_path, "a")
f.create_dataset("token_type_ids",
                 data=token_type_ids,
                 dtype='i4',
                 compression='gzip',
                 maxshape=(None, max_seq_length))
f.close()
  • token_type_ids is an array of the same shape as input_ids

  • each entry in token_type_ids is the token type id for the token at the same index in the input_ids array
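
Putting the pieces together, the following is a minimal sketch that writes both required datasets to a new hdf5 file. The file name example.hdf5, the max_seq_length value, and the toy sequence (taken from Example 2 below) are illustrative assumptions, not prescribed values.

import numpy as np
import h5py

max_seq_length = 11  # illustrative value

# One toy sequence, padded to max_seq_length (mirrors Example 2 below).
input_ids = np.array([[9, 5, 6, 10, 7, 8, 2, 69, 69, 69, 69]], dtype='i4')
token_type_ids = np.array([[0, 0, 0, 1, 1, 1, 3, 2, 2, 2, 2]], dtype='i4')

with h5py.File("example.hdf5", "w") as f:
    f.create_dataset("input_ids", data=input_ids, dtype='i4',
                     compression='gzip', maxshape=(None, max_seq_length))
    f.create_dataset("token_type_ids", data=token_type_ids, dtype='i4',
                     compression='gzip', maxshape=(None, max_seq_length))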

Examples

Let’s look at some examples.

Example 1

The first example uses the following tokenizer, input text, and tokenized text:

Tokenizer: {"SEP":2, GGC":5,"GCC":6,"ATC":7,"GAC":8,"GAA":9,"ATG":10}

Input text → GAA GGC GCC ATG ATC GAC SEP

Tokenized text (input_ids) = [9, 5, 6, 10, 7, 8, 2]

Assume you use this example with the following token type ids and prompt loss weight:

token_type_ids = [0, 0, 0, 1, 1, 1, 3]

--prompt_loss_weight = 0.0

The model will backpropagate and “learn” to generate tokens [10, 7, 8, 2]. It will not learn to generate [9, 5, 6], but these tokens will be in the context of the model when it is learning to generate [10, 7, 8, 2].

In the future, when the model sees tokens [9, 5, 6] it is more likely to generate the sequence [10, 7, 8, 2].
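
One way to produce token type ids like these is to derive them from the prompt/completion split. The helper below is a hypothetical sketch, not part of the SambaNova data preparation package.

# Hypothetical helper: build token_type_ids from the lengths of the prompt and completion.
def build_token_type_ids(prompt_len, completion_len, with_separator=True):
    ids = [0] * prompt_len          # prompt tokens: id=0
    ids += [1] * completion_len     # completion tokens: id=1
    if with_separator:
        ids.append(3)               # article-separator <eos>: id=3
    return ids

print(build_token_type_ids(prompt_len=3, completion_len=3))
# [0, 0, 0, 1, 1, 1, 3]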

Assume you use this example with the following token type ids and prompt loss weight:

token_type_ids = [1, 1, 0, 0, 1, 1, 3]

--prompt_loss_weight = 0.0

The model will backpropagate and learn to generate tokens [9, 5] and [7, 8, 2]. It will not learn to generate [6, 10], but these tokens will be in the context of the model when it is learning to generate [7, 8, 2].

Example 2

The second example uses the following tokenizer, input text, and tokenized text:

Tokenizer: {"SEP":2, GGC":5,"GCC":6,"ATC":7,"GAC":8,"GAA":9,"ATG":10,"ATG":69}

Input text → GAA GGC GCC ATG ATC GAC SEP PAD PAD PAD PAD

Tokenized text (input_ids) = [9, 5, 6, 10, 7, 8, 2, 69, 69, 69, 69]

Assume you use this example with the following token type ids and prompt loss weight:

token_type_ids = [0, 0, 0, 1, 1, 1, 3, 2, 2, 2, 2]

--prompt_loss_weight = 0.0

The model will backpropagate and learn to generate tokens [10, 7, 8, 2]. It will not learn to generate [9, 5, 6], but these tokens will be in the context of the model when it is learning to generate [10, 7, 8, 2].

In the future, when the model sees tokens [9, 5, 6], it is more likely to generate the sequence [10, 7, 8, 2]. The model will not back-propagate on the PAD tokens and, as a result, will not learn to generate the PAD token.
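
The padding in this example can be generated mechanically. The sketch below is a hypothetical helper that pads a sequence to a target length with a PAD id and token type id 2; the pad_id of 69 simply mirrors this example's tokenizer and is not a standard value.

# Hypothetical helper: pad a sequence and its token type ids to max_seq_length.
def pad_sequence(input_ids, token_type_ids, max_seq_length, pad_id=69):
    n_pad = max_seq_length - len(input_ids)
    return (input_ids + [pad_id] * n_pad,
            token_type_ids + [2] * n_pad)  # id=2: never back-propagated on

ids, types = pad_sequence([9, 5, 6, 10, 7, 8, 2], [0, 0, 0, 1, 1, 1, 3], max_seq_length=11)
print(ids)    # [9, 5, 6, 10, 7, 8, 2, 69, 69, 69, 69]
print(types)  # [0, 0, 0, 1, 1, 1, 3, 2, 2, 2, 2]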

Training code

Below are the file and function in the training infrastructure that control which tokens are learned.

File: /opt/sambaflow/apps/nlp/transformers_on_rdu/tasks/utils/common_utils.py

Refer to function loss_scale_of_crossentropy.

The example code assumes that you’ve set up your data to specify how to treat different token_type_ids.
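
As a conceptual illustration only, and not the contents of loss_scale_of_crossentropy or the actual SambaFlow implementation, a per-token loss scale can be folded into a cross-entropy loss roughly like this:

import numpy as np

# Conceptual sketch: scale each token's cross-entropy term by a weight derived
# from its token_type_id, then average over the tokens that are actually learned.
def weighted_cross_entropy(per_token_ce, token_type_ids, prompt_loss_weight):
    per_token_ce = np.asarray(per_token_ce, dtype=np.float32)
    scale = {0: prompt_loss_weight, 1: 1.0, 2: 0.0, 3: 1.0}
    weights = np.array([scale[t] for t in token_type_ids], dtype=np.float32)
    denom = max(float(weights.sum()), 1.0)  # assumption: normalize by total weight of learned tokens
    return float((weights * per_token_ce).sum() / denom)

# Toy per-token losses for the 7-token sequence from Example 1.
print(weighted_cross_entropy([2.1, 1.9, 2.3, 0.8, 0.7, 0.9, 0.5],
                             [0, 0, 0, 1, 1, 1, 3],
                             prompt_loss_weight=0.0))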