Controlling What Tokens Are Back-propagated on During Training

Data File Format

When you compile or run a model on SambaNova hardware, you’re expected to pass in files that each contain two datasets:

  • input_ids: sequences of token IDs

  • token_type_ids: describes the type of each token. The default id assignments are as follows (a short sketch after the list shows how they map onto a sequence):

    • id=0 for tokens in the prompt. These tokens' gradient step is scaled by prompt_loss_weight: if prompt_loss_weight is 0.0, these tokens are not learned; if it is 0.1, they are learned less than other tokens; if it is 1.0, they are learned just as much as other tokens.

    • id=1 for tokens in the completion. These tokens are always back-propagated on.

    • id=2 for <eos> tokens that serve as padding tokens. These tokens are never back-propagated on (if you specify --use_token_type_ids).

    • id=3 for <eos> tokens at the end of articles that serve as separators. These tokens are always back-propagated on.
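
As an illustration of how these IDs line up with a sequence, the sketch below builds both datasets for one toy example; the token IDs and lengths are made up for illustration only.

# Minimal sketch of one sequence and its token type IDs (hypothetical token IDs).
prompt_tokens     = [9, 5, 6]    # prompt tokens     -> token_type_id 0
completion_tokens = [10, 7, 8]   # completion tokens -> token_type_id 1
eos_id            = 2            # <eos>, used as separator (id=3) and as padding (id=2)
pad_len           = 4

input_ids = prompt_tokens + completion_tokens + [eos_id] + [eos_id] * pad_len
token_type_ids = (
    [0] * len(prompt_tokens)        # scaled by --prompt_loss_weight
    + [1] * len(completion_tokens)  # always back-propagated on
    + [3]                           # separator <eos>, always back-propagated on
    + [2] * pad_len                 # padding <eos>, never back-propagated on
)

assert len(token_type_ids) == len(input_ids)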

Controlling Backpropagation During Training

You control which tokens are back-propagated on, and how strongly, with the training command arguments described below.

Training Command Arguments

  • --use_token_type_ids

    • Case 1: No token_type_ids in the dataset, and you do not specify --use_token_type_ids during training. As a result, all tokens are back-propagated on and learned.

    • Case 2: token_type_ids are in the dataset, but you do not specify --use_token_type_ids during training. As a result, all tokens are back-propagated on and learned.

    • Case 3: No token_type_ids in the dataset, and you specify --use_token_type_ids during training → the run errors out.

    • Case 4: token_type_ids are in the dataset, and you specify --use_token_type_ids during training → tokens are back-propagated on according to --prompt_loss_weight, defined below.

  • --prompt_loss_weight: the weight multiplier for the loss on the prompt tokens (tokens with id=0). See the conceptual sketch after this list.

    • With prompt_loss_weight=0.0 the model does not learn to generate prompt tokens.

    • With prompt_loss_weight=0.1 the model slightly learns to generate prompt tokens.

    • With prompt_loss_weight=1.0 the model learns to generate the prompts just as much as the completion.
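
Conceptually, these two arguments combine into a per-token multiplier on the cross-entropy loss. The following sketch is only a mental model of that behavior, not the SambaFlow implementation (the real logic is in the loss-scaling code referenced further below); the function name and tensor shapes are invented for illustration.

import torch

def scale_loss_by_token_type(per_token_loss, token_type_ids, prompt_loss_weight):
    # per_token_loss: float tensor [batch, seq_len] of unreduced cross-entropy values
    # token_type_ids: int tensor [batch, seq_len] with values in {0, 1, 2, 3}
    weights = torch.ones_like(per_token_loss)
    weights[token_type_ids == 0] = prompt_loss_weight  # prompt tokens: scaled
    weights[token_type_ids == 2] = 0.0                 # padding <eos>: never learned
    # token types 1 (completion) and 3 (separator <eos>) keep weight 1.0
    return (per_token_loss * weights).sum() / weights.sum().clamp(min=1.0)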

How to Add token_type_ids to HDF5 Files

Add code as shown below, where token_type_ids is an array with the same shape as input_ids, and each entry in token_type_ids gives the type id of the token at the same index in input_ids.

f.create_dataset("token_type_ids", data=token_type_ids, dtype='i4', compression='gzip', maxshape=(None, max_seq_length))
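
For context, here is a minimal end-to-end sketch that adds the dataset to an existing file. It assumes the h5py package and a file that already contains input_ids; the file name, sequence layout, and lengths are placeholders to adapt to your data.

import h5py
import numpy as np

max_seq_length = 11                                   # placeholder; must match input_ids

with h5py.File("train_0.hdf5", "a") as f:             # placeholder file name
    input_ids = f["input_ids"][:]                     # shape (num_sequences, max_seq_length)

    # Build token_type_ids with the same shape as input_ids. Here every sequence is
    # assumed to be 3 prompt tokens, 3 completion tokens, one separator <eos>, and
    # padding for the rest -- adapt this to your actual data layout.
    token_type_ids = np.full(input_ids.shape, 2, dtype=np.int32)  # default: padding
    token_type_ids[:, :3] = 0                                     # prompt
    token_type_ids[:, 3:6] = 1                                     # completion
    token_type_ids[:, 6] = 3                                       # separator <eos>

    f.create_dataset("token_type_ids", data=token_type_ids, dtype='i4',
                     compression='gzip', maxshape=(None, max_seq_length))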

Examples

Tokenizer: {"SEP":2, GGC":5,"GCC":6,"ATC":7,"GAC":8,"GAA":9,"ATG":10}

Input text → GAA GGC GCC ATG ATC GAC SEP

Tokenized text input_ids = [9, 5, 6, 10, 7, 8, 2]

token_type_ids = [0, 0, 0, 1, 1, 1, 3] with --prompt_loss_weight = 0.0 will back-propagate on and "learn" to generate tokens [10, 7, 8, 2]. It will not learn to generate [9, 5, 6], but these tokens will be in the model's context while it is learning to generate [10, 7, 8, 2]. This means that in the future, when the model sees tokens [9, 5, 6], it is more likely to generate the sequence [10, 7, 8, 2].

token_type_ids = [1, 1, 0, 0, 1, 1, 3] with --prompt_loss_weight = 0.0 will back-propagate on and learn to generate tokens [9, 5] and [7, 8, 2]. It will not learn to generate [6, 10], but these tokens will be in the model's context while it is learning to generate [7, 8, 2].

Tokenizer: {"SEP":2, GGC":5,"GCC":6,"ATC":7,"GAC":8,"GAA":9,"ATG":10,"ATG":69} Input text → GAA GGC GCC ATG ATC GAC PAD PAD PAD PAD Tokenized text “Input_ids” = [9, 5, 6, 10, 7, 8, 2, 69,69,69,69]

token_type_ids = [0, 0, 0, 1, 1, 1, 3, 2, 2, 2, 2] with --prompt_loss_weight = 0.0 will back-propagate on and learn to generate tokens [10, 7, 8, 2]. It will not learn to generate [9, 5, 6], but these tokens will be in the model's context while it is learning to generate [10, 7, 8, 2]. This means that in the future, when the model sees tokens [9, 5, 6], it is more likely to generate the sequence [10, 7, 8, 2]. It will not back-propagate on the PAD tokens and, as a result, will not learn to generate the PAD token.
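
These examples can be checked mechanically. The short sketch below reproduces which tokens of the padded example contribute to the loss; it is a worked illustration, not library code.

prompt_loss_weight = 0.0
input_ids      = [9, 5, 6, 10, 7, 8, 2, 69, 69, 69, 69]
token_type_ids = [0, 0, 0,  1, 1, 1, 3,  2,  2,  2,  2]

# A token is learned if it is a completion (1) or separator (3) token,
# or a prompt (0) token with a non-zero prompt_loss_weight.
learned = [
    tok for tok, typ in zip(input_ids, token_type_ids)
    if typ in (1, 3) or (typ == 0 and prompt_loss_weight > 0.0)
]
print(learned)  # [10, 7, 8, 2] -- the prompt and PAD tokens are excluded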

Code Where Loss Is Scaled

/opt/sambaflow/apps/nlp/transformers_on_rdu/tasks/utils/common_utils.py

Refer to function loss_scale_of_crossentropy

How to Store Checkpoints

In the run command, specify --save_steps N to save a checkpoint every N steps.

How to Restore Checkpoints

  1. Pass in the checkpoint with --model_name_or_path CKPT_DIR. --config_name is then not required.

  2. The CKPT_DIR must contain one config.json and one pytorch_model.bin (see the sketch below).
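
As an illustration of why those two files are required, the following sketch loads the checkpoint directory with the standard Hugging Face transformers API; the path is a placeholder, and the causal-LM head is only an example of a model class.

from transformers import AutoConfig, AutoModelForCausalLM

ckpt_dir = "/path/to/CKPT_DIR"                           # placeholder path

config = AutoConfig.from_pretrained(ckpt_dir)            # reads config.json
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)   # reads pytorch_model.bin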