Code elements of the inference program
The Generative NLP tutorial is available in our public GitHub repo. It includes two main code files: one for training and one for inference.
- The generative_train.py code supports compiling and training a model. As part of training, you can generate checkpoints. See Code elements of the training program.
- The generative_inference.py code supports compiling for inference and performing an inference run.
This doc page explores the generative_inference.py code.
Introduction
Inference is the process of making predictions on new data using a trained model. Our code for inference is similar to the code for training; we only made some tweaks to the generative_train.py code. You can expect to make similar tweaks to the training code for your own model.
Several pieces of the code are the same (or only slightly different, e.g. they take different inputs):
- Configure input arguments for compilation and running inference.
- Create dummy inputs for graph tracing.
- Convert the torch tensors (which come from the checkpoint) to SambaTensor instances.
Here’s where the inference model differs:
- No input data preparation for an inference model.
- No data loaders, optimizers, or training loop.
- We call compile --inference, which is faster and results in a smaller PEF than compiling for training.
- Weights are loaded both from the Hugging Face model (as before) and from the checkpoint that was created during model training.
- The model code processes the prompts that are used to make predictions (in our examples, movie reviews).
It all starts with main()
To support inference, a SambaNova model must go through both compilation and training. You have a choice:
- Theoretically, you can run inference using the output of compilation for training (the PEF file).
- However, if you compile with the --inference argument, the compiler performs only the forward pass, so the PEF file is much smaller (and compilation is faster), as sketched below.
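The --inference choice simply flows through to the compile call. The lines below are a minimal sketch of the two options, reusing the SambaSession.compile() call that appears in main() further down; they are an illustration, not a separate code path in the tutorial.
# Forward-only compilation: smaller PEF and faster compile (what --inference selects).
samba.session.compile(model, inputs, inference=True)
# Forward and backward compilation: the larger PEF used for training.
samba.session.compile(model, inputs, inference=False)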
The following table gives an overview of what main() does.
Function | Description | See |
---|---|---|
parse_app_args() | Collect the arguments coming from add_common_args() and add_run_args(). | Parse input arguments |
AutoConfig.from_pretrained(), AutoModelForCausalLM.from_pretrained() | Download the pretrained model from Hugging Face. We use the GPT-2 model in this tutorial. | Pull the pretrained models |
patch_model() | Patch some parts of the Hugging Face model to improve performance on RDU. | Improve efficiency on RDU by patching |
samba.from_torch_model_() | Convert the model to use the SambaFlow framework. The function performs some initialization and related tasks. | Convert the model to use the SambaFlow framework |
get_model_trace_inputs() | Convert torch tensor instances to SambaTensor instances. The checkpoint file that we pass in contains data that we parse to torch tensor instances and then to SambaTensor instances. We need SambaTensor instances to run inference on RDU. | Create dummy tensors for compilation |
samba.session.compile() | If users specify compile on the command line, compile the model and generate a PEF file. | Define model compilation for inference |
utils.trace_graph(), generate() | If users specify run on the command line, trace the graph against the PEF and run inference on the prompts. | Define the inference process |
Here’s the code for main().
main() for inference (compile and run)
def main(argv: List[str]) -> None:
    # Parse the args
    args = parse_app_args(argv=argv, common_parser_fn=add_common_args,
                          run_parser_fn=add_run_args)

    # Download the model from Hugging Face
    if args.config_name:
        config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
        model = AutoModelForCausalLM.from_config(config)
    elif args.model_name_or_path:
        model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
                                                     cache_dir=args.cache_dir)
    else:
        raise RuntimeError("Must provide --model_name_or_path or --config_name")

    # Patch the model here
    model = patch_model(model, args)
    samba.from_torch_model_(model)
    inputs = get_model_trace_inputs(args)

    if args.command == 'compile':
        samba.session.compile(model, inputs, app_dir=samba.utils.get_file_dir(__file__),
                              inference=args.inference)
    elif args.command == 'run':
        traced_outputs = utils.trace_graph(model, inputs, pef=args.pef)
        predictions = generate(args, model, traced_outputs)
        print(*predictions, sep=f"\n{'-' * 20}\n")

if __name__ == "__main__":
    main(sys.argv[1:])
Parse input arguments
Users can run generative_inference.py with input arguments to affect its behavior.
The list of arguments for compilation for inference is shorter than for training because we don’t worry about weights or other training-specific items. Some arguments mean something different: for training, checkpoint_name is the checkpoint to generate; during inference, checkpoint_name is the checkpoint to pass in.
Here’s the code we use to support model-specific arguments for inference:
Adding arguments for compile and run
def add_common_args(parser: argparse.ArgumentParser):
    """Adds common arguments to an ArgumentParser object

    Args:
        parser (argparse.ArgumentParser): The argument parser object to add arguments to
    """
    parser.add_argument('--model_name_or_path',
                        type=str,
                        help='Path to pretrained model or model identifier '
                        'from huggingface.co/models')
    parser.add_argument('--config_name',
                        type=str,
                        help='Path to pretrained model config or model identifier '
                        'from huggingface.co/models')
    parser.add_argument('--cache_dir',
                        type=str,
                        help='Where to store pretrained models and data downloaded '
                        'from huggingface.co')
    parser.add_argument('--max_seq_length',
                        type=int,
                        default=-1,
                        help='The maximum total input sequence length after tokenization. '
                        'Data in your data dir will be truncated or padded to this length. ')
    parser.add_argument('--examples_to_generate',
                        type=int,
                        default=20,
                        help='The number of prompts to run generation on')


def add_run_args(parser: argparse.ArgumentParser):
    """Adds arguments used at runtime to an argument parser object

    Args:
        parser (argparse.ArgumentParser): The argument parser object to add arguments to
    """
    parser.add_argument(
        '--data_dir',
        type=str,
        help='Path to a .json file, .jsonl file or a directory containing .jsonl files. '
        'Each json object should contain a "prompt" key of text used '
        'for prompting text generation.')
    parser.add_argument('--max_tokens_to_generate',
                        default=20,
                        type=int,
                        help='Maximum number of tokens to generate after each prompt.')
    parser.add_argument(
        '--checkpoint_name',
        type=str,
        default='',
        help='Path to a checkpoint containing weights with names matching those provided '
        'by the --model_name_or_path')
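These two functions are handed to SambaFlow’s parse_app_args(), whose internals are not shown in this tutorial. As a rough sketch only, the stand-in below shows one way the compile and run subcommands could be wired up with plain argparse; the real parse_app_args() also adds framework-wide arguments (for example, the PEF path) that this sketch omits.
import argparse

# Hypothetical stand-in for SambaFlow's parse_app_args(), for illustration only.
def parse_args_sketch(argv):
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest='command')
    compile_parser = subparsers.add_parser('compile')
    run_parser = subparsers.add_parser('run')
    # The tutorial's common arguments go on both subcommands...
    for sub in (compile_parser, run_parser):
        add_common_args(sub)
    # ...and the run-only arguments go on 'run'.
    add_run_args(run_parser)
    return parser.parse_args(argv)

# Example: args.command == 'run' and args.max_tokens_to_generate == 30
args = parse_args_sketch(['run', '--model_name_or_path', 'gpt2',
                          '--max_tokens_to_generate', '30'])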
Pull the pretrained models
Hugging Face supports two functions for pulling a model:
- AutoModelForCausalLM.from_pretrained() uses the Hugging Face model and its configuration.
- AutoConfig.from_pretrained() and AutoModelForCausalLM.from_config(), used together, allow us to use our own configuration file.
Pull the model
if args.model_name_or_path:
    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
                                                 cache_dir=args.cache_dir)
    model.training = True
elif args.config_name:
    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    # Read dropout rate from config
    args.dropout = config.resid_pdrop
    model = AutoModelForCausalLM.from_config(config)
For this tutorial, the configuration is stored as configuration/gpt2_small_config.json. We’ve fine-tuned those numbers to work well on the RDU architecture.
Configuration file: gpt2_small_config.json
{
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"vocab_size": 50257
}
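To build the model from this local file rather than the stock Hugging Face configuration, point --config_name at it. The snippet below is a minimal sketch of that path using the standard Hugging Face API; the variable names are illustrative.
from transformers import AutoConfig, AutoModelForCausalLM

# Load the RDU-tuned GPT-2 configuration from the local JSON file...
config = AutoConfig.from_pretrained("configuration/gpt2_small_config.json")
# ...and build the model from it. Weights are freshly initialized here; the trained
# weights are loaded later from the checkpoint passed with --checkpoint_name.
model = AutoModelForCausalLM.from_config(config)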
Improve efficiency on RDU by patching
Hugging Face supports the concept of patching a model. In our tutorial, we use patching to make the model run more efficiently on RDUs. In the gpt2_patch.py file, we’ve defined gpt2_patch_helper, which patches module forward calls within a gpt2-based transformer model.
The model would still run without patching, but the optimization improves performance. The function is the same for training and inference.
Patch the model
def patch_model(model: nn.Module, args: argparse.Namespace) -> nn.Module:
    """Patch the Hugging Face model to make it more efficient when running on RDU.

    Args:
        model (nn.Module): The Hugging Face model instance
        args (argparse.Namespace): The parsed command line args

    Returns:
        nn.Module: The patched model instance
    """
    return gpt2_patch_helper(model,
                             args.batch_size,
                             inference=args.inference,
                             max_length=args.max_seq_length,
                             max_pef_length=args.max_seq_length)
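gpt2_patch_helper() ships with the tutorial, so its implementation isn't repeated on this page. For readers new to the idea, the sketch below shows the general forward-patching pattern in isolation; it is a conceptual illustration under our own naming, not the tutorial's code.
import torch.nn as nn

def patch_forward(module: nn.Module, replacement_fn) -> nn.Module:
    """Conceptual sketch: wrap a submodule's forward() so that an RDU-friendly
    replacement runs instead, while the original stays available as a fallback."""
    original_forward = module.forward

    def patched_forward(*args, **kwargs):
        # The replacement decides what to compute and may delegate to original_forward.
        return replacement_fn(original_forward, *args, **kwargs)

    module.forward = patched_forward
    return module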
Convert the model to use the SambaFlow framework
To convert the model to use the SambaFlow framework, we call samba.from_torch_model_(). You can find the documentation in the API Reference.
Create dummy tensors for compilation
The SambaFlow compiler maps the model graph onto an RDU. The compiler traces how the model’s input tensors change shape and produces the final output tensors. For tracing, the compiler doesn’t require actual data, but does need tensors of the same shape.
We achieve this with a call to get_model_trace_inputs(), which:
- Calls samba.from_torch_tensor(), which takes torch tensors as input and generates SambaTensor instances as output. The RDU manipulates SambaTensor instances, not torch tensors.
- Performs other conversion tasks.
In contrast to the get_model_trace_inputs() function used for training, this one does not include labels but does include the attention mask.
The conversion from torch tensor to SambaTensor is a minimum requirement for any model you want to run on RDU.
Create dummy tensors for tracing
def get_model_trace_inputs(args: argparse.Namespace) -> Tuple[Any]:
    """Get input tensors to use for tracing the model.

    Since they're only used for tracing, these tensors are composed of dummy data.

    Args:
        args (argparse.Namespace): Parsed command line arguments

    Returns:
        Tuple[Any]: Inputs to use for tracing
    """
    batch_size = args.batch_size
    length = args.max_seq_length

    assert batch_size == 1, "Only batch size 1 is supported at the moment"

    # Input IDs
    input_ids = torch.randint(0, 5000, (batch_size, length)).int()
    input_ids = samba.from_torch_tensor(input_ids, name='input_ids')

    # Position IDs
    position_ids = torch.arange(length)
    position_ids = position_ids.short()
    position_ids = samba.from_torch_tensor(
        position_ids.unsqueeze(0).expand(input_ids.shape), name='input_position_ids')

    # Attention Mask
    # Prepare the attention mask for the Hugging Face Module
    attention_mask = torch.randint(2, (batch_size, length), dtype=torch.bfloat16)
    attention_mask = attention_mask[:, None, :].to(torch.float32)
    attention_mask_name = 'attention_mask'
    attention_mask = samba.from_torch_tensor(attention_mask, name=attention_mask_name)

    # Items in traced_inputs match the order of inputs to forward() for the model
    traced_inputs = (input_ids, None, attention_mask, None, position_ids, None, None, None)

    return traced_inputs
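The final comment in the function notes that the tuple follows the order of the model's forward() parameters. Assuming the standard Hugging Face GPT2LMHeadModel signature (not reproduced in this tutorial), the positions line up roughly as follows; None is passed for arguments that aren't traced.
# Positional mapping of traced_inputs onto GPT2LMHeadModel.forward(), assuming the
# standard Hugging Face signature.
traced_inputs = (
    input_ids,       # input_ids
    None,            # past_key_values
    attention_mask,  # attention_mask
    None,            # token_type_ids
    position_ids,    # position_ids
    None,            # head_mask
    None,            # inputs_embeds
    None,            # encoder_hidden_states
)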
Define model compilation for inference
SambaSession.compile() performs the compilation. See the API Reference for details.
We call compile() with the following arguments:
- model is the RDU-ready model created by samba.from_torch_model_().
- inputs are returned by get_model_trace_inputs().
- app_dir is the location of the model.
- inference is deprecated because the user input determines if we compile for inference.
Compile the model
if args.command == "compile":
samba.session.compile(
model,
inputs,
)
Define the inference process
The goal of our tutorial model is inference, the process of making predictions on new data using a trained model, and not generation, the creation of new data using a generative model. However, the GPT-2 model we are using in this tutorial uses the generate() function for either inference or generation.
To define the inference process, we call generate() with the user-specified arguments, the model, and certain outputs of compilation. We then print prediction information.
Perform inference
elif args.command == 'run':
    traced_outputs = utils.trace_graph(model, inputs, pef=args.pef)
    predictions = generate(args, model, traced_outputs)
    print(*predictions, sep=f"\n{'-' * 20}\n")
Generate function overview
The generate() function replaces the model’s internal forward call with a call to model_rdu_step(). We patch the Hugging Face forward function with model_rdu_step() so it always runs on RDU.
Here’s what the function does:
- Loads checkpoints.
- Defines a single model step on RDU (model_rdu_step()).
  - In contrast to the training loop, which defines a separate and more complex model_step() function, we define model_rdu_step() inside generate().
  - Prepares inputs by calling get_runtime_inputs().
  - In the call to samba.session.run(), runs only the forward pass to return the logits.
  - Finally, the single step returns a CausalLMOutputWithCrossAttentions object.
- Calls GenerativeDataset() to convert the validation file, which is in .jsonl format, into a Torch dataset.
- Iterates over the validation dataset and returns tensors.
- As part of each iteration, decodes the generated tokens to generate text output. In the output, a sentiment has been assigned to each movie review. See Compile, fine-tune, and perform inference with a Hugging Face GPT model for example output.
Generate function
def generate(args: argparse.Namespace, model: nn.Module, traced_outputs: Tuple[SambaTensor]):
    """Generate some outputs from the model, hooking into the Hugging Face generate function.

    Args:
        args (argparse.Namespace): The parsed command line arguments
        model (nn.Module): The transformer model instance
        traced_outputs (Tuple[SambaTensor]): Output tensors generated by the tracing process

    Returns:
        List[str]: A list of predictions from the model
    """
    # Load the checkpoint
    if args.checkpoint_name:
        load_checkpoint(model, args.checkpoint_name)

    # Define the internal forward pass in terms of session.run
    def model_rdu_step(self, *input, **kwargs):
        input_id_length = kwargs['input_ids'].shape[1]
        samba_inputs = get_runtime_inputs(kwargs, args.max_seq_length)
        output_logits = samba.session.run(input_tensors=samba_inputs,
                                          output_tensors=traced_outputs,
                                          hyperparam_dict={'p': 0.0},
                                          section_types=['fwd'])[0]
        logits = samba.to_torch(output_logits)[:, :input_id_length, :].float()
        return CausalLMOutputWithCrossAttentions(loss=None, logits=logits)

    # Replace the model's internal forward call with the RDU step call so
    # model_rdu_step is automatically called during generation.
    # The Hugging Face model generate function calls the model's forward function
    # to generate text. This function runs the model on CPU.
    # To make it run on RDU, we patch the forward function with model_rdu_step
    base_model_class = model.__class__
    base_model_class.torch_call = base_model_class.__call__
    base_model_class.__call__ = model_rdu_step

    # Make a tokenizer. The model checkpoint folder has vocab.json (tokenizer info)
    # and merges.txt files
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)

    # Make a dataset from a .jsonl file or folder of .jsonl files
    dataset = GenerativeDataset(args.data_dir)
    predictions = []

    # Generate predictions from the model
    for k, example in enumerate(dataset):
        if k >= args.examples_to_generate:
            break
        # Tokenize inputs
        model_inputs = tokenizer(example['prompt'], return_tensors='pt')
        input_ids = model_inputs['input_ids']
        input_length = input_ids.shape[-1]

        # Hook into HF model.generate to generate predictions.
        # The above call patching will ensure the model runs on RDU
        generated_ids = model.generate(model_inputs['input_ids'],
                                       max_length=input_length + args.max_tokens_to_generate)
        generated_text = tokenizer.decode(generated_ids.squeeze(0))
        predictions.append(generated_text)

    return predictions
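GenerativeDataset also ships with the tutorial and isn't reproduced on this page. Based on the --data_dir help text above (each JSON object carries a "prompt" key), a minimal stand-in could look like the sketch below; the real class may differ.
import json
from pathlib import Path
from torch.utils.data import Dataset

class GenerativeDatasetSketch(Dataset):
    """Hypothetical stand-in: reads .jsonl files whose lines each hold a "prompt" key."""

    def __init__(self, data_dir: str):
        path = Path(data_dir)
        files = [path] if path.is_file() else sorted(path.glob('*.jsonl'))
        self.examples = []
        for file in files:
            with open(file) as f:
                for line in f:
                    if line.strip():
                        self.examples.append(json.loads(line))  # e.g. {"prompt": "..."}

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]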
get_runtime_inputs() function
As part of generate(), we call get_runtime_inputs(). This function converts torch tensors to SambaTensor instances by performing these tasks:
- Creates input IDs
- Creates the attention mask
- Creates position IDs
get_runtime_inputs() function
def get_runtime_inputs(inputs: Dict[str, List[Any]],
                       max_seq_length: int) -> Sequence[Optional[samba.SambaTensor]]:
    """Given inputs from the dataset, create inputs for samba.session.run.

    These inputs must be the same dtype and shape as the compile inputs

    Args:
        inputs (Dict[str, List[Any]]): Inputs from the data loader
        max_seq_length (int): The max sequence length that the PEF supports

    Returns:
        Sequence[Optional[samba.SambaTensor]]: The named input tensors to use
            in running the model
    """
    # Create input_ids
    input_ids = inputs['input_ids']
    # Pad the inputs to the appropriate max sequence length
    input_ids = F.pad(input_ids, (0, max_seq_length - input_ids.shape[1]))
    input_ids = samba.from_torch_tensor(input_ids.int(), name="input_ids")

    # Create attention_mask
    attention_mask = inputs['attention_mask']
    attention_mask = F.pad(attention_mask, (0, max_seq_length - attention_mask.shape[1]))
    attention_mask = attention_mask[:, None, :].to(torch.float32)
    attention_mask = samba.from_torch_tensor(attention_mask, name='attention_mask')

    # Create position_ids
    position_ids_torch = torch.arange(max_seq_length).short()
    position_ids = samba.from_torch_tensor(
        position_ids_torch.unsqueeze(0).expand(input_ids.shape), name='input_position_ids')

    # Runtime traced inputs match the compile time inputs
    traced_inputs = (input_ids, None, attention_mask, None, position_ids, None, None, None)

    return traced_inputs
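To make the shape handling concrete, here is a short usage sketch. It assumes a Hugging Face tokenizer output (as produced inside generate() above) and a PEF compiled with a max sequence length of 128; everything except get_runtime_inputs() itself is illustrative.
# Tokenize a short prompt; input_ids and attention_mask have shape (1, n) here.
model_inputs = tokenizer("The movie was great", return_tensors='pt')

# Pad both tensors to the sequence length the PEF was compiled for and wrap them
# as named SambaTensors; the other tuple positions stay None, as at compile time.
samba_inputs = get_runtime_inputs(model_inputs, max_seq_length=128)
# samba_inputs[0].shape == (1, 128)     input_ids, zero-padded
# samba_inputs[2].shape == (1, 1, 128)  attention_mask, float32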