Hello SambaFlow! Compile and run a model
Welcome! In this tutorial, you learn how to compile and run the logreg.py example model. We use the classic machine learning problem of recognizing hand-written digits.
In this tutorial you:
- Ensure that your environment is ready to compile and run models.
- Compile the model to run on the RDU architecture. Compilation generates a PEF file.
- Do a training run of the model, passing in the generated PEF file.
We discuss the code for this model in Learn about model creation with SambaFlow.
Prepare your environment
To prepare your environment, ensure that the SambaFlow package is installed.
Check your SambaFlow installation
You must have the sambaflow package installed to run this example and any of the tutorial examples.
- To check if the package is installed, run the command for your operating system:
  - For Ubuntu Linux:
    $ dpkg -s sambaflow
  - For Red Hat Enterprise Linux:
    $ rpm -qi sambaflow
- Examine the output and verify that the SambaFlow version that you are running matches the documentation you are using (see the example below).
- If you see a message that sambaflow is not installed, contact your system administrator.
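For example, to pull out just the version line from the package metadata, you can filter the output of the commands above (a convenience sketch; the plain commands show the same information):

$ dpkg -s sambaflow | grep '^Version'    # Ubuntu Linux
$ rpm -qi sambaflow | grep '^Version'    # Red Hat Enterprise Linux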
Download the model code
The tutorials in this doc set use different code from the tutorials included in /opt/sambaflow/apps. The tutorial examples have been updated and streamlined.
logreg model code for download
r"""
This logreg example has minimal code that supports defining a model, compiling it, and
doing a training/test run.
In the accompanying tutorial at https://docs.sambanova.ai/developer/latest/getting-started.html,
we show the commands and options.
"""
import argparse
import sys
from typing import Tuple
import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
import sambaflow.samba.utils as utils
from sambaflow import samba
from sambaflow.samba.utils.argparser import (parse_app_args,
parse_yaml_to_args)
from sambaflow.samba.utils.dataset.mnist import dataset_transform
from sambaflow.samba.utils.pef_utils import get_pefmeta
class LogReg(nn.Module):
""" Define the model architecture
Define the model architecture i.e. the layers in the model and the
number of features in each layer.
Args:
        lin_layer (ivar): Linear layer
criterion (ivar): Cross Entropy loss layer
"""
def __init__(self, num_features: int, num_classes: int, bias: bool):
""" Initialization function for this class
Args:
num_features (int): Number of input features for the model
            num_classes (int): Number of output labels among which the model classifies inputs
            bias (bool): If True, the linear layer learns an additive bias
"""
super().__init__()
self.num_features = num_features
self.num_classes = num_classes
# Linear layer for predicting target class of inputs
self.lin_layer = nn.Linear(in_features=num_features, out_features=num_classes, bias=bias)
# Cross Entropy layer for loss computation
self.criterion = nn.CrossEntropyLoss()
def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
""" Forward pass of the model for the given inputs.
The forward pass predicts the class labels for the inputs
and computes the loss between the correct and predicted class labels.
Args:
inputs (torch.Tensor): Input samples in the dataset
targets (torch.Tensor): correct labels for the inputs
Returns:
            Tuple[torch.Tensor, torch.Tensor]: The loss and predicted classes of the inputs
"""
out = self.lin_layer(inputs)
loss = self.criterion(out, targets)
return loss, out
def add_args(parser: argparse.ArgumentParser) -> None:
""" Add model-specific arguments.
By default, the compiler and the Samba framework support a set of arguments to compile() and run().
    The argument parser supports adding application-specific arguments.
Args:
parser (argparse.ArgumentParser): SambaNova argument parser.
"""
parser.add_argument('--lr', type=float, default=0.0015, help="Learning rate for training")
parser.add_argument('--momentum', type=float, default=0.0, help="Momentum value for training")
parser.add_argument('--weight-decay', type=float, default=3e-4, help="Weight decay for training")
parser.add_argument('--num-epochs', '-e', type=int, default=1)
parser.add_argument('--num-steps', type=int, default=-1)
parser.add_argument('--num-features', type=int, default=784)
parser.add_argument('--num-classes', type=int, default=10)
parser.add_argument('--yaml-config', default=None, type=str, help='YAML file used with launch_app.py')
parser.add_argument('--data-dir',
'--data-folder',
type=str,
default='mnist_data',
help="The folder to download the MNIST dataset to.")
parser.add_argument('--bias', action='store_true', help='Linear layer will learn an additive bias')
# end args
def prepare_dataloader(args: argparse.Namespace) -> Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]:
""" Preparation of dataloader.
Prep work for training the logreg model with the `MNIST dataset <http://yann.lecun.com/exdb/mnist/>`__:
    We load the predefined train and test splits of the dataset and return the corresponding data loaders.
    Args:
        args (argparse.Namespace): Arguments, including the location of the dataset

    Returns:
        Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]: Train and test data loaders
"""
# Get the train & test data (images and labels) from the MNIST dataset
train_dataset = torchvision.datasets.MNIST(root=f'{args.data_dir}',
train=True,
transform=dataset_transform(vars(args)),
download=True)
test_dataset = torchvision.datasets.MNIST(root=f'{args.data_dir}',
train=False,
transform=dataset_transform(vars(args)))
# Get the train & test data loaders (input pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=args.batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=args.batch_size, shuffle=False)
return train_loader, test_loader
def train(args: argparse.Namespace, model: nn.Module, output_tensors: Tuple[samba.SambaTensor]) -> None:
"""
Train the model. At the end of a training loop, the model will be able
to correctly predict the class labels for any input, within a certain
accuracy.
Args:
args (argparse.Namespace): Hyperparameter values and accuracy test behavior controls
model (nn.Module): Model to be trained
        output_tensors (Tuple[samba.SambaTensor]): Output tensors of the traced graph,
            as returned by utils.trace_graph()
"""
# Get data loaders for training and test data
train_loader, test_loader = prepare_dataloader(args)
# Total training steps (iterations) per epoch
total_step = len(train_loader)
hyperparam_dict = {"lr": args.lr, "momentum": args.momentum, "weight_decay": args.weight_decay}
# Train and test for specified number of epochs
for epoch in range(args.num_epochs):
avg_loss = 0
# Train the model for all samples in the train data loader
for i, (images, labels) in enumerate(train_loader):
global_step = epoch * total_step + i
if args.num_steps > 0 and global_step >= args.num_steps:
                print('Maximum number of steps reached.')
return None
sn_images = samba.from_torch_tensor(images, name='image', batch_dim=0)
sn_labels = samba.from_torch_tensor(labels, name='label', batch_dim=0)
loss, outputs = samba.session.run(input_tensors=[sn_images, sn_labels],
output_tensors=output_tensors,
hyperparam_dict=hyperparam_dict,
data_parallel=args.data_parallel,
reduce_on_rdu=args.reduce_on_rdu)
# Sync the loss and outputs with host memory
loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
avg_loss += loss.mean()
            # Print the average loss every 10,000 steps of each epoch
if (i + 1) % 10000 == 0 and args.local_rank <= 0:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.num_epochs, i + 1, total_step,
avg_loss / (i + 1)))
# Check the accuracy of the trained model for all samples in the test data loader
# Sync the model parameters with host memory
samba.session.to_cpu(model)
test_acc = 0.0
with torch.no_grad():
correct = 0
total = 0
total_loss = 0
for images, labels in test_loader:
loss, outputs = model(images, labels)
loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
total_loss += loss.mean()
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum()
test_acc = 100.0 * correct / total
if args.local_rank <= 0:
print(f'Test Accuracy: {test_acc:.2f} Loss: {total_loss.item() / len(test_loader):.4f}')
def main(argv):
"""
:param argv: Command line arguments (`compile`, `test`, `run`, `measure-performance` or `measure-sections`)
"""
utils.set_seed(256)
args_cli = parse_app_args(argv=argv, common_parser_fn=add_args)
args_composed = parse_yaml_to_args(args_cli.yaml_config, args_cli) if args_cli.yaml_config else args_cli
args = args_composed
    # When not running in distributed mode, the local rank is -1.
args.local_rank = dist.get_rank() if dist.is_initialized() else -1
# Create random input and output for compilation
ipt = samba.randn(args.batch_size, args.num_features, name='image', batch_dim=0).bfloat16().float()
tgt = samba.randint(args.num_classes, (args.batch_size, ), name='label', batch_dim=0)
inputs = (ipt, tgt)
# Instantiate the model
model = LogReg(args.num_features, args.num_classes, args.bias)
# Sync model parameters with RDU memory
samba.from_torch_model_(model)
# Instantiate an optimizer if the model will be trained
if args.inference:
optimizer = None
else:
# We use the SGD optimizer to update the weights of the model
optimizer = samba.optim.SGD(model.parameters(),
lr=args.lr,
momentum=args.momentum,
weight_decay=args.weight_decay)
if args.command == "compile":
# Compile the model to generate a PEF (Plasticine Executable Format) binary
samba.session.compile(model,
inputs,
optimizer,
name='logreg_torch',
app_dir=utils.get_file_dir(__file__),
config_dict=vars(args),
pef_metadata=get_pefmeta(args, model))
else:
assert args.command == "run"
# Trace the compiled graph to initialize the model weights and input/output tensors
# for execution on the RDU.
# The PEF required for tracing is the binary generated during compilation
traced_outputs = utils.trace_graph(model, inputs, optimizer, pef=args.pef)
# Train the model on RDU. This is where the model will be trained
# i.e. weights will be learned to fit the input dataset
train(args, model, traced_outputs)
if __name__ == '__main__':
main(sys.argv[1:])
SambaNova recommends that you create your own directory inside your home directory for the tutorial code:
- Log in to your SambaNova environment.
- Create a directory for the tutorials, and a subdirectory for logreg:
  $ mkdir $HOME/tutorials
  $ mkdir $HOME/tutorials/logreg
- Copy the logreg.py file that you just downloaded into $HOME/tutorials/logreg (see the example below).
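For example, if the downloaded file is in your current working directory (the source path here is only an illustration; adjust it to wherever you saved the download):

$ cp ./logreg.py $HOME/tutorials/logreg/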
Compile and run your first model
This Hello World! example uses the classic machine learning problem of recognizing the hand-written digits in the MNIST dataset.
Look at supported options
Each example and each model has its own set of supported options.
To see all arguments for the logreg model, change to the directory you created earlier and look at the --help output:
$ cd $HOME/tutorials/logreg
$ python logreg.py --help
The output looks similar to the following:
usage: logreg.py [-h] {compile,run,test,measure-performance} ...
positional arguments:
{compile,run,test,measure-performance}
different modes of operation
optional arguments:
-h, --help show this help message and exit
The output shows that you can compile and run this model.
The test and measure-performance options are primarily used internally or when working with SambaNova Support.
You can drill down and run each command with --help to see options at that level. For example, run the following command to see options for run:
$ python logreg.py run --help
In most cases, using the defaults for the optional arguments is best. In Useful arguments for logreg.py we list a few commonly used arguments.
Prepare data
This tutorial downloads train and test datasets from the internet, so there’s no separate step for preparing data.
If your system does not have access to the internet, you have to download the data to a system that has access and make the files available. See Download model data (Optional).
Compile logreg
When you compile the model, the compiler generates a PEF file that is suitable for running on the RDU architecture. You later pass in that file when you do a training run.
- Start in the tutorials/logreg directory that you created in Download the model code:
  $ cd $HOME/tutorials/logreg
- Run the compilation step, passing in the name of the PEF file to be generated:
  $ python logreg.py compile --pef-name="logreg"
- The compiler runs the model and displays progress messages and warnings on screen.
  - You can safely ignore all info and warning messages.
  - If a message says warning samba, it might indicate a problem with your code.
  - For some background, see SambaNova messages and logs.
When the command returns to the prompt, look for this output, shown toward the end:
-
Compilation succeeded for partition_X_X
shows you that compilation succeeded. -
Logs are generated in …
shows where the log files are located.
-
-
- Verify that the PEF file was generated:
$ ls -lh ./out/logreg/logreg.pef
The generated PEF file contains all information that the system needs to do a training run of the model.
Start a logreg training run
When you do a training run, the application uploads the PEF file onto the chip and trains the model with the specified dataset. This example uses the MNIST dataset. The example code downloads the data set automatically.
If your system is disconnected from the internet, you have to manually download the dataset on a system with internet access and copy the dataset to the system you are running the models on. See Download model data (Optional).
- Start a training run of the model with the PEF file that you generated. Use --num-epochs (or -e) to specify the number of epochs (the default is 1).
  $ python $HOME/tutorials/logreg/logreg.py run --num-epochs 2 --pef=out/logreg/logreg.pef
  Even one epoch would be enough to train this simple model, but we use --num-epochs 2 to see whether the loss decreases in the second epoch. The run command:
  - Downloads the model data.
  - Returns output that includes the following:
2023-01-25T15:14:06 : [INFO][LIB][1421606]: sn_create_session: PEF File: out/logreg/logreg.pef
Log ID initialized to: [snuser1][python][1421606] at /var/log/sambaflow/runtime/sn.log
Epoch [1/2], Step [10000/60000], Loss: 0.4634
Epoch [1/2], Step [20000/60000], Loss: 0.4085
Epoch [1/2], Step [30000/60000], Loss: 0.3860
Epoch [1/2], Step [40000/60000], Loss: 0.3702
Epoch [1/2], Step [50000/60000], Loss: 0.3633
Epoch [1/2], Step [60000/60000], Loss: 0.3555
Test Accuracy: 91.54 Loss: 0.3012
Epoch [2/2], Step [10000/60000], Loss: 0.2861
Epoch [2/2], Step [20000/60000], Loss: 0.3065
Epoch [2/2], Step [30000/60000], Loss: 0.3080
Epoch [2/2], Step [40000/60000], Loss: 0.3084
Epoch [2/2], Step [50000/60000], Loss: 0.3076
Epoch [2/2], Step [60000/60000], Loss: 0.3061
Test Accuracy: 91.54 Loss: 0.3001
Congratulations! You have run your first model on the SambaNova system! The output shows that the training run succeeded: the loss is low and decreases over time, and the test accuracy is above 91%.
Useful arguments for logreg.py
Each of the example model commands has several arguments. In most cases, the default gives good results.
Arguments for compile
For a list of compile arguments for use with logreg.py, run this command:
$ python $HOME/tutorials/logreg/logreg.py compile --help
The command returns a full list of arguments. Here are some useful arguments:
- --pef-name — Name of the output file, which has the information for running the model on RDU.
- --n-chips, --num-tiles — Number of chips you want to use (from 1 to 8) and the number of tiles on the chip (1, 2, or 4). Default is 1 chip (4 tiles).
- --num-features — Number of input features (for this model the default is 784).
- --num-classes — Number of output labels (for this model the default is 10).
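For example, a compile invocation that sets a custom PEF name and compiles for two tiles might look like the following. The PEF name logreg_2tile and the tile count are illustrative values, not recommendations:

$ python logreg.py compile --pef-name="logreg_2tile" --num-tiles 2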
Arguments for run
For a list of run arguments for use with logreg.py, run this command:
$ python $HOME/tutorials/logreg/logreg.py run --help
The command returns a full list of arguments. Here are some important arguments:
- -p PEF — The only required argument. A PEF file that was the output from a compile.
- -b BATCH_SIZE, --batch-size BATCH_SIZE — How many samples to put in one batch.
- -e, --num-epochs — How many epochs to run with the model.
- --num-features, --num-classes — Input features and output classes for the model.
- --lr — Learning rate parameter. Decimal fraction between 0 and 1.
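As an illustration, a training run that sets some of these arguments explicitly might look like the following. The epoch count and learning rate shown are example values only, not recommendations:

$ python logreg.py run --pef=out/logreg/logreg.pef -e 2 --lr 0.001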
Learn more!
- To understand what the messages to stdout mean, see SambaNova messages and logs.
- To learn how to run models inside a Python virtual environment, see Use Python virtual environments.
Download model data (Optional)
Only users without internet access perform this task. By default, the application code downloads model data.
If you run the example on a system that is not connected to the internet, you have to download the model data from a connected system and copy the data to the system where you want to run the model.
- On a connected system, run:
  $ mkdir -p /tmp/data/MNIST/raw
  $ cd /tmp/data/MNIST/raw
  $ wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
  $ wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
  $ wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
  $ wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
- Copy the four .gz files to the DataScale system and place them in the directory /tmp/data/MNIST/raw.
- When you later use the compile and the run commands, add the --data-folder=/tmp/data argument, as shown in the example below.
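For example, with the dataset staged in /tmp/data as described above, the compile and run commands from this tutorial would look like this:

$ python logreg.py compile --pef-name="logreg" --data-folder=/tmp/data
$ python logreg.py run --pef=out/logreg/logreg.pef --data-folder=/tmp/data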