Compilation, training, and inference

This tutorial takes you a few steps beyond the Hello SambaFlow! tutorial. You also learn about dataset preparation, testing (validation), and inference. The result is a complete end-to-end machine learning workflow:

  1. Check the SambaFlow installation.

  2. Prepare the dataset.

  3. Compile the model.

  4. Train the model.

  5. Test (validate) the model.

  6. Run inference on the model and visually check predictions.

We discuss the code for this model in the compiler reference.
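
As a rough sketch of how steps 3 through 6 map onto the script's command line, the following uses only the standard-library argparse. It is a simplified stand-in, not SambaFlow's parse_app_args: the spelling of the --inference flag is an assumption inferred from the args.inference attribute in the model code, and build_parser and stage_for are hypothetical helpers.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser with the two commands this tutorial uses."""
    parser = argparse.ArgumentParser(prog="lenet.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("compile")
    run = sub.add_parser("run")
    run.add_argument("--test", action="store_true")
    # Assumption: in the real script this flag comes from SambaFlow's parser.
    run.add_argument("--inference", action="store_true")
    return parser


def stage_for(argv):
    """Return the workflow stage that a command line selects (hypothetical)."""
    args = build_parser().parse_args(argv)
    if args.command == "compile":
        return "compile"
    # Same order of checks as main() in the model code: inference, test, train.
    if args.inference:
        return "inference"
    if args.test:
        return "test"
    return "train"


for argv in (["compile"], ["run"], ["run", "--test"], ["run", "--inference"]):
    print(" ".join(argv), "->", stage_for(argv))
```

The point of the sketch is only the dispatch order: a plain run trains, and the two flags switch the same run command to testing or inference.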

Prepare your environment

To prepare your environment, you:

  • Check your SambaFlow installation.

  • Download the tutorial files from this document.

  • Download the data files from the internet.

Check your SambaFlow installation

You must have the sambaflow package installed to run this example and any of the tutorial examples.

  1. To check if the package is installed, run this command:

    • For Ubuntu Linux

      $ dpkg -s sambaflow
    • For Red Hat Enterprise Linux

      $ rpm -qi sambaflow
  2. Examine the output and verify that the SambaFlow version that you are running matches the documentation you are using.

  3. If you see a message that sambaflow is not installed, contact your system administrator.
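
If you prefer to script the check, the steps above can be approximated in Python. This is a minimal sketch under stated assumptions: it calls the same dpkg and rpm commands shown above, it assumes the package manager prints a line beginning with "Version", and the helper names are hypothetical.

```python
import importlib.util
import shutil
import subprocess


def sambaflow_importable() -> bool:
    """True if the current Python environment can locate a sambaflow package."""
    return importlib.util.find_spec("sambaflow") is not None


def package_query_command():
    """Choose the query command from the steps above, based on available tools."""
    if shutil.which("dpkg"):
        return ["dpkg", "-s", "sambaflow"]
    if shutil.which("rpm"):
        return ["rpm", "-qi", "sambaflow"]
    return None


def describe_installation() -> str:
    """Summarize what the package manager reports about sambaflow."""
    cmd = package_query_command()
    if cmd is None:
        return "neither dpkg nor rpm was found on this machine"
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return "sambaflow is not installed; contact your system administrator"
    # Assumption: the output contains a line that starts with "Version".
    for line in result.stdout.splitlines():
        if line.lower().startswith("version"):
            return line.strip()
    return "sambaflow is installed (no version line found)"


print("importable from Python:", sambaflow_importable())
print(describe_installation())
```

The import check and the package check answer different questions: the first tells you whether the Python interpreter you are running can see the package, the second whether the operating system package is present.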

Download the model code

The tutorials in this doc set use different code than the tutorials included in /opt/sambaflow/apps. The tutorial examples have been updated and streamlined.

For this tutorial, you download several files.

Model code

lenet model code for download
import argparse
import os
from pathlib import Path
from typing import Tuple

import sambaflow.samba.utils as utils
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from mnist_utils import CustomMNIST, write_labels
from sambaflow import __version__ as sambaflow_version
from sambaflow import samba
from sambaflow.samba.utils.argparser import parse_app_args
from sambaflow.samba.utils.pef_utils import get_pefmeta
from torch.utils.data.dataloader import DataLoader


class LeNet(nn.Module):
    """
    LeNet model for MNIST classification.

    Attributes:
        state: Dictionary to hold model's completed_steps and completed_epochs.
        conv1, conv2: Convolutional layers.
        maxpool1, maxpool2: Max pooling layers.
        fc1, fc2, fc3: Fully connected layers.
        criterion: Loss function.
    """

    def __init__(self, num_classes: int) -> None:
        super(LeNet, self).__init__()
        self.state = {"completed_steps": 0, "completed_epochs": 0}

        self.conv1 = nn.Conv2d(in_channels=1,
                               out_channels=6,
                               kernel_size=(3, 3),
                               stride=(1, 1),
                               padding=(1, 1),
                               dilation=(1, 1),
                               bias=False)

        self.maxpool1 = nn.MaxPool2d(kernel_size=(2, 2))

        self.conv2 = nn.Conv2d(in_channels=6,
                               out_channels=16,
                               kernel_size=(3, 3),
                               stride=(1, 1),
                               padding=(1, 1),
                               dilation=(1, 1),
                               bias=False)
        self.maxpool2 = nn.MaxPool2d(kernel_size=(2, 2))

        self.fc1 = nn.Linear(16 * 7 * 7, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

        self.criterion = nn.CrossEntropyLoss()

    def forward(self, inputs, labels):
        """Defines the forward propagation step."""
        x = self.conv1(inputs).relu()
        x = self.maxpool1(x)
        x = self.conv2(x).relu()
        x = self.maxpool2(x)
        x = torch.reshape(x, [x.shape[0], -1])
        x = self.fc1(x).relu()
        x = self.fc2(x).relu()
        out = self.fc3(x)
        loss = self.criterion(out, labels)
        return loss, out


def get_inputs(params) -> Tuple[samba.SambaTensor, samba.SambaTensor]:
    """
    Creates input images and labels to set the model's shape for compilation.

    Args:
        params: A dictionary containing various parameters including 'batch_size' and 'num_classes'.

    Returns:
        A tuple of input images and labels.
    """
    images = samba.randn(params['batch_size'],
                         1,
                         28,
                         28,
                         name='image',
                         batch_dim=0)
    labels = samba.randint(params['num_classes'], (params['batch_size'],),
                           name='label',
                           batch_dim=0)
    return (images, labels)


def get_dataset(dataset_name: str, params):
    """Retrieves the specified dataset after applying necessary transformations.

    Args:
        dataset_name: The name of the dataset to retrieve.
        params: A dictionary containing various parameters including 'data_dir' and 'inference'.

    Returns:
        The requested dataset as a CustomMNIST object.
    """
    transform = transforms.Compose([
        transforms.ToTensor(),
        # Normalize with mean 0.1307 and standard deviation 0.3081
        transforms.Normalize((0.1307,), (0.3081,)),
        # Reshape image to 1x28x28
        lambda x: x.reshape((1, 28, 28)),
    ])

    data_dir = Path(params['data_dir'])
    img_file = data_dir / (dataset_name + "-images-idx3-ubyte")
    if params['inference']:  # if running for inference there's no labels file
        lbl_file = None
    else:
        lbl_file = data_dir / (dataset_name + "-labels-idx1-ubyte")
    dataset = CustomMNIST(img_file, lbl_file, transform=transform)

    return dataset


def load_checkpoint(model, optimizer, init_ckpt_path: str):
    """
    Loads a checkpoint from a file and initializes the model and optimizer.

    Args:
        model (object): The model to be loaded.
        optimizer (object): The optimizer to be loaded.
        init_ckpt_path (str): The path to the checkpoint file.

    Returns:
        None
    """
    print(f"Loading checkpoint from file {init_ckpt_path}")
    ckpt = torch.load(init_ckpt_path)
    if model:
        print("Loading model...")
        model.load_state_dict(ckpt['model'])
        model.state['completed_steps'] = ckpt['completed_steps']
        model.state['completed_epochs'] = ckpt['completed_epochs']
    if optimizer:
        print("Loading optimizer...")
        optimizer.load_state_dict(ckpt['optimizer'])


def save_checkpoint(model, optimizer, completed_steps, completed_epochs,
                    ckpt_dir):
    """
    Saves the model checkpoint with the given parameters.

    Args:
        model (nn.Module): The model to be saved.
        optimizer (torch.optim.Optimizer): The optimizer to be saved.
        completed_steps (int): The number of completed steps.
        completed_epochs (int): The number of completed epochs.
        ckpt_dir (str): The directory in which to save the checkpoint.

    Returns:
        str: The path of the saved checkpoint.
    """
    ckpt_dir_path = Path(ckpt_dir)
    ckpt_dir_path.mkdir(parents=True, exist_ok=True)

    state_dict = {
        'completed_steps': completed_steps,
        'completed_epochs': completed_epochs,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict()
    }

    ckpt_path = ckpt_dir_path / (str(completed_steps) + ".pt")
    torch.save(state_dict, ckpt_path)
    return ckpt_path


def log_step(epoch, num_epochs, current_step, total_steps, loss):
    """
    Prints the current step information during training.

    Args:
        epoch (int): The current epoch number.
        num_epochs (int): The total number of epochs.
        current_step (int): The current step number.
        total_steps (int): The total number of steps.
        loss (float): The loss value for the current step.

    Returns:
        None
    """
    print(
        f"Epoch [{epoch}/{num_epochs}], Step [{current_step}/{total_steps}], Loss: {loss:.4f}"
    )


def prepare(model: nn.Module, optimizer, params):
    """
    Prepares the model by loading a checkpoint and tracing the graph.

    Args:
        model (nn.Module): The model to prepare.
        optimizer: The optimizer for the model.
        params: A dictionary of parameters.

    Returns:
        None
    """

    # We need to load the checkpoint first and then trace the graph to sync the weights from CPU to RDU
    if params['init_ckpt_path']:
        load_checkpoint(model, optimizer, params['init_ckpt_path'])
    else:
        print('[WARNING] No valid initial checkpoint has been provided')

    inputs = get_inputs(params)
    utils.trace_graph(model,
                      inputs,
                      optimizer,
                      pef=params['pef'],
                      mapping=params['mapping'])


def train(model: LeNet, optimizer, params) -> None:
    """
    Trains the given model using the specified optimizer and parameters.

    Args:
        model (LeNet): The model to be trained.
        optimizer: The optimizer to be used during training.
        params: A dictionary containing the parameters for training.

    Returns:
        None
    """
    if params['dataset_name'] is None:
        dataset_name = "train"
    else:
        dataset_name = params['dataset_name']
    data_dir = Path(params['data_dir'])
    print(f"Using dataset: {data_dir / dataset_name}")
    train_dataset = get_dataset(dataset_name, params)
    train_loader = DataLoader(train_dataset,
                              batch_size=params['batch_size'],
                              drop_last=True,
                              shuffle=True)

    # Train the model
    current_step = model.state['completed_steps']
    current_epoch = model.state['completed_epochs']
    total_steps = len(train_loader) * params['num_epochs']
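    # Use >= so that a checkpoint trained past the requested epoch count also
    # exits here; otherwise the loop below never runs and `epoch` is undefined.
    if current_epoch >= params['num_epochs']:
        print(
            f"Epochs trained: {current_epoch} has reached epochs requested: {params['num_epochs']}. Exiting..."
        )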
        return
    print("=" * 30)
    print(f"Initial epoch: {current_epoch:3n}, initial step: {current_step:6n}")
    print(
        f"Target epoch:  {params['num_epochs']:3n}, target step:  {total_steps:6n}"
    )
    hyperparam_dict = {
        "lr": params['lr'],
        "momentum": params['momentum'],
        "weight_decay": params['weight_decay']
    }
    for epoch in range(current_epoch + 1, params['num_epochs'] + 1):
        avg_loss = 0
        for i, (images, labels) in enumerate(train_loader):
            sn_images = samba.from_torch_tensor(images,
                                                name='image',
                                                batch_dim=0)
            sn_labels = samba.from_torch_tensor(labels,
                                                name='label',
                                                batch_dim=0)

            loss, outputs = samba.session.run(
                input_tensors=[sn_images, sn_labels],
                output_tensors=model.output_tensors,
                hyperparam_dict=hyperparam_dict,
                data_parallel=params['data_parallel'],
                reduce_on_rdu=params['reduce_on_rdu'])
            loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
            avg_loss += loss.mean()
            current_step += 1

            if (i + 1) % 100 == 0:
                log_step(epoch, params['num_epochs'], current_step, total_steps,
                         avg_loss / (i + 1))

    current_epoch = epoch

    samba.session.to_cpu(model)
    save_checkpoint(model, optimizer, current_step, current_epoch,
                    params['ckpt_dir'])


def test(model, dataset_name, params):
    """
    Calculates the test accuracy and loss for the given model and dataset.

    Parameters:
        model (object): The model to be tested.
        dataset_name (str): The name of the dataset to be used.
        params (dict): A dictionary of parameters.

    Returns:
        None
    """
    if dataset_name is None:
        dataset_name = "t10k"
    data_dir = Path(params['data_dir'])
    print(f"Using dataset: {data_dir / dataset_name}")
    test_dataset = get_dataset(dataset_name, params)
    test_loader = DataLoader(test_dataset,
                             drop_last=True,
                             batch_size=params['batch_size'])

    samba.session.to_cpu(model)
    test_acc = 0.0
    with torch.no_grad():
        correct = 0
        total = 0
        total_loss = 0
        for images, labels in test_loader:
            loss, outputs = model(images, labels)
            loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)
            total_loss += loss.mean()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum()

        test_acc = 100.0 * correct / total
        print('Test Accuracy: {:.2f}'.format(test_acc),
              ' Loss: {:.4f}'.format(total_loss.item() / (len(test_loader))))


def batch_predict(model, dataset_name: str, params):
    """
    Generates the predictions for a given model on a dataset.

    Args:
        model (object): The trained model to use for prediction.
        dataset_name (str): The name of the dataset to use for prediction.
        params (dict): Additional parameters for the prediction.

    Returns:
        None
    """
    if dataset_name is None:
        dataset_name = "inference"
    data_dir = Path(params['data_dir'])
    print(f"Using dataset: {data_dir / dataset_name}")
    dataset = get_dataset(dataset_name, params)

    loader = DataLoader(dataset,
                        batch_size=params.get('batch_size', 32),
                        drop_last=True,
                        shuffle=False)

    predicted_labels = []
    for _, (images, labels) in enumerate(loader):
        sn_images = samba.from_torch_tensor(images, name='image', batch_dim=0)
        sn_labels = samba.from_torch_tensor(labels, name='label', batch_dim=0)

        loss, predictions = samba.session.run(
            input_tensors=[sn_images, sn_labels],
            output_tensors=model.output_tensors,
            section_types=['fwd'])
        loss, predictions = samba.to_torch(loss), samba.to_torch(predictions)
        _, predicted_indices = torch.max(predictions, axis=1)  # type: ignore

        predicted_labels += predicted_indices.tolist()

    # write to the file in the same format labels are stored
    results_dir = Path(params['results_dir'])
    results_dir.mkdir(parents=True, exist_ok=True)
    write_labels(predicted_labels,
                 str(results_dir / "prediction-labels-idx1-ubyte"))


def add_common_args(parser: argparse.ArgumentParser):
    """
    Adds common arguments to the given ArgumentParser object.

    Args:
        parser (argparse.ArgumentParser): The ArgumentParser object to add the arguments to.

    Returns:
        None
    """
    parser.add_argument('--num-classes',
                        type=int,
                        default=10,
                        help="Number of output classes (default=10)")
    parser.add_argument('--num-features',
                        type=int,
                        default=784,
                        help="Number of input features (default=784)")
    parser.add_argument('--lr',
                        type=float,
                        default=0.1,
                        help="Learning rate (default=0.1)")
    parser.add_argument('-b',
                        '--batch-size',
                        type=int,
                        default=32,
                        help="Batch size (default=32)")
    parser.add_argument('--momentum',
                        type=float,
                        default=0.0,
                        help="Momentum (default=0.0)")
    parser.add_argument('--weight-decay',
                        type=float,
                        default=0.01,
                        help="Weight decay (default=0.01)")
    parser.add_argument('--print-params',
                        action='store_true',
                        default=False,
                        help="Print the model parameters (default=False)")


def add_run_args(parser: argparse.ArgumentParser):
    """
    Add runtime arguments to the parser.

    Args:
        parser (argparse.ArgumentParser): The parser to which the arguments will be added.

    Returns:
        None
    """
    parser.add_argument('-e', '--num-epochs', type=int, default=1)
    parser.add_argument('--log-path', type=str, default='checkpoints')
    parser.add_argument('--test',
                        action="store_true",
                        help="Test the trained model")
    parser.add_argument('--init-ckpt-path',
                        type=str,
                        default='',
                        help='Path to load checkpoint')
    parser.add_argument('--ckpt-dir',
                        type=str,
                        default=os.getcwd(),
                        help='Path to save checkpoint')
    parser.add_argument('--data-dir',
                        type=str,
                        default='./data',
                        help="Directory containing datasets")
    parser.add_argument('--dataset-name',
                        type=str,
                        help="Dataset name: train, t10k, inference, etc.")
    parser.add_argument('--results-dir',
                        type=str,
                        default='./results',
                        help="Directory to store inference results")


def print_params(params):
    """
    Prints the parameters and their values when --print-params is passed.

    Args:
        params (dict): A dictionary containing the parameters and their values.

    Returns:
        None
    """
    for k in sorted(params.keys()):
        print(f"{k}: {params[k]}")


def main():
    args = parse_app_args(dev_mode=True,
                          common_parser_fn=add_common_args,
                          test_parser_fn=add_run_args,
                          run_parser_fn=add_run_args)
    utils.set_seed(42)
    params = vars(args)
    if args.print_params:
        print_params(params)

    model = LeNet(args.num_classes)
    samba.from_torch_model_(model)

    inputs = get_inputs(params)

    optimizer = samba.optim.SGD(model.parameters(),
                                lr=0.0) if not args.inference else None
    if args.command == "compile":
        pef_metadata = get_pefmeta(args, model)
        pef_metadata['sambaflow_version'] = sambaflow_version
        pef_metadata['input_shapes'] = (inputs[0].shape, inputs[1].shape)
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='lenet',
                              app_dir=utils.get_file_dir(__file__),
                              squeeze_bs_dim=True,
                              config_dict=vars(args),
                              pef_metadata=pef_metadata)

    elif args.command == "test":
        print("Test is not implemented in this version.")
    elif args.command == "run":
        if args.inference:
            prepare(model, optimizer, params)
            batch_predict(model, params['dataset_name'], params)
        elif args.test:
            prepare(model, optimizer, params)
            test(model, params['dataset_name'], params)
        else:
            prepare(model, optimizer, params)
            train(model, optimizer, params)


if __name__ == '__main__':
    main()
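
The fc1 input size of 16 * 7 * 7 in the model above follows from the layer settings. The sketch below works through the arithmetic in plain Python using the standard output-size formula for convolution and pooling layers; conv_out and pool_out are hypothetical helper names, and the pooling stride is taken to be the kernel size, which is the PyTorch default when no stride is given.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0,
             dilation: int = 1) -> int:
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def pool_out(size: int, kernel: int) -> int:
    """Output side length of max pooling whose stride equals the kernel size."""
    return (size - kernel) // kernel + 1


side = 28                                              # input images are 28x28
side = conv_out(side, kernel=3, stride=1, padding=1)   # conv1 keeps 28
side = pool_out(side, kernel=2)                        # maxpool1 halves to 14
side = conv_out(side, kernel=3, stride=1, padding=1)   # conv2 keeps 14
side = pool_out(side, kernel=2)                        # maxpool2 halves to 7

flattened = 16 * side * side                           # conv2 has 16 output channels
print(side, flattened)                                 # prints: 7 784
```

Because both convolutions use a 3x3 kernel with padding 1, only the two pooling layers change the spatial size, so a 28x28 image reaches fc1 as 16 feature maps of 7x7.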

Utility file

The Python source for the LeNet model imports mnist_utils.py, which provides:

  • A CustomMNIST class that you will use to create a dataset from Fashion MNIST.

  • A write_labels() function for writing labels in the MNIST label format (you'll use it to store the predicted class labels).
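
To make the label file format concrete, here is a minimal standalone sketch of a write-and-read round trip. It re-implements the idea with the standard library only and does not import mnist_utils.py; write_label_file and read_label_file are hypothetical names, while the magic number 2049 and the 8-byte big-endian header are taken from the utility code.

```python
import struct
import tempfile
from pathlib import Path

LABEL_MAGIC = 2049  # value that the utility code expects in a label file header


def write_label_file(labels, path) -> None:
    """Write labels (integers 0-255) as an 8-byte header plus one byte each."""
    with open(path, "wb") as f:
        f.write(struct.pack(">II", LABEL_MAGIC, len(labels)))
        f.write(bytes(labels))


def read_label_file(path):
    """Read the labels back, validating the magic number first."""
    data = Path(path).read_bytes()
    magic, count = struct.unpack(">II", data[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"Magic number is wrong: expected {LABEL_MAGIC}, got {magic}")
    return list(data[8:8 + count])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "prediction-labels-idx1-ubyte"
    write_label_file([9, 2, 1, 1, 6], path)
    restored = read_label_file(path)

print(restored)  # prints: [9, 2, 1, 1, 6]
```

Each label occupies a single unsigned byte, so the format can hold class indices from 0 to 255, which is ample for a 10-class dataset.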

mnist_utils.py for download
import struct
from array import array
from typing import Callable, Optional, Tuple, List
import numpy as np
from torch.utils.data import Dataset
from pathlib import Path


class CustomMNIST(Dataset):
    """Custom dataset that is compatible with MNIST format."""

    def __init__(self,
                 images_path: Path,
                 labels_path: Optional[Path],
                 transform: Optional[Callable] = None) -> None:
        """Initialize the dataset with images and labels path, and a transform.

        Args:
            images_path (Path): Path to the images file.
            labels_path (Optional[Path]): Path to the labels file, or None for inference.
            transform (Optional[Callable]): Optional transform to be applied on an image.
        """
        self.images = self.read_images(images_path)
        if labels_path is None:  # in case of inference
            self.labels = [0] * len(self.images)
        else:
            self.labels = self.read_labels(labels_path)
        self.transform = transform

    def __getitem__(self, idx: int) -> Tuple[np.array, int]:
        """Fetch an image-label pair by index.

        Args:
            idx (int): Index to the data.

        Returns:
            Tuple[np.array, int]: A tuple containing an image and its corresponding label.
        """
        label = self.labels[idx]
        image = self.images[idx]
        if self.transform:
            image = self.transform(image)
        return image, label

    def __len__(self) -> int:
        """Get the size of the dataset.

        Returns:
            int: The number of images in the dataset.
        """
        return len(self.labels)

    @staticmethod
    def check_magic_number(magic: int, expected: int) -> None:
        """Check if the magic number matches the expected number.

        Args:
            magic (int): The actual magic number.
            expected (int): The expected magic number.
        """
        if magic != expected:
            raise ValueError(
                f"Magic number is wrong: expected {expected}, got {magic}")

    @staticmethod
    def read_labels(file_path: Path) -> List[int]:
        """Read labels from an MNIST-formatted file.

        Args:
            file_path (Path): Path to the file to read from.

        Returns:
            List[int]: List of labels.
        """
        file_data = file_path.read_bytes()
        magic, size = struct.unpack(">II", file_data[:8])
        CustomMNIST.check_magic_number(magic, 2049)
        return list(array("B", file_data[8:]))

    @staticmethod
    def read_images(file_path: Path) -> List[np.array]:
        """Read images from an MNIST-formatted file.

        Args:
            file_path (Path): Path to the file to read from.

        Returns:
            List[np.array]: List of images.
        """
        file_data = file_path.read_bytes()
        magic, size, rows, cols = struct.unpack(">IIII", file_data[:16])
        CustomMNIST.check_magic_number(magic, 2051)
        image_data = array("B", file_data[16:])

        return [
            np.array(image_data[i * rows * cols:(i + 1) * rows * cols]).reshape(
                rows, cols) for i in range(size)
        ]


def write_labels(labels: List[int], filename: str) -> None:
    """Write labels to an MNIST-formatted file.

    Args:
        labels (List[int]): List of labels.
        filename (str): Filename to write to.
    """
    with open(filename, 'wb') as f:
        f.write(struct.pack('>ii', 2049, len(labels)))
        f.writelines(struct.pack('B', label) for label in labels)
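
The image file layout that read_images() parses can be illustrated without NumPy. The following is a sketch, not the tutorial's implementation: it builds a tiny in-memory file containing two 2x2 images instead of 28x28 ones, and build_image_bytes and parse_image_bytes are hypothetical names. The 16-byte big-endian header and the magic number 2051 come from the utility code above.

```python
import struct

IMAGE_MAGIC = 2051  # value that the utility code expects in an image file header


def build_image_bytes(images) -> bytes:
    """Pack equally sized 2-D lists of 0-255 pixel values into the file layout."""
    rows, cols = len(images[0]), len(images[0][0])
    header = struct.pack(">IIII", IMAGE_MAGIC, len(images), rows, cols)
    pixels = bytes(p for image in images for row in image for p in row)
    return header + pixels


def parse_image_bytes(data: bytes):
    """Split the pixel stream back into images using the header's dimensions."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"Magic number is wrong: expected {IMAGE_MAGIC}, got {magic}")
    body = data[16:]
    step = rows * cols  # number of bytes per image
    images = []
    for i in range(count):
        flat = body[i * step:(i + 1) * step]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


tiny = [[[0, 255], [128, 64]], [[1, 2], [3, 4]]]
parsed = parse_image_bytes(build_image_bytes(tiny))
print(parsed == tiny)  # prints: True
```

Pixels are stored row by row, one unsigned byte each, so image i starts at byte offset 16 + i * rows * cols.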

Jupyter notebook file for inference verification

For the inference step, we’ll use a Jupyter notebook file.

lenet_inference.ipynb for download
{
    "cells": [
     {
      "cell_type": "markdown",
      "id": "ec5e46fe",
      "metadata": {},
      "source": [
       "# Run inference on RDU\n",
       "\n",
       "In this example you run inference on RDU using the LeNet model.\n",
       "You generate predictions and then visualize them.\n",
       "The images and prediction labels files use the same format as the MNIST files.\n",
       "\n",
       "You randomly choose images and their predicted labels from the dataset and visualize them.\n",
       "Then you visually estimate the accuracy by inspecting the images and comparing to the predicted labels."
      ]
     },
     {
      "cell_type": "markdown",
      "id": "5672b5b8",
      "metadata": {},
      "source": [
       "## Model parameters\n",
       "\n",
       "First, define some parameters to use with the model.\n",
       "\n",
       "* `data_dir`: where your dataset is stored\n",
       "* `dataset_name`: the dataset name you are going to use for inference\n",
       "* `results_dir`: where the prediction labels will be stored\n",
       "* `init_ckpt_path`: the checkpoint file that we created during training\n",
       "* `batch_size`: batch size; should match the batch size in the PEF file\n",
       "* `num_classes`: number of classes (categories); usually 10 for MNIST datasets\n",
       "* `num_features`: number of pixels in the input images; usually 28x28 = 784\n",
       "* `pef`: the PEF file that was compiled for this model; could be the same one we used in training or one compiled for inference only\n",
       "* `mapping`: use `section` here\n",
       "* `inference`: set to True (obviously)"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 2,
      "id": "f4652a48",
      "metadata": {},
      "outputs": [],
      "source": [
       "params = {\n",
       "    \"data_dir\": \"/home/pavela/datasets/fashion-mnist\",\n",
       "    \"dataset_name\": \"t10k\",\n",
       "    \"results_dir\": \"/home/pavela/results/lenet/\",\n",
       "    \"init_ckpt_path\": \"/home/pavela/checkpoints/lenet/checkpoints-b32/1875.pt\",\n",
       "    \"batch_size\": 32,\n",
       "    \"num_classes\": 10,\n",
       "    \"num_features\": 784,\n",
       "    \"pef\": \"/home/pavela/sambaflow-apps/starters/lenet/out/lenet-b32/lenet-b32.pef\",\n",
       "    \"mapping\": \"section\",\n",
       "    \"inference\": True,\n",
       "}"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "064ea7bf",
      "metadata": {},
      "source": [
       "## Imports\n",
       "\n",
       "Some necessary imports.\n",
       "\n",
       "Note: sometimes you have to execute the cell below 2 or 3 times to eliminate the import errors."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 15,
      "id": "356a1f89",
      "metadata": {},
      "outputs": [],
      "source": [
       "import os\n",
       "from pathlib import Path\n",
       "from typing import Tuple, Optional, Callable, List\n",
       "from sys import exit\n",
       "\n",
       "import torch\n",
       "import torch.nn as nn\n",
       "import torchvision.transforms as transforms\n",
       "import sambaflow\n",
       "from sambaflow import samba\n",
       "from torch.utils.data.dataloader import DataLoader\n",
       "from mnist_utils import CustomMNIST, write_labels\n",
       "from sambaflow.samba.utils.utils import trace_graph\n",
       "from sambaflow.logging import samba_logger"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "42513f76",
      "metadata": {},
      "source": [
       "## Set the log level"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 16,
      "id": "e9747681",
      "metadata": {},
      "outputs": [],
      "source": [
       "# Set the log level\n",
       "samba_logger.set_level('error')"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "f59da2a1",
      "metadata": {},
      "source": [
       "## Define the model\n",
       "\n",
       "Use the same model class that was used in training."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 10,
      "id": "32d07117",
      "metadata": {},
      "outputs": [],
      "source": [
       "class LeNet(nn.Module):\n",
       "    \"\"\"\n",
       "    LeNet model for MNIST classification.\n",
       "\n",
       "    Attributes:\n",
       "        state: Dictionary to hold model's completed_steps and completed_epochs.\n",
       "        conv1, conv2: Convolutional layers.\n",
       "        maxpool1, maxpool2: Max pooling layers.\n",
       "        fc1, fc2, fc3: Fully connected layers.\n",
       "        criterion: Loss function.\n",
       "    \"\"\"\n",
       "\n",
       "    def __init__(self, num_classes: int) -> None:\n",
       "        super(LeNet, self).__init__()\n",
       "        self.state = {\"completed_steps\": 0, \"completed_epochs\": 0}\n",
       "\n",
       "        self.conv1 = nn.Conv2d(\n",
       "            in_channels=1,\n",
       "            out_channels=6,\n",
       "            kernel_size=(3, 3),\n",
       "            stride=(1, 1),\n",
       "            padding=(1, 1),\n",
       "            dilation=(1, 1),\n",
       "            bias=False,\n",
       "        )\n",
       "\n",
       "        self.maxpool1 = nn.MaxPool2d(kernel_size=(2, 2))\n",
       "\n",
       "        self.conv2 = nn.Conv2d(\n",
       "            in_channels=6,\n",
       "            out_channels=16,\n",
       "            kernel_size=(3, 3),\n",
       "            stride=(1, 1),\n",
       "            padding=(1, 1),\n",
       "            dilation=(1, 1),\n",
       "            bias=False,\n",
       "        )\n",
       "        self.maxpool2 = nn.MaxPool2d(kernel_size=(2, 2))\n",
       "\n",
       "        self.fc1 = nn.Linear(16 * 7 * 7, 120)\n",
       "        self.fc2 = nn.Linear(120, 84)\n",
       "        self.fc3 = nn.Linear(84, num_classes)\n",
       "\n",
       "        self.criterion = nn.CrossEntropyLoss()\n",
       "\n",
       "    def forward(self, inputs, labels):\n",
       "        \"\"\"Defines the forward propagation step.\"\"\"\n",
       "        x = self.conv1(inputs).relu()\n",
       "        x = self.maxpool1(x)\n",
       "        x = self.conv2(x).relu()\n",
       "        x = self.maxpool2(x)\n",
       "        x = torch.reshape(x, [x.shape[0], -1])\n",
       "        x = self.fc1(x).relu()\n",
       "        x = self.fc2(x).relu()\n",
       "        out = self.fc3(x)\n",
       "        loss = self.criterion(out, labels)\n",
       "        return loss, out"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "ce44bab5",
      "metadata": {},
      "source": [
       "## Prepare the model\n",
       "\n",
       "The following functions prepare the model: they load the checkpoint that we created during training and trace the model's graph."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 11,
      "id": "180e2132",
      "metadata": {},
      "outputs": [],
      "source": [
       "def load_checkpoint(model, optimizer, init_ckpt_path: str):\n",
       "    \"\"\"\n",
       "    Loads a checkpoint from a file and initializes the model and optimizer.\n",
       "\n",
       "    Args:\n",
       "        model (object): The model to be loaded.\n",
       "        optimizer (object): The optimizer to be loaded.\n",
       "        init_ckpt_path (str): The path to the checkpoint file.\n",
       "\n",
       "    Returns:\n",
       "        None\n",
       "    \"\"\"\n",
       "    print(f\"Loading checkpoint from file {init_ckpt_path}\")\n",
       "    ckpt = torch.load(init_ckpt_path)\n",
       "    if model:\n",
       "        print(\"Loading model...\")\n",
       "        model.load_state_dict(ckpt[\"model\"])\n",
       "        model.state[\"completed_steps\"] = ckpt[\"completed_steps\"]\n",
       "        model.state[\"completed_epochs\"] = ckpt[\"completed_epochs\"]\n",
       "    if optimizer:\n",
       "        print(\"Loading optimizer...\")\n",
       "        optimizer.load_state_dict(ckpt[\"optimizer\"])\n",
       "\n",
       "\n",
       "def get_inputs(params) -> Tuple[samba.SambaTensor, samba.SambaTensor]:\n",
       "    \"\"\"\n",
       "    Creates input images and labels to set the model's shape for compilation.\n",
       "\n",
       "    Args:\n",
       "        params: A dictionary containing various parameters including 'batch_size' and 'num_classes'.\n",
       "\n",
       "    Returns:\n",
       "        A tuple of input images and labels.\n",
       "    \"\"\"\n",
       "    images = samba.randn(params[\"batch_size\"], 1, 28, 28, name=\"image\", batch_dim=0)\n",
       "    labels = samba.randint(\n",
       "        params[\"num_classes\"], (params[\"batch_size\"],), name=\"label\", batch_dim=0\n",
       "    )\n",
       "    return (images, labels)\n",
       "\n",
       "\n",
       "def prepare(model: nn.Module, optimizer, params):\n",
       "    \"\"\"\n",
       "    Prepares the model by loading a checkpoint and tracing the graph.\n",
       "\n",
       "    Args:\n",
       "        model (nn.Module): The model to prepare.\n",
       "        optimizer: The optimizer for the model.\n",
       "        params: A dictionary of parameters.\n",
       "\n",
       "    Returns:\n",
       "        None\n",
       "    \"\"\"\n",
       "\n",
       "    # We need to load the checkpoint first and then trace the graph to sync the weights from CPU to RDU\n",
       "    if params[\"init_ckpt_path\"]:\n",
       "        load_checkpoint(model, optimizer, params[\"init_ckpt_path\"])\n",
       "    else:\n",
       "        return False\n",
       "\n",
       "    inputs = get_inputs(params)\n",
       "\n",
       "    trace_graph(model, inputs, optimizer, pef=params[\"pef\"], mapping=params[\"mapping\"])\n",
       "    return True"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "60dff13a",
      "metadata": {},
      "source": [
       "## Create a custom dataset\n",
       "\n",
       "Create a custom dataset using the PyTorch's Dataset class.\n",
       "You can read more about that here: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files\n",
       "\n",
       "The `CustomMNIST` class is described in the `mnist_utils.py` file and imported above."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 12,
      "id": "91417821",
      "metadata": {},
      "outputs": [],
      "source": [
       "def get_dataset(dataset_name: str, params):\n",
       "    \"\"\"Retrieves the specified dataset after applying necessary transformations.\n",
       "\n",
       "    Args:\n",
       "        dataset_name: The name of the dataset to retrieve.\n",
       "        params: A dictionary containing various parameters including 'data_dir' and 'inference'.\n",
       "\n",
       "    Returns:\n",
       "        The requested dataset as a CustomMNIST object.\n",
       "    \"\"\"\n",
       "    transform = transforms.Compose(\n",
       "        [\n",
       "            transforms.ToTensor(),\n",
       "            # norm by mean and var\n",
       "            transforms.Normalize((0.1307,), (0.3081,)),\n",
       "            # Reshape image to 1x28x28\n",
       "            lambda x: x.reshape((1, 28, 28)),\n",
       "        ]\n",
       "    )\n",
       "\n",
       "    data_dir = Path(params[\"data_dir\"])\n",
       "    img_file = data_dir / (dataset_name + \"-images-idx3-ubyte\")\n",
       "    if params[\"inference\"]:  # if running for inference there's no labels file\n",
       "        lbl_file = None\n",
       "    else:\n",
       "        lbl_file = data_dir / (dataset_name + \"-labels-idx1-ubyte\")\n",
       "    dataset = CustomMNIST(img_file, lbl_file, transform=transform)\n",
       "\n",
       "    return dataset"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "f9800bdc",
      "metadata": {},
      "source": [
       "## Create the inference function \n",
       "\n",
       "This function runs inference on the RDU to predict image labels using our trained model."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 13,
      "id": "2a883e06",
      "metadata": {},
      "outputs": [],
      "source": [
       "def batch_predict(model, dataset_name: str, params):\n",
       "    \"\"\"\n",
       "    Generates the predictions for a given model on a dataset.\n",
       "\n",
       "    Args:\n",
       "        model (object): The trained model to use for prediction.\n",
       "        dataset_name (str): The name of the dataset to use for prediction.\n",
       "        params (dict): Additional parameters for the prediction.\n",
       "\n",
       "    Returns:\n",
       "        None\n",
       "    \"\"\"\n",
       "    if dataset_name is None:\n",
       "        dataset_name = \"inference\"\n",
       "    data_dir = Path(params[\"data_dir\"])\n",
       "    print(f\"Using dataset: {data_dir / dataset_name}\")\n",
       "    dataset = get_dataset(dataset_name, params)\n",
       "\n",
       "    loader = DataLoader(\n",
       "        dataset, batch_size=params.get(\"batch_size\", 32), drop_last=True, shuffle=False\n",
       "    )\n",
       "\n",
       "    predicted_labels = []\n",
       "    for _, (images, labels) in enumerate(loader):\n",
       "        sn_images = samba.from_torch_tensor(images, name=\"image\", batch_dim=0)\n",
       "        sn_labels = samba.from_torch_tensor(labels, name=\"label\", batch_dim=0)\n",
       "\n",
       "        loss, predictions = samba.session.run(\n",
       "            input_tensors=[sn_images, sn_labels],\n",
       "            output_tensors=model.output_tensors,\n",
       "            section_types=[\"fwd\"],\n",
       "        )\n",
       "        loss, predictions = samba.to_torch(loss), samba.to_torch(predictions)\n",
       "        _, predicted_indices = torch.max(predictions, axis=1)\n",
       "\n",
       "        predicted_labels += predicted_indices.tolist()\n",
       "\n",
       "    return predicted_labels"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "acce38a5",
      "metadata": {},
      "source": [
       "## Run inference"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 14,
      "id": "f876c09e",
      "metadata": {},
      "outputs": [
       {
        "name": "stdout",
        "output_type": "stream",
        "text": [
         "Loading checkpoint from file /home/pavela/checkpoints/lenet/checkpoints-b32/1875.pt\n",
         "Loading model...\n",
         "Using dataset: /home/pavela/datasets/fashion-mnist/t10k\n"
        ]
       }
      ],
      "source": [
       "# Reset the Samba session\n",
       "samba.session.reset(params[\"pef\"])\n",
       "# Create the model for 10 output classes\n",
       "model = LeNet(10)\n",
       "# Copy the model to RDU\n",
       "samba.from_torch_model_(model)\n",
       "# The optimizer is set to None because we are running inference\n",
       "optimizer = None\n",
       "# We need a valid checkpoint to run inference\n",
       "if prepare(model, optimizer, params):\n",
       "    predicted_labels = batch_predict(model, params[\"dataset_name\"], params)\n",
       "else:\n",
       "    print(\"No valid checkpoint has been provided. Can't run prediction.\")"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "ac979eb5",
      "metadata": {},
      "source": [
       "## Save the predicted labels\n",
       "\n",
       "Save the predicted labels in a file in the same format that MNIST uses to provide training labels."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 31,
      "id": "7c1a7d2a",
      "metadata": {},
      "outputs": [],
      "source": [
       "results_dir = Path(params[\"results_dir\"])\n",
       "results_dir.mkdir(parents=True, exist_ok=True)\n",
       "write_labels(predicted_labels, str(results_dir / \"prediction-labels-idx1-ubyte\"))"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "562d8599",
      "metadata": {},
      "source": [
       "## Create a custom dataset with predicted labels\n",
       "\n",
       "Read the images from our inference dataset and predicted labels from the file generated by the model."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 42,
      "id": "467ea79f",
      "metadata": {},
      "outputs": [],
      "source": [
       "data_dir = Path(params[\"data_dir\"])\n",
       "predict_images_file = data_dir / \"t10k-images-idx3-ubyte\"\n",
       "predict_labels_file = results_dir / \"prediction-labels-idx1-ubyte\""
      ]
     },
     {
      "cell_type": "markdown",
      "id": "75e7328f",
      "metadata": {},
      "source": [
       "Create a custom PyTorch dataset using the inference images, predicted labels, and the `data_transform` function."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 44,
      "id": "6ec14f3c",
      "metadata": {},
      "outputs": [],
      "source": [
       "data_transform = transforms.Compose(\n",
       "    [transforms.ToTensor(), transforms.Normalize(mean=(0.1307,), std=(0.3081,))]\n",
       ")\n",
       "\n",
       "predict_dataset = CustomMNIST(\n",
       "    predict_images_file, predict_labels_file, transform=data_transform\n",
       ")"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "28b5e7df",
      "metadata": {},
      "source": [
       "## Create a display function"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 34,
      "id": "4688296f",
      "metadata": {},
      "outputs": [],
      "source": [
       "import matplotlib.pyplot as plt\n",
       "\n",
       "\n",
       "def plot_examples(images, labels, columns=5):  # by default set rows=1\n",
       "    rows = len(images) // (columns)\n",
       "    fig = plt.figure(figsize=(6.4, 2 * rows))\n",
       "    for i, (img, lbl) in enumerate(zip(images, labels)):\n",
       "        if i < columns * rows:\n",
       "            ax = fig.add_subplot(rows, columns, i + 1)\n",
       "            ax.imshow(img.reshape(28, 28), cmap=\"gray\")\n",
       "            ax.set_xticks([])  # set empty label for x axis\n",
       "            ax.set_yticks([])  # set empty label for y axis\n",
       "            ax.set_title(lbl)\n",
       "    plt.tight_layout()\n",
       "    return fig"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "082af9b7",
      "metadata": {},
      "source": [
       "## Display the images and predicted labels\n",
       "\n",
       "Pick several (`sample_size`) images and their predicted labels randomly and display them. We use the Fashion MNIST dataset labels here."
      ]
     },
     {
      "cell_type": "code",
      "execution_count": 45,
      "id": "d29b1874",
      "metadata": {},
      "outputs": [
       {
        "data": {
         "image/png": "\n",
         "text/plain": [
          "<Figure size 460.8x576 with 20 Axes>"
         ]
        },
        "metadata": {},
        "output_type": "display_data"
       }
      ],
      "source": [
       "import random\n",
       "\n",
       "images = []\n",
       "labels = []\n",
       "sample_size = 20\n",
       "\n",
       "fashion_items = [\n",
       "    \"T-shirt/top\",\n",
       "    \"Trouser\",\n",
       "    \"Pullover\",\n",
       "    \"Dress\",\n",
       "    \"Coat\",\n",
       "    \"Sandal\",\n",
       "    \"Shirt\",\n",
       "    \"Sneaker\",\n",
       "    \"Bag\",\n",
       "    \"Ankle boot\",\n",
       "]\n",
       "\n",
       "for i in range(0, sample_size):\n",
       "    # randrange excludes the upper bound, so the index is always valid\n",
       "    r = random.randrange(len(predict_dataset))\n",
       "    images.append(predict_dataset[r][0])\n",
       "    labels.append(fashion_items[predict_dataset[r][1]])\n",
       "\n",
       "# Keep the trailing semicolon, or the notebook displays the figure twice\n",
       "plot_examples(images, labels, columns=5);"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "8830f532",
      "metadata": {},
      "outputs": [],
      "source": []
     }
    ],
    "metadata": {
     "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
     },
     "language_info": {
      "codemirror_mode": {
       "name": "ipython",
       "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.6"
     }
    },
    "nbformat": 4,
    "nbformat_minor": 5
   }

SambaNova recommends that you create your own directory inside your home directory for the tutorial code:

  1. Log in to your SambaNova environment.

  2. Create a directory for the tutorials, and a subdirectory for lenet.

    $ mkdir $HOME/tutorials
    $ mkdir $HOME/tutorials/lenet
  3. Copy the files that you just downloaded into $HOME/tutorials/lenet.

Prepare the dataset

This tutorial uses the Fashion MNIST dataset (images of items of clothing) available on https://github.com/zalandoresearch/fashion-mnist. Fashion MNIST is a drop-in replacement for the classic MNIST dataset (images of handwritten digits):

  • 60,000 images in the training set

  • 10,000 images in the test set.

We decided to use Fashion MNIST in this tutorial because it’s a little more challenging to train than the original MNIST dataset and it looks more interesting.

  1. Create a subdirectory for your datasets in your home directory. In this example we use $HOME/datasets.

    $ mkdir -p $HOME/datasets/
  2. Create the subdirectory for the Fashion MNIST dataset and set the DATADIR environment variable to point to this location.

    $ mkdir -p $HOME/datasets/fashion-mnist
    $ export DATADIR=$HOME/datasets/fashion-mnist
  3. Download and extract the datasets.

    $ wget -P $DATADIR http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
    $ wget -P $DATADIR http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
    $ wget -P $DATADIR http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
    $ wget -P $DATADIR http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
    $ cd $DATADIR
    $ gunzip *gz
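
Before compiling, it can be worth confirming that the extracted image files are intact. The following is a minimal sketch, assuming the files use the standard IDX layout of the original MNIST distribution: a big-endian header with magic number 2051 for image files, followed by the image count, rows, and columns, then one byte per pixel. The helper names read_idx3_header and check_image_file are hypothetical and are not part of the tutorial code.

```python
import struct
from pathlib import Path


def read_idx3_header(path):
    """Return (count, rows, cols) from an IDX3 image file header.

    Assumes the standard IDX layout: four big-endian 32-bit integers
    (magic number 2051, image count, rows, columns).
    """
    with open(path, "rb") as f:
        header = f.read(16)
    if len(header) != 16:
        raise ValueError(f"{path}: file too short for an IDX3 header")
    magic, count, rows, cols = struct.unpack(">IIII", header)
    if magic != 2051:
        raise ValueError(f"{path}: unexpected magic number {magic}")
    return count, rows, cols


def check_image_file(path, expected_count):
    """Check the header values and that the file size matches the header."""
    count, rows, cols = read_idx3_header(path)
    expected_size = 16 + count * rows * cols  # one byte per pixel
    size_ok = Path(path).stat().st_size == expected_size
    return count == expected_count and (rows, cols) == (28, 28) and size_ok
```

Calling check_image_file on train-images-idx3-ubyte with an expected count of 60000 should return True when the download and extraction succeeded.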

Compile the model

Before you can train a model on an RDU, you have to compile it for training.

Each model’s code contains a compilation function. You call the function as python <model.py> compile <compile_args>. See the Compiler Reference for some background on compiler arguments.

How to compile the model

  1. Change to the model directory:

    $ cd ~/tutorials/lenet
  2. Compile the model:

    $ python lenet.py compile --batch-size 32 \
      --pef-name lenet-b32

    Compilation messages are sent to stdout. You can ignore most messages. At the end of that output you will see the following message:

    [info   ] PEF file /home/snuser1/tutorials/lenet/out/lenet-b32/lenet-b32.pef created

    You’ll need the PEF file that the compiler generates to run training, testing, and inference.
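
The PEF location follows from the compile arguments: the log message shows the file under out/lenet-b32/lenet-b32.pef, that is, a directory named after the --pef-name value inside the output folder. As a minimal sketch, assuming that layout holds for your SambaFlow version, a hypothetical helper (expected_pef_path, not part of SambaFlow) can build the path to pass to later run commands:

```python
from pathlib import Path


def expected_pef_path(pef_name, output_folder="out"):
    """Build the PEF path implied by the compile log message.

    Assumption: the compiler writes
    OUTPUT_FOLDER/PEF_NAME/PEF_NAME.pef, as in the tutorial's example.
    """
    return Path(output_folder) / pef_name / f"{pef_name}.pef"


pef = expected_pef_path("lenet-b32")
print(pef)           # relative path to pass to the run command with -p
print(pef.exists())  # True only after a successful compile in this directory
```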

Compilation arguments

Before calling the compile command, have a look at the available compilation arguments.

  • All models support the shared arguments that are documented in the Compiler Reference.

  • All models support an additional set of experimental shared arguments, usually used when working with SambaNova Support. To include these arguments in the help output, run <model_name>.py compile --debug --help.

  • Each model has an additional set of model-specific arguments. Those arguments are different for different models.

To list the available arguments, run:

$ python lenet.py compile --help

If you run the compile command without parameters, the compiler uses a set of reasonable defaults.

The compiler reference document discusses arguments used by all models. Here’s a list of other arguments:

--num-classes. Defines the number of classes used for classification. The Fashion MNIST dataset used in this tutorial has 10 categories of clothing, so the number of classes is 10. This is the value that is set in the application’s code.

--num-features. Defines the number of pixels for each image in the dataset. With the Fashion MNIST dataset we use in this tutorial, each picture is 28×28 pixels, so num-features is 784. This is the value that is set in the application’s code.

--output-folder. Output folder where compilation artifacts will be stored. By default, the compiler creates a folder called out in your current folder. Inside that directory the compilation script creates a separate directory for each compilation run. See https://docs.sambanova.ai/developer/latest/compiler-reference.html#_pef-name.

Train the model

SambaFlow supports a run command for training, testing, and inference.

Common arguments to run

You can check the available command-line options by using --help:

$ python lenet.py run --help

Many run arguments are predefined by SambaFlow, but most models also define model-specific arguments. The most important arguments for this tutorial are:

  • --pef. Full path for a PEF file that was generated by the compiler. Copy-paste the filename from the last line of the compilation output.

  • --data-dir. Data directory. In this tutorial, the directory to which you downloaded the Fashion MNIST dataset.

  • --ckpt-dir. During training, SambaFlow saves checkpoints to this directory. You can later load a checkpoint to continue a training run that was interrupted, or load a checkpoint for inference.

  • --init-ckpt-path. Full path for a checkpoint file. Use this argument to resume training from a saved checkpoint, or to load a checkpoint for testing or inference.

Train for one epoch

  1. Start a training run for one epoch using:

    • The dataset you downloaded before.

    • The PEF file you generated in the compilation step.

      $ export DATADIR=$HOME/datasets/fashion-mnist
      $ python lenet.py run \
          --batch-size 32 \
          -p out/lenet-b32/lenet-b32.pef \
          --data-dir $DATADIR \
          --ckpt-dir checkpoints
  2. With this model and dataset, training should not take more than a minute. On stdout, you see a training log that reports the loss every 100 steps. Here’s an example, abbreviated in the middle.

    Using dataset: /home/snuser1/datasets/fashion-mnist/train
    ==============================
    Initial epoch:   0, initial step:      0
    Target epoch:    1, target step:    1875
    Epoch [1/1], Step [100/1875], Loss: 1.5596
    Epoch [1/1], Step [200/1875], Loss: 1.2914
    Epoch [1/1], Step [300/1875], Loss: 1.1413
    Epoch [1/1], Step [400/1875], Loss: 1.0423
    ...
    Epoch [1/1], Step [1400/1875], Loss: 0.7138
    Epoch [1/1], Step [1500/1875], Loss: 0.7010
    Epoch [1/1], Step [1600/1875], Loss: 0.6893
    Epoch [1/1], Step [1700/1875], Loss: 0.6792
    Epoch [1/1], Step [1800/1875], Loss: 0.6695
  3. Verify that the model saved a checkpoint file under ./checkpoints. The file name corresponds to the number of training steps taken.

    $ ls  ./checkpoints/
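
The step counts in the log follow from the dataset size and the batch size: 60,000 training images divided into batches of 32 give 1,875 steps per epoch, which is why the log shows a target step of 1875. A minimal sketch of that arithmetic (the function name is illustrative, and it assumes that any incomplete final batch is dropped):

```python
def steps_per_epoch(num_samples, batch_size):
    """Number of full batches in one pass over the dataset."""
    return num_samples // batch_size


train_steps = steps_per_epoch(60_000, 32)
print(train_steps)      # 1875, matching "target step: 1875" in the log
print(train_steps * 2)  # 3750, the total step count after two epochs
```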

Train for two or more epochs using a checkpoint

You can continue training from the checkpoint that was saved during the first training run. For more complex models, multiple training runs are helpful. If you train for several epochs and each epoch takes significant time (hours or days):

  1. Stop training after several epochs.

  2. Start training again the next day from the last saved checkpoint.

Checkpoints are also helpful when a training run is interrupted, for example by a hardware or software failure. You can restart training from the last checkpoint.

To start training from a saved checkpoint, specify the checkpoint file with the --init-ckpt-path argument and specify the total number of epochs to train for with --num-epochs. In this example we train for two total epochs. The checkpoint was saved after one epoch, so this second training run will be for one more epoch.

  1. For the second training run, run this command:

    $ python lenet.py run \
        --batch-size 32 \
        -p out/lenet-b32/lenet-b32.pef \
        --data-dir $DATADIR \
        --ckpt-dir checkpoints \
        --init-ckpt-path checkpoints/1875.pt \
        --num-epochs 2

    This time the training run started from 1875 steps and reached 3750 steps.

  2. Examine the output, which shows that the loss continues to go down. Here’s an example, abbreviated in the middle.

    Using dataset: /home/snuser1/datasets/fashion-mnist/train
    ==============================
    Initial epoch:   1, initial step:   1875
    Target epoch:    2, target step:    3750
    Epoch [2/2], Step [1975/3750], Loss: 0.4920
    Epoch [2/2], Step [2075/3750], Loss: 0.4945
    Epoch [2/2], Step [2175/3750], Loss: 0.4875
    Epoch [2/2], Step [2275/3750], Loss: 0.4927
    ...
    Epoch [2/2], Step [3275/3750], Loss: 0.4761
    Epoch [2/2], Step [3375/3750], Loss: 0.4745
    Epoch [2/2], Step [3475/3750], Loss: 0.4729
    Epoch [2/2], Step [3575/3750], Loss: 0.4720
    Epoch [2/2], Step [3675/3750], Loss: 0.4707
  3. Verify that the resulting checkpoint is saved under ./checkpoints/ as 3750.pt.

    $ ls checkpoints/
    3750.pt  1875.pt
  4. Optionally, use the new checkpoint to train for the third and other epochs by changing the number of epochs and the checkpoint file name. For example:

    $ python lenet.py run \
        --batch-size 32 \
        -p out/lenet-b32/lenet-b32.pef \
        --data-dir $DATADIR \
        --ckpt-dir checkpoints \
        --init-ckpt-path checkpoints/3750.pt \
        --num-epochs 3
  5. Verify that the loss decreases and the accuracy increases, but only slightly. For this simple model, you can stop training after two or three epochs. For more complex models and datasets, the number of epochs needed for optimal accuracy differs.
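
Because checkpoint files are named after the number of completed training steps, the most recent one can be picked programmatically. A minimal sketch, assuming every checkpoint in the directory is named after its step count (for example 1875.pt or 3750.pt); latest_checkpoint is a hypothetical helper, not part of the tutorial code:

```python
from pathlib import Path


def latest_checkpoint(ckpt_dir):
    """Return the checkpoint with the highest step count, or None.

    Assumes files are named after the step count, such as 1875.pt.
    Files whose stem is not an integer are ignored.
    """
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return None
    candidates = [p for p in ckpt_dir.glob("*.pt") if p.stem.isdigit()]
    # Compare numerically so that 10000.pt ranks above 3750.pt
    return max(candidates, key=lambda p: int(p.stem), default=None)


ckpt = latest_checkpoint("checkpoints")
if ckpt is not None:
    print(f"--init-ckpt-path {ckpt}")
```

Comparing the step counts as integers rather than as strings matters once the counts have different numbers of digits.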

Test model accuracy

After you have trained the model for several epochs, you can test its accuracy.

  1. Pick one of the saved checkpoints and run a test against the test dataset.

    $ python lenet.py run --test \
        --batch-size 32 \
        -p out/lenet-b32/lenet-b32.pef \
        --data-dir $DATADIR \
        --init-ckpt-path checkpoints/1875.pt
  2. Verify that your output looks like this (your exact numbers might be different):

    Using dataset: /home/snuser1/datasets/fashion-mnist/t10k
    Test Accuracy: 82.57  Loss: 0.5030
  3. To compare the accuracy for different numbers of epochs, run the same command with different checkpoint filenames and compare the accuracy and loss numbers. If the accuracy is still steadily increasing and the loss is decreasing, then running the model for more epochs will likely increase accuracy.

  4. Run the model for more epochs if you expect benefits.
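
Comparing checkpoints is easier if the accuracy and loss are extracted from the test output rather than read by eye. A minimal sketch, assuming the test run prints a summary line in the format shown in step 2; the helper name parse_test_output is hypothetical:

```python
import re

# Matches the summary line printed by the test run, for example:
#   Test Accuracy: 82.57  Loss: 0.5030
_TEST_LINE = re.compile(r"Test Accuracy:\s*([0-9.]+)\s+Loss:\s*([0-9.]+)")


def parse_test_output(text):
    """Return (accuracy, loss) from test-run output, or None if absent."""
    match = _TEST_LINE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


sample = """Using dataset: /home/snuser1/datasets/fashion-mnist/t10k
Test Accuracy: 82.57  Loss: 0.5030"""
print(parse_test_output(sample))  # (82.57, 0.503)
```

Running the test command once per checkpoint and parsing each output this way gives one (accuracy, loss) pair per checkpoint to compare.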

Run and verify inference

When you are satisfied with the accuracy you can use the model for inference. Inference means that you use a file with the same data format but without labels. The inference run adds the labels.

For large models, the workflow includes a separate compilation step for inference. For simple models, that step isn’t necessary. We can use the PEF file we generated during the initial compilation.

Run inference

To run inference:

  1. Create a new file with images but no labels from the test dataset.

    $ cp $DATADIR/t10k-images-idx3-ubyte $DATADIR/inference-images-idx3-ubyte
  2. Run inference for this new dataset. You pass in both the PEF and the checkpoint file.

    $ python lenet.py run --inference \
        --batch-size 32 \
        -p out/lenet-b32/lenet-b32.pef \
        --data-dir $DATADIR \
        --dataset-name inference \
        --results-dir ./results \
        --init-ckpt-path checkpoints/3750.pt

    The command generates a list of predictions and stores it in the same format as the labels file.

  3. To verify that the predictions file has been created, go to the results directory.

    $ ls -l ./results
  4. Look for a recently created file with a name similar to the MNIST label files. For example:

    -rw-rw-r--. 1 snuser1 snuser1 10008 Jul  6 13:04 prediction-labels-idx1-ubyte

Check predictions

To check predictions, it’s easiest to look at the images that were used for inference and at the generated prediction (labels). Then we can estimate the prediction accuracy visually.

  1. Start Jupyter Notebook on the RDU host by running the following commands:

    $ cd ~/sambanova-tutorials/lenet
    $ nohup jupyter-notebook --ip=0.0.0.0 --no-browser &
  2. Enter the Jupyter URL in a browser that has access to the RDU host. You will see a list of files including visualize_predictions.ipynb.

  3. Open visualize_predictions.ipynb. You should see something like this.

    Jupyter notebook to visualize predictions
  4. Run the notebook cell by cell (or run all cells at once). At the bottom you will see the predictions, which look like this:

    Model predictions in Jupyter notebook
  5. Try to estimate visually how many of the images the model got right and wrong.
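
A visual check can be complemented by a numeric one, because the inference images were copied from the test set and the test labels are available. A minimal sketch, assuming the predictions file uses the standard IDX1 label layout (an 8-byte big-endian header with magic number 2049 and the label count, followed by one byte per label), which is consistent with the 10,008-byte file size shown earlier. The helper names read_idx1_labels and accuracy are hypothetical.

```python
import struct


def read_idx1_labels(path):
    """Read labels from an IDX1 file (assumed standard MNIST label layout)."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 8:
        raise ValueError(f"{path}: file too short for an IDX1 header")
    magic, count = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError(f"{path}: unexpected magic number {magic}")
    labels = list(data[8:8 + count])  # one unsigned byte per label
    if len(labels) != count:
        raise ValueError(f"{path}: expected {count} labels, found {len(labels)}")
    return labels


def accuracy(predicted, actual):
    """Fraction of matching labels over the shorter of the two lists."""
    n = min(len(predicted), len(actual))
    if n == 0:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / n
```

Passing the labels read from results/prediction-labels-idx1-ubyte and from t10k-labels-idx1-ubyte in the data directory to accuracy gives a value that can be compared with the test accuracy reported for the same checkpoint.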