Image classification tutorial

Image classification is the task of assigning an image one or more labels from a set of predefined classes. In this tutorial, we describe a general approach to image classification on the SambaStudio platform.

Data preparation

In an image classification task, the classification data typically consists of a set of images along with corresponding labels. The images can be in any standard format, such as JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics).

Training dataset requirements

An uploaded dataset is expected to have the following components:

  • A directory of images.

  • A labels.csv file that maps each image's location to its label(s) and identifies its train, test, or validation split.

  • A class_to_idx.json file [Optional]. This file maps each class’s verbose name to its class index, providing a way to recover the human-readable name from the index number that corresponds to a specific class label.

The uploaded data should have a directory structure similar to the example below.

Example directory structure
.
└── data_root/
    ├── images/
    ├── labels.csv
    └── class_to_idx.json  # Optional

The directory name images in the above example is not strictly required. The only requirement is that the image path specified in labels.csv is relative to the data_root directory.

For example, the following directory is also valid assuming labels.csv points to images in the location train/path/to/image.jpeg.

.
└── data_root/
    ├── train/
    ├── test/
    └── labels.csv
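Since every image path in labels.csv must resolve relative to the data_root directory, it can be worth validating the layout before upload. The following is a minimal sketch of such a check; the function name and the data_root location are illustrative, not part of the platform:

```python
# Sketch: verify that every image_path listed in labels.csv resolves to a
# file under data_root, following the layout described above.
import csv
from pathlib import Path


def find_missing_images(data_root):
    """Return the image_path entries in labels.csv that do not exist on disk."""
    data_root = Path(data_root)
    missing = []
    with open(data_root / 'labels.csv', newline='') as f:
        for row in csv.DictReader(f):
            if not (data_root / row['image_path']).is_file():
                missing.append(row['image_path'])
    return missing
```

Running this against your dataset directory before upload catches broken paths early, when they are cheap to fix.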

Image formats

JPEG (.jpg extension) and PNG (.png extension) are allowed formats. All images should be three channel RGB, with uint8 encoding. For example, if initial images have a fourth alpha channel, this will need to be removed during the dataset processing step.
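If your source images carry a fourth alpha channel (for example RGBA PNGs), they need to be converted to three-channel RGB during dataset processing. A minimal sketch using Pillow, with illustrative paths:

```python
# Sketch: convert an image to three-channel RGB, dropping an alpha channel
# if present, as required by the platform.
from PIL import Image


def to_rgb(src_path, dst_path):
    with Image.open(src_path) as img:
        # convert() handles RGBA, grayscale, and palette images alike
        img.convert('RGB').save(dst_path)
```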

Labels CSV

You are required to provide a .csv file specifying each image-label pair, along with an indicator of whether the sample belongs to the training, test, or validation set. This information is denoted by the column headers described below:

  • image_path header denotes the relative path to a given image inside of the dataset directory.

  • label header denotes the class id of the image, in the range [0..n-1], where n is the number of classes. For multi-label classification, multiple labels on a sample are separated by spaces.

  • subset header denotes one of train, test, or validation, indicating which split the image belongs to.

  • metadata [Optional] header denotes information relating to the given input data.

Example .csv file
$ column -s, -t caltech256.csv | head -n 4
image_path                                  label  subset       metadata
./images/138.mattress/138_0117.jpg          0      train
./images/138.mattress/138_0103.jpg          0 3 11 validation
./images/138.mattress/138_0088.jpg          0      train

The column command is used above to pretty-print the .csv for display purposes.
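A labels.csv in this format can also be generated programmatically. The sketch below uses pandas; the image paths and labels are illustrative placeholders:

```python
# Sketch: write a labels.csv in the format described above using pandas.
# The image paths and labels are illustrative placeholders.
import pandas as pd

rows = [
    # (image_path, label, subset, metadata); multi-label entries are
    # space-separated, as in the '1 3' validation sample below
    ('images/cat/cat_001.jpg', '0', 'train', ''),
    ('images/dog/dog_001.jpg', '1 3', 'validation', ''),
    ('images/cat/cat_002.jpg', '0', 'test', ''),
]
df = pd.DataFrame(rows, columns=['image_path', 'label', 'subset', 'metadata'])
df.to_csv('labels.csv', index=False)
```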

The class index mapping file

The class_to_idx.json file maps the human-interpretable class name to the class index. The expected format of this file is a string-to-index dictionary.

Example class_to_idx.json file
$ python -m json.tool imagenet1000.json | head
{
    "tench, Tinca tinca": 0,
    "goldfish, Carassius auratus": 1,
    "great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias": 2,
    "tiger shark, Galeocerdo cuvieri": 3,
    "hammerhead, hammerhead shark": 4,
    "electric ray, crampfish, numbfish, torpedo": 5,
    "stingray": 6,
    "cock": 7,
    "hen": 8,

The class_to_idx.json file is not used in the app workflow. It is included as a record of how the dataset was created, so a user can retrieve this information at a later date.

The app does not check the existence of this file or its correctness.
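If you do want this record, a class_to_idx.json file can be generated with a few lines of Python; the class names below are illustrative placeholders:

```python
# Sketch: build a class_to_idx.json mapping verbose class names to indices.
# The class names are illustrative placeholders.
import json

class_names = ['mattress', 'microwave', 'motorbike']
class_to_idx = {name: idx for idx, name in enumerate(class_names)}

with open('class_to_idx.json', 'w') as f:
    json.dump(class_to_idx, f, indent=4)
```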

CIFAR 100 example

Download the CIFAR 100 data from https://www.cs.toronto.edu/~kriz/cifar.html

import asyncio
import aiofiles
from io import BytesIO
from PIL import Image
import pickle
from pathlib import Path
import pandas as pd
import random


# Change to False if only the labels.csv file needs to be generated
SAVE_IMAGE = True

# The async code can open too many files at one time, so limit concurrency
num_of_max_files_open = 200

data_dir = Path('./data')
data_dir.mkdir(exist_ok=True)


def unpickle(file):
    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data


def load_subset(subset):
    data = unpickle(f'./cifar-100-python/{subset}')

    filenames = data[b'filenames']
    labels = data[b'fine_labels']
    images = data[b'data']

    assert len(labels) == len(images)
    assert len(filenames) == len(images)

    return filenames, labels, images


async def save_image(path: str, image: memoryview) -> None:
    async with aiofiles.open(path, "wb") as file:
        await file.write(image)


async def write_image(filename, label, image, subset, row, sem):
    subset_dir = data_dir / subset
    filepath = subset_dir / filename.decode()

    async with sem:
        # Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the
        # next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32
        # entries of the array are the red channel values of the first row of the image.
        if SAVE_IMAGE:
            image = image.reshape(3, 32, 32).transpose(1, 2, 0)
            img = Image.fromarray(image)
            buffer = BytesIO()
            img.save(buffer, format='png')
            await save_image(filepath, buffer.getbuffer())

    if row % 100 == 0:
        print(f"{row:05d}", flush=True)

    if subset == 'test':
        # we use the ``test`` set as the ``validation`` set in this example
        subset = 'validation'

    return [str(filepath.relative_to(data_dir)), str(label), subset, str('')]


async def process_subset(subset):

    subset_dir = data_dir / subset
    subset_dir.mkdir(exist_ok=True)

    tasks = []
    sem = asyncio.Semaphore(num_of_max_files_open)
    filenames, labels, images = load_subset(subset)

    for row, sample in enumerate(zip(filenames, labels, images)):
        tasks.append(asyncio.ensure_future(write_image(*sample, subset=subset, row=row, sem=sem)))

    results = await asyncio.gather(*tasks)
    df = pd.DataFrame(results, columns=["image_path", "label", "subset", "metadata"])
    return df


async def main():

    print("Processing training images")
    train_df = await process_subset('train')
    print("Processing test images")
    test_df = await process_subset('test')
    df = pd.concat([train_df, test_df])
    df.to_csv(data_dir / 'labels.csv', index=False)


asyncio.run(main())
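After running the script, the generated labels.csv can be sanity-checked before upload. The helper below is a hypothetical sketch, not part of the script above; note that the script maps the CIFAR test split to validation:

```python
# Sketch: sanity-check a generated labels.csv against the expected layout.
# For the CIFAR 100 output above, the script maps 'test' to 'validation',
# so only 'train' and 'validation' should appear in the subset column.
import pandas as pd


def check_labels(csv_path, expected_classes):
    """Validate column names, subsets, and class count; return the row count."""
    df = pd.read_csv(csv_path, keep_default_na=False)
    assert list(df.columns) == ['image_path', 'label', 'subset', 'metadata']
    assert set(df['subset']) <= {'train', 'validation'}
    assert df['label'].nunique() == expected_classes
    return len(df)
```

For the CIFAR 100 output, check_labels('./data/labels.csv', 100) should pass without raising.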

Fine-tune

You will need to add your dataset to the platform before you can use it for fine-tuning.

Add the dataset

If you plan to use the sample dataset provided by the platform, you can skip this section and proceed to Create a project using the GUI.

Follow the steps below to add the dataset using the GUI.

  1. Click Datasets from the left menu to navigate to the Dataset Hub window.

  2. Click the Add dataset button.

    Dataset hub
    Figure 1. Dataset Hub
    The Add a dataset window will open.

  3. In the Dataset name field, input a name for your dataset.

    Including descriptive information, such as the dataset size, in the name will help you select the appropriate dataset when creating a training job.

  4. From the Job type dropdown, select whether the dataset is to be used for Train/Evaluation or Batch inference.

  5. The Share settings drop-down provides options for which tenants your dataset is shared with.

    1. Share with <current-tenant> allows the dataset to be shared with the current tenant you are using, identified by its name in the drop-down.

    2. Share with all tenants allows the dataset to be shared across all tenants.

    3. Dataset will be shared with all users in <current-tenant> identifies that the dataset will be shared with other users in the tenant you are using.

      If the Dataset will be shared with all users in <current-tenant> option is displayed, the Share with <current-tenant> and Share with all tenants options described above will not be available. Share with all tenants is an optional feature of SambaStudio. Please contact your administrator or SambaNova representative for more information.

  6. From the Applicable ML Apps drop-down, select the ML App(s) that you wish the dataset to be associated with. Multiple ML Apps can be selected.

    Be sure to select ML Apps that correspond with your dataset, as the platform will not warn you if a selected ML App does not correspond with your dataset.

    Add dataset
    Figure 2. Add a dataset

Import the dataset from AWS S3

Follow the steps below to import your dataset from AWS S3.

  • The dataset is imported from AWS S3 only once during dataset creation.

  • AWS S3 credentials are not stored.

  1. Select AWS from the Source drop-down.

  2. In the Bucket field, input the name of your S3 bucket.

  3. Input the relative path to the dataset in the S3 bucket into the Folder field. This folder should include the required dataset files for the task (for example, the labels, training, and validation files).

  4. In the Access key ID field, input the unique ID provided by AWS IAM to manage access.

  5. Enter your Secret access key into the field. This allows authenticated access for the provided Access key ID.

  6. Enter the AWS Region in which your S3 bucket resides into the Region field.

    An Access key, Secret access key, and user access permissions are required for AWS S3 import.

    Import AWS S3
    Figure 3. Import from AWS S3

Create a project using the GUI

Follow the steps below to create a project using the GUI.

  1. Click Projects from the left menu to view the Projects window.

  2. Click New project. The Create a new Project window will open.

  3. Enter a name into the Project name field.

  4. Add a brief description into the Description field.

  5. Confirm that you want to create the new project:

    1. Click Save and close to create your project and go to the Projects window.

    2. Click Save and continue to create your project and go to your new project’s window. From the project window you can create a new job or new endpoint to be associated with your project.

    3. Click Cancel to stop the creation of a new project and return to the Projects window.

Create Project
Figure 4. Create a project

Training jobs

You can fine-tune existing models by creating a new training job. The platform will perform the required data processing, such as tokenization, behind the scenes. You can select either a platform provided dataset or your own dataset.

Create a training job using the GUI

You no longer need to specify a Task during job creation. Instead, you select a model to use directly during the workflow or use the new ML App field to filter the list of models to use.

Create a new training job using the GUI for fine-tuning by following the steps below.

To navigate to a project, click Projects from the left menu and then select the desired project from the Projects window. See the Projects document for information on how to create and delete projects.

  1. Create a new project or use an existing one.

  2. From a project window, click New job. The Create a new job window will appear.

  3. Select Train model under Create a new job.

  4. Enter a name for the job into the Job name field.

  5. Select Image classification from the ML App drop-down.

  6. From the Select dataset drop-down, choose My datasets, SambaNova datasets, or Select from datasets.

    1. My datasets displays a list of datasets that you have added to the platform and can be used for the selected ML App.

    2. SambaNova datasets displays a list of platform provided datasets for the selected ML App.

      Create training job
      Figure 5. Create a training job
    3. Select from datasets displays the Dataset Hub window with a detailed list of datasets that can be used for the selected ML App. The My datasets and SambaNova checkboxes filter the dataset list by their respective group. The ML App drop-down filters the dataset list by the corresponding ML App. Choose the dataset you wish to use and confirm your choice by clicking Use dataset.

      Dataset Hub
      Figure 6. Dataset Hub
  7. Set the hyperparameters to govern your training job or use the default values. Expand the Hyperparameters & settings pane by clicking the blue double arrows to set hyperparameters and adjust settings.

    The num_intended_classes setting needs to match the number of classes in your dataset. For the CIFAR 100 example, the num_intended_classes setting would be 100.

    Hyperparameters
    Figure 7. Hyperparameters & settings
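The number of classes can be derived directly from your labels.csv by counting distinct class ids, splitting space-separated multi-label entries. A minimal sketch:

```python
# Sketch: count the distinct class ids in a labels.csv, including ids that
# appear only inside space-separated multi-label entries.
import csv


def count_classes(csv_path):
    classes = set()
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            classes.update(str(row['label']).split())
    return len(classes)
```

The returned count is the value to use for the num_intended_classes setting.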

Evaluate the job using the GUI

Navigate to a Training job’s detail page during the job run (or after its completion) to view job information, generated checkpoints, and metrics. You can evaluate a checkpoint’s accuracy, loss, and other metrics to determine if the checkpoint is of sufficient quality to deploy.

Navigate to a Training job’s detail page from the Dashboard or from its associated Project page.

View information and metrics

You can view the following information and metrics about your training job.

Model

Displays the model name and architecture used for training.

Dataset

Displays the dataset used, including its size.

Details & Hyperparameters

Displays the number of RDUs utilized and batch size. Click More to view a list of the hyperparameters and settings used during training. Click Less to hide the hyperparameters and settings list.

Generative tuning RDUs
Figure 8. Expanded Details & Hyperparameters
Progress bar

The progress bar displays the state of the training job as well as the percentage completed of the training run.

Metrics graph

Displays the various metrics generated during the training run. GPT 1.5B models, such as GPT_1.5B_NER_FINETUNED, generate additional metrics. Click Expand to view the additional metrics. Click Collapse to hide the additional metrics.

Expanded metrics
Figure 9. Expanded additional metrics
Checkpoints table

The Checkpoints table displays generated checkpoints of your training run.

  • You can customize your view of the Checkpoints table by enabling/disabling columns from the Columns drop-down to help you focus on comparing metrics that are relevant to you.

  • Download a CSV file of your checkpoints by clicking Export and selecting Download as CSV from the drop-down. The CSV file will be downloaded to the location configured by your browser.

  • From the Actions column drop-down, you can select Save to Model Hub or Delete for a checkpoint.

    Checkpoints table
    Figure 10. Checkpoints table
  • For GPT 1.5B models, you can view the Confusion matrix for a checkpoint that can be used to further understand checkpoint performance. From the Actions drop-down, click Checkpoint metrics. The Confusion matrix window will open.

    Confusion matrix
    Figure 11. Confusion matrix

    All labels listed in your labels file must be represented in the validation dataset. This ensures that the confusion matrix does not generate errors associated with missing labels or incorrectly attributed metrics.
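The exported checkpoints CSV can also be compared programmatically. The sketch below assumes hypothetical column names (checkpoint_name, val_loss); adjust them to match your exported file:

```python
# Sketch: rank checkpoints from an exported CSV by a metric column.
# The column names ('checkpoint_name', 'val_loss') are hypothetical
# placeholders; adjust them to match the actual exported file.
import pandas as pd


def best_checkpoints(csv_path, metric='val_loss', ascending=True, n=3):
    """Return the top-n rows sorted by the given metric column."""
    df = pd.read_csv(csv_path)
    return df.sort_values(metric, ascending=ascending).head(n)
```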

Evaluate the job using the CLI

Similar to the GUI, the SambaNova API (snapi) provides feedback on job performance via the CLI. The example below demonstrates the snapi job metrics command. You will need to specify the following:

  • The project that contains, or is assigned to, the job whose metrics you wish to view.

  • The name of the job whose metrics you wish to view.

If a Confusion Matrix can be generated for the job, the path to the generated matrix will be displayed in the output.

Example snapi job metrics command
$ snapi job metrics \
   --project <project-name> \
   --job <job-name>

TRAINING
INDEX        TRAIN_LOSS        TRAIN_STEPS
  0             0.0                0.0
  1             2.47               10.0
  2             2.17               20.0
  3             2.02               30.0
  4             2.06               40.0
  5             2.01               50.0
  6             2.0                60.0
  7             1.93               70.0
  8             2.0                80.0
  9             1.95               90.0
  10            2.0               100.0

VALIDATION
INDEX        VAL_STEPS        VAL_LOSS        VAL_STEPS_PER_SECOND
  0             0.0             2.04                  0.13
  1             50.0            2.03                  0.13
  2            100.0            2.03                  0.13

Confusion Matrix generated here -> <path-to-generated-confusion-matrix-jpeg>

Run snapi job metrics --help to display additional usage and options.

Reviewing metrics

Navigate to a Training job’s detail page from the Dashboard or from its associated Project page to review its metrics.

  • The details page provides information about your training job including its completion status. Click Expand to view additional information.

    Image classification training details
    Figure 12. Training details
  • Click Collapse to hide additional information.

    Expanded information
    Figure 13. Expanded additional information
  • Customize your view of the Checkpoints table by enabling/disabling columns to help you focus on comparing information that is relevant to you.

    • Download a CSV file of your checkpoints by clicking Export and selecting Download as CSV from the drop-down.

      Checkpoints table
      Figure 14. Checkpoints table

Save a checkpoint to the model hub

Once you’ve identified a checkpoint to use for inference or further fine-tuning, save it to the Model Hub. Do this by clicking the 3-dot menu associated with that checkpoint, selecting Save to Model Hub, and providing a Name and Description to help you identify the checkpoint.

Save checkpoint to model hub
Figure 15. Save checkpoint to model hub

View and download logs using the GUI

The Logs section allows you to preview and download logs of your training session. Logs can help you track progress, identify errors, and determine the cause of potential errors.

Logs can be visible in the platform earlier than other data, such as metrics, checkpoints, and job progress.

  1. From the Preview drop-down, select the log file you wish to preview.

    1. The Preview window displays the latest 50 lines of the log.

    2. To view more than 50 lines of the log, use the Download all feature to download the log file.

  2. Click Download all to download a compressed file of your logs. The file will be downloaded to the location configured by your browser.

Logs
Figure 16. Logs

View logs using the CLI

Similar to viewing logs using the GUI, you can use the SambaNova API (snapi) to preview and download logs of your training session.

View the job log file names

The example below demonstrates the snapi job list-logs command. Use this command to view the job log file names of your training job. This is similar to using the Preview drop-down menu in the GUI to view and select your job log file names. You will need to specify the following:

  • The project that contains, or is assigned to, the job whose log file names you wish to view.

  • The name of the job whose log file names you wish to view.

Example snapi job list-logs command
$ snapi job list-logs \
   --project <project-name> \
   --job <job-name>
train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-runner.log
train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-model.log

Run snapi job list-logs --help to display additional usage and options.

Preview a log file

After you have viewed the log file names for your training job, you can use the snapi job preview-log command to preview the logs corresponding to a selected log file. The example below demonstrates the snapi job preview-log command. You will need to specify the following:

  • The project that contains, or is assigned to, the job whose log file you wish to preview.

  • The name of the job whose log file you wish to preview.

  • The name of the log file you wish to preview. This file name is returned by the snapi job list-logs command, described above.

Example snapi job preview-log command
$ snapi job preview-log \
   --project <project-name> \
   --job <job-name> \
   --file train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-runner.log
2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Runner starting...

2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Runner successfully started

2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Received new train request

2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Connecting to modelbox at localhost:50061

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Running training

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Staging dataset

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  initializing metrics for modelbox:0

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  initializing checkpoint path for modelbox:0

2023-08-10 20:31:35  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Preparing training for modelbox:0

2023-08-10 20:31:35  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Running training for modelbox

Run snapi job preview-log --help to display additional usage and options.

Download the logs

Use the snapi job download-logs command to download a compressed file of your training job’s logs. The example below demonstrates the command. You will need to provide the following:

  • The project that contains, or is assigned to, the job whose compressed log file you wish to download.

  • The name of the job whose compressed log file you wish to download.

Example snapi download-logs command
$ snapi job download-logs \
   --project <project-name> \
   --job <job-name>
Successfully Downloaded: <job-name> logs

The default destination for the compressed file download is the current directory. To specify a destination directory, use the --dest option. Run snapi job download-logs --help for more information and to display additional usage and options.
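Once downloaded, the compressed file can be unpacked programmatically. The sketch below assumes a .tar.gz archive; the archive format and file names are assumptions, not documented platform behavior:

```python
# Sketch: extract a downloaded compressed log archive and list its contents.
# The .tar.gz format is an assumption; adjust for the actual file you receive.
import tarfile
from pathlib import Path


def extract_logs(archive_path, dest='logs'):
    """Extract the archive into dest and return the extracted file names."""
    Path(dest).mkdir(exist_ok=True)
    with tarfile.open(archive_path, 'r:*') as tar:  # 'r:*' autodetects compression
        tar.extractall(dest)
    return sorted(p.name for p in Path(dest).iterdir())
```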