Image classification models

Image classification is the task of assigning one or more predefined class labels to an image. This document provides information for SambaStudio’s image classification model, Vit_B_Classification.

Data preparation

In an image classification task, the classification data typically consists of a set of images along with corresponding labels. The images can be in any standard format, such as JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics).

Training dataset requirements

An uploaded dataset is expected to have the following components:

  • A directory of images.

  • A labels.csv file, responsible for mapping each image location to its label and for identifying whether the image belongs to the train, test, or validation split.

  • A class_to_idx.json file [Optional]. This file is responsible for mapping the class’s verbose name to the class index. It provides a way to retrieve the human-readable interpretation from the index number that corresponds to a specific class label.

The uploaded data should have a directory structure similar to the example below.

Example directory structure
.
└── data_root/
    ├── images/
    ├── labels.csv
    └── class_to_idx.json  # Optional

The directory name images in the above example is not strictly required. The only requirement is that the image path specified in labels.csv is relative to the data_root directory.

For example, the following directory structure is also valid, assuming labels.csv points to images in the location train/path/to/image.jpeg.

.
└── data_root/
    ├── train/
    ├── test/
    └── labels.csv

Image formats

JPEG (.jpg extension) and PNG (.png extension) are allowed formats. All images should be three channel RGB, with uint8 encoding. For example, if initial images have a fourth alpha channel, this will need to be removed during the dataset processing step.
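
A minimal sketch of that processing step, assuming Pillow is installed and a hypothetical raw_images input directory, could look like the following.

Example alpha channel removal
from pathlib import Path

from PIL import Image

source_dir = Path('raw_images')        # hypothetical directory of original images
output_dir = Path('data_root/images')
output_dir.mkdir(parents=True, exist_ok=True)

for path in source_dir.glob('*.png'):
    with Image.open(path) as img:
        # Drop the alpha channel (and convert palette or grayscale images) to uint8 RGB.
        img.convert('RGB').save(output_dir / path.name)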

Labels CSV

You are required to provide a .csv file specifying each image-label pair, along with an indicator of whether the sample belongs to the training, test, or validation set. This information is denoted by the column headers described below:

  • image_path header denotes the relative path to a given image inside the dataset directory.

  • label header denotes the class ID present in the image, ranging from [0..n-1], where n is the number of classes. In the case of multi-label classification, multiple labels for a sample are separated by spaces.

  • subset header denotes one of train, test, or validation, indicating whether the image is in the training, test, or validation set.

  • metadata header [Optional] denotes additional information relating to the given input sample.

Example .csv file
$ column -s, -t caltech256.csv | head -n 4
image_path                                  label  subset       metadata
./images/138.mattress/138_0117.jpg          0      train
./images/138.mattress/138_0103.jpg          0 3 11 validation
./images/138.mattress/138_0088.jpg          0      train

The column command is used here only to pretty-print the .csv for display purposes.
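
As a sanity check before uploading, a minimal sketch such as the following (assuming pandas is installed, and using the hypothetical data_root layout from the examples above) can confirm that the required columns, subset values, and image paths are consistent.

Example labels.csv check
from pathlib import Path

import pandas as pd

data_root = Path('data_root')   # hypothetical dataset root containing images/ and labels.csv
df = pd.read_csv(data_root / 'labels.csv')

# image_path, label, and subset are required; metadata is optional.
assert {'image_path', 'label', 'subset'}.issubset(df.columns)
assert df['subset'].isin({'train', 'test', 'validation'}).all()

# Every image path should resolve relative to the dataset root.
missing = [p for p in df['image_path'] if not (data_root / p).exists()]
print(f'{len(missing)} image paths in labels.csv do not exist on disk')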

The class index mapping file

The class_to_idx.json file is responsible for mapping each human-interpretable class name to its class index. The expected format of this file is a dictionary mapping name strings to integer indices.

Example class_to_idx.json file
$ python -m json.tool imagenet1000.json | head
{
    "tench, Tinca tinca": 0,
    "goldfish, Carassius auratus": 1,
    "great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias": 2,
    "tiger shark, Galeocerdo cuvieri": 3,
    "hammerhead, hammerhead shark": 4,
    "electric ray, crampfish, numbfish, torpedo": 5,
    "stingray": 6,
    "cock": 7,
    "hen": 8,

The class_to_idx.json file is not used in the app workflow. It is included as a record of how the dataset was created, so that the class name mapping can be retrieved at a later date.

The app does not check for the existence of this file or verify its correctness.

CIFAR-100 example

Download the CIFAR-100 data (Python version) from https://www.cs.toronto.edu/~kriz/cifar.html and extract it so that the cifar-100-python directory sits alongside the script below.

import asyncio
import aiofiles
from io import BytesIO
from PIL import Image
import pickle
from pathlib import Path
import pandas as pd


# Change to False if only the labels.csv file needs to be processed
SAVE_IMAGE = True

# The async code will open too many files at one time. Let's limit this
num_of_max_files_open = 200

data_dir = Path('./data')
data_dir.mkdir(exist_ok=True)


def unpickle(file):
    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data


def load_subset(subset):
    data = unpickle(f'./cifar-100-python/{subset}')

    filenames = data[b'filenames']
    labels = data[b'fine_labels']
    images = data[b'data']

    assert len(labels) == len(images)
    assert len(filenames) == len(images)

    return filenames, labels, images


async def save_image(path: Path, image: memoryview) -> None:
    async with aiofiles.open(path, "wb") as file:
        await file.write(image)


async def write_image(filename, label, image, subset, row, sem):
    subset_dir = data_dir / subset
    filepath = subset_dir / filename.decode()

    async with sem:
        # Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the
        # next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32
        # entries of the array are the red channel values of the first row of the image.
        if SAVE_IMAGE:
            image = image.reshape(3, 32, 32).transpose(1, 2, 0)
            img = Image.fromarray(image)
            buffer = BytesIO()
            img.save(buffer, format='png')
            await save_image(filepath, buffer.getbuffer())

    if row % 100 == 0:
        print(f"{row:05d}", flush=True)

    if subset == 'test':
        # we use the ``test`` set as the ``validation`` set in this example
        subset = 'validation'

    return [str(filepath.relative_to(data_dir)), str(label), subset, str('')]


async def process_subset(subset):

    subset_dir = data_dir / subset
    subset_dir.mkdir(exist_ok=True)

    tasks = []
    sem = asyncio.Semaphore(num_of_max_files_open)
    filenames, labels, images = load_subset(subset)

    for row, sample in enumerate(zip(filenames, labels, images)):
        tasks.append(asyncio.ensure_future(write_image(*sample, subset=subset, row=row, sem=sem)))

    results = await asyncio.gather(*tasks)
    df = pd.DataFrame(results, columns=["image_path", "label", "subset", "metadata"])
    return df


async def main():

    print("Processing training images")
    train_df = await process_subset('train')
    print("Processing test images")
    test_df = await process_subset('test')
    df = pd.concat([train_df, test_df])
    df.to_csv(data_dir / 'labels.csv', index=False)


asyncio.run(main())
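
The script above produces the images and labels.csv. If the optional class_to_idx.json file is also wanted, a short sketch such as the following can generate it, assuming the extracted cifar-100-python/meta file contains the fine_label_names list, as in the standard CIFAR-100 Python distribution.

Example class_to_idx.json generation
import json
import pickle
from pathlib import Path

data_dir = Path('./data')

with open('./cifar-100-python/meta', 'rb') as fo:
    meta = pickle.load(fo, encoding='bytes')

# With encoding='bytes', the label names are returned as bytes and need decoding.
fine_label_names = [name.decode() for name in meta[b'fine_label_names']]

class_to_idx = {name: idx for idx, name in enumerate(fine_label_names)}

with open(data_dir / 'class_to_idx.json', 'w') as f:
    json.dump(class_to_idx, f, indent=4)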

Batch inference dataset requirements

A batch prediction dataset is expected to have the following components:

  • A directory of images.

  • A predictions.csv file, responsible for listing the location of each image to run inference on.

A dataset formatted for batch prediction should have a directory structure similar to the example below.

Batch prediction directory structure
.
└── data_root/
    ├── images/
    └── predictions.csv

Predictions CSV

The predictions.csv file has the same format as labels.csv; however, the label, subset, and metadata columns are ignored.

Example predictions.csv file
$ column -s, -t predictions.csv | head -n 4
image_path                                  label  subset  metadata
138.mattress/138_0117.jpg                   0      train
138.mattress/138_0103.jpg                   0 3 11 validation
138.mattress/138_0088.jpg                   0      train
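
A minimal sketch for generating a predictions.csv from a directory of images, assuming pandas is installed and the hypothetical data_root/images layout shown above, could look like the following.

Example predictions.csv generation
from pathlib import Path

import pandas as pd

data_root = Path('data_root')   # hypothetical batch prediction dataset root

image_paths = sorted(
    str(p.relative_to(data_root)) for p in (data_root / 'images').glob('*.jpg')
)

# Only image_path is used for batch prediction; the remaining columns are ignored.
df = pd.DataFrame({
    'image_path': image_paths,
    'label': 0,
    'subset': 'test',
    'metadata': '',
})
df.to_csv(data_root / 'predictions.csv', index=False)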

Outputs

The output returned for a given prediction sample will be the probabilities over all output classes of the model.

[
    {'input': 'path/relative/to/root_dir', 'predictions': [.1, .3, .6]},
    ...
]

There will be an input-prediction pair for each of the images passed to the infer API.

The probabilities associated with a prediction will only sum to 1 in the multi-class (single label) case. For multi-label classification, the classification probability of each class is returned independently, so the outputs are not guaranteed to sum to 1.

The predicted responses are written to result_dir/ under the filename predictions.jsonl in the JSON Lines format.
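
As an illustration of consuming this output, a minimal sketch (assuming each line of predictions.jsonl is a JSON object with the input and predictions fields described above) could look like the following.

Example predictions.jsonl parsing
import json

with open('result_dir/predictions.jsonl') as f:
    for line in f:
        record = json.loads(line)
        probs = record['predictions']

        # Multi-class: take the single most probable class index.
        predicted_class = max(range(len(probs)), key=probs.__getitem__)

        # Multi-label: treat each class independently, e.g. with a 0.5 threshold.
        predicted_labels = [i for i, p in enumerate(probs) if p >= 0.5]

        print(record['input'], predicted_class, predicted_labels)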

Hyperparameters and settings

The hyperparameters and settings for the Image classification models when creating a training job are described below.

  • attention_dropout_rate: Dropout rate used in the attention block during training. Default value: 0.0. Allowed values: 0 ≤ x < 1

  • batch_size: Number of samples in a batch. Default value: 64. Allowed values: 1 or 64

  • dropout_rate: Dropout rate used in the multilayer perceptron (MLP) during training. Default value: 0.0. Allowed values: 0 ≤ x < 1

  • learning_rate: Max learning rate used in the OneCycleLR scheduler. Default value: 0.0001. Allowed values: 0 < x

  • logging_steps: Number of steps between logging metrics. Default value: 50. Allowed values: 1 ≤ x

  • multilabel: Enables multilabel classification instead of the original multiclass classification. Default value: false. Allowed values: true or false

  • num_epochs: Number of times to loop over the training dataset. Default value: 3. Allowed values: 1 ≤ x

  • num_intended_classes: Total number of classes to predict. Default value: 257. Allowed values: 1 ≤ x ≤ 1000

  • weight_decay: Weight decay throughout the training run. Default value: 0.0. Allowed values: 0 ≤ x
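
Purely as an illustration, and not tied to any specific SambaStudio API call, a training configuration that respects the ranges above might be written as the following Python dictionary.

Example training hyperparameter values
# Hypothetical configuration dictionary; the keys mirror the parameters above, and the
# job-submission step that would consume it is not shown here.
training_hyperparameters = {
    'attention_dropout_rate': 0.0,  # 0 <= x < 1
    'batch_size': 64,               # 1 or 64
    'dropout_rate': 0.1,            # 0 <= x < 1
    'learning_rate': 0.0001,        # 0 < x
    'logging_steps': 50,            # 1 <= x
    'multilabel': False,            # true or false
    'num_epochs': 3,                # 1 <= x
    'num_intended_classes': 257,    # 1 <= x <= 1000
    'weight_decay': 0.0,            # 0 <= x
}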

Inference settings

The inference settings for Image classification models when creating a batch inference job are described below.

  • batch_size: Number of samples in a batch. Default value: 64. Allowed values: 1 or 64

  • multilabel: Enables multilabel classification instead of the original multiclass classification. Default value: false. Allowed values: true or false

  • num_intended_classes: Total number of classes to predict. Default value: 257. Allowed values: 1 ≤ x ≤ 1000