Using LayerNorm instead of BatchNorm

The SambaNova hardware architecture takes full advantage of pipeline parallelism, and pipelining on the batch dimension of a tensor often produces the best performance. BatchNorm is a popular way to improve training, but it computes its statistics across the batch dimension, so a pipelined implementation must resynchronize after each normalization operation: samples must be gathered into a complete batch before each stage can finish its computation and pass results to the next stage.

To avoid this need for synchronization, SambaNova recommends using LayerNorm in most settings.

The structure of your model and data affects whether BatchNorm or LayerNorm gives you the best results. Both approaches are supported.
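
The difference is easy to see in plain PyTorch. The following standalone sketch (illustrative only, not SambaFlow-specific) shows that LayerNorm normalizes each sample independently, whereas BatchNorm's statistics depend on the whole batch:

# Standalone illustration (plain PyTorch, not part of the tutorial code)
import torch
import torch.nn as nn

x = torch.randn(8, 16, 28, 28)  # (batch, channels, height, width)

# BatchNorm2d: one mean/variance per channel, computed across (batch, H, W),
# so every sample's normalization depends on the other samples in the batch.
bn = nn.BatchNorm2d(16)
y_bn = bn(x)

# LayerNorm: mean/variance computed per sample over (C, H, W);
# nothing crosses the batch dimension, so no cross-sample synchronization is needed.
ln = nn.LayerNorm([16, 28, 28])
y_ln = ln(x)

# Normalizing the first sample on its own matches the full-batch result for
# LayerNorm, but generally not for BatchNorm.
print(torch.allclose(ln(x[:1]), y_ln[:1], atol=1e-6))  # True
print(torch.allclose(bn(x[:1]), y_bn[:1], atol=1e-6))  # False (in general)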

You can read more about various normalization methods here: https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8

Customize the model code

This tutorial uses a simple CNN model from a popular PyTorch tutorial. The model solves the classic machine learning problem of recognizing hand-written digits from the MNIST dataset. It's a simple two-layer CNN built from Conv2d, BatchNorm2d, and MaxPool2d layers, followed by a fully connected output layer.

Original code (BatchNorm)

Here is the original model code:

# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

Revised code (LayerNorm)

As an alternative, you can revise the code to use LayerNorm. The following code fragment replaces BatchNorm2d with LayerNorm and includes a few other changes, which are explained in the numbered notes below the code.

class ConvNet(nn.Module):
    def __init__(self, num_classes=10, input_shape=[28, 28]): (1)
        super(ConvNet, self).__init__()

        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.LayerNorm([16] + input_shape),   (2)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))

        input_shape = [input_shape[0] // 2, input_shape[1] // 2]  (3)
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.LayerNorm([32] + input_shape),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))

        self.fc = nn.Linear(7*7*32, num_classes)
        self.criterion = nn.CrossEntropyLoss()  (4)

    def forward(self, x, labels):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        loss = self.criterion(out, labels)
        return loss, out  (5)
1 LayerNorm requires the normalized shape as an argument (see the PyTorch documentation). Here the normalized shape is derived from the model's input shape, so the constructor takes input_shape as an additional argument.
2 Prepend the channel count that was passed to BatchNorm2d() to input_shape and pass the resulting list to LayerNorm().

The normalized shape for LayerNorm is [C, H, W], so you need to provide the height and width in addition to the channel count.

3 For the second layer, reduce input_shape. Because the previous MaxPool2d call uses stride=2, divide input_shape by 2. If you use a different stride in your model, divide input_shape by that value instead. The shape check after these notes walks through the arithmetic.
4 We recommend adding the loss function to the model class so that the loss is calculated on the RDU. Some loss functions might not be supported yet in SambaFlow; in that case, you can calculate the loss on the host.
5 The forward() function returns both the loss and the model output.
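
If you want to verify the shapes, the following standalone PyTorch sketch (not part of the tutorial code) pushes a dummy MNIST-sized batch through the two layers and prints the shapes that determine the LayerNorm arguments and the input size of the fully connected layer:

# Standalone shape check (plain PyTorch, not part of the tutorial code)
import torch
import torch.nn as nn

x = torch.randn(4, 1, 28, 28)  # dummy batch of 4 single-channel 28x28 images

layer1 = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
    nn.LayerNorm([16, 28, 28]),   # [C] + input_shape = [16, 28, 28]
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2))
out1 = layer1(x)
print(out1.shape)  # torch.Size([4, 16, 14, 14]): MaxPool2d with stride=2 halves H and W

layer2 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
    nn.LayerNorm([32, 14, 14]),   # input_shape divided by the pooling stride
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2))
out2 = layer2(out1)
print(out2.shape)  # torch.Size([4, 32, 7, 7]): flattened size is 7*7*32, the nn.Linear input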

Compile the model

Before you can run the model on an RDU, you have to compile it. You can use the samba.session.compile() service function: you pass your model to it, and it produces a PEF file, a binary file that is loaded onto the RDU when you run the model.

Here’s how you use the compile() function.

    args = parse_app_args(dev_mode=True,
                          common_parser_fn=add_common_args)
    utils.set_seed(256)
    model = ConvNet(args.num_classes)
    samba.from_torch_(model)

    inputs = get_inputs(args)

    optimizer = samba.optim.AdamW(model.parameters(), lr=args.lr) if not args.inference else None
    if args.command == "compile":
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='cnn_mnist',
                              app_dir=utils.get_file_dir(__file__),
                              squeeze_bs_dim=True,
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))
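
The fragment above calls a get_inputs() helper that is not shown in this section. Its job is to provide placeholder SambaTensors with the same shapes, names, and batch dimensions as the real inputs so that the compiler can trace the model. Here is a minimal sketch of what such a helper might look like; it assumes that samba.randn and samba.randint are available for creating placeholder tensors, so check the SambaFlow API reference for the exact signatures:

# Sketch only: assumes samba.randn / samba.randint placeholder constructors.
def get_inputs(args):
    # Placeholder image batch: (batch, channels, height, width), MNIST-sized.
    # The names and batch_dim must match those used for the real inputs at run time.
    images = samba.randn(args.batch_size, 1, 28, 28, name='image', batch_dim=0)
    # Placeholder integer labels in the range [0, num_classes).
    labels = samba.randint(args.num_classes, (args.batch_size,), name='label', batch_dim=0)
    return images, labels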

Prepare the datasets

Data preparation is different in the original tutorial and in our revision for LayerNorm.

Original data preparation (BatchNorm)

This is how dataset preparation is done in the original tutorial:

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

Revised data preparation (LayerNorm)

To use the data loaders with LayerNorm in SambaFlow, you have to add one parameter to the DataLoader calls: drop_last=True. (The example below also sets num_workers, which speeds up data loading but is otherwise optional.)

The drop_last parameter drops the last incomplete batch, so every batch the loader yields has the same size. See the torch.utils.data.DataLoader documentation for details.

Here is the code with the additional parameter:

    # MNIST dataset
    train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)

    test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                              train=False,
                                              transform=transforms.ToTensor())

    # Data loader
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=args.batch_size,
                                               shuffle=True,
                                               num_workers=7,
                                               drop_last=True)

    test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                              batch_size=args.batch_size,
                                              shuffle=False,
                                              num_workers=7,
                                              drop_last=True)
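
As a quick illustration of what drop_last does (plain PyTorch, independent of SambaFlow), the following sketch builds a toy stand-in for the 60,000-image MNIST training set and compares the number of batches with and without the parameter:

# Standalone illustration (plain PyTorch, not part of the tutorial code)
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the MNIST training set: 60,000 samples.
dataset = TensorDataset(torch.zeros(60000, 1, 28, 28),
                        torch.zeros(60000, dtype=torch.long))

with_drop = DataLoader(dataset, batch_size=64, drop_last=True)
without_drop = DataLoader(dataset, batch_size=64, drop_last=False)

print(len(with_drop))     # 937: every batch has exactly 64 samples; 32 samples are dropped
print(len(without_drop))  # 938: the final batch has only 32 samples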

Train the model

To train the model on an RDU, you have to convert the input tensors from native PyTorch tensors to SambaTensors.

Original training code (BatchNorm)

Here’s the original code used to train the model:

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

Revised training code (LayerNorm)

Here is the revised code, which converts the inputs to SambaTensors and converts the results back to PyTorch tensors, as described in the numbered notes below.

    # Train the model
    total_step = len(train_loader)
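    # Hyperparameters (here, the learning rate) are passed along to session.run() at every step.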
    hyperparam_dict = {"lr": args.lr}
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = samba.from_torch(images, name='image', batch_dim=0)  (1)
            labels = samba.from_torch(labels, name='label', batch_dim=0)

            loss, outputs = samba.session.run(input_tensors=[images, labels],
                                              output_tensors=model.output_tensors,
                                              hyperparam_dict=hyperparam_dict,
                                              data_parallel=args.data_parallel,
                                              reduce_on_rdu=args.reduce_on_rdu)
            loss, outputs = samba.to_torch(loss), samba.to_torch(outputs) (2)

            if (i+1) % 100 == 0:
                print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                    .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
1 Convert the images and labels from PyTorch tensors to SambaTensors.
2 Convert the loss and output SambaTensors back to PyTorch tensors.

Note that on the RDU, the forward pass, the backward pass, and the optimizer step are all computed in a single samba.session.run() call; there are no separate loss.backward() and optimizer.step() calls.

Test the model

The code for testing the model on the RDU is similar to the training code.

Because we only need to run the forward section, we can pass that to the runtime as a parameter. Here’s the code:

    # Test the model
    model.eval()  # set the model to evaluation mode
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = samba.from_torch(images, name='image', batch_dim=0)
            labels = samba.from_torch(labels, name='label', batch_dim=0)

            loss, outputs = samba.session.run(input_tensors=[images, labels],
                                              section_types=["fwd"],      (1)
                                              output_tensors=model.output_tensors,
                                              data_parallel=args.data_parallel,
                                              reduce_on_rdu=args.reduce_on_rdu)
            outputs = samba.to_torch(outputs)
            labels = samba.to_torch(labels)

            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))
1 Pass section_types=["fwd"] so that only the forward (inference) section of the compiled model runs during evaluation.

Main function

The main function accepts a command-line argument that selects one of two commands: compile or run.

  • In the first pass, you use the compile command to produce a PEF file that is later deployed to the RDU; see (1) below.

  • In the second pass, you use the run command and specify the PEF file to load for training the model; see (2) below.

Here is the code for the main() function:

def main():
    args = parse_app_args(dev_mode=True,
                          common_parser_fn=add_common_args)
    utils.set_seed(256)
    model = ConvNet(args.num_classes)
    samba.from_torch_(model)

    inputs = get_inputs(args)

    optimizer = samba.optim.AdamW(model.parameters(), lr=args.lr) if not args.inference else None
    if args.command == "compile": (1)
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='cnn_mnist',
                              app_dir=utils.get_file_dir(__file__),
                              squeeze_bs_dim=True,
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))

    elif args.command == "run": (2)
        #Run compiled model
        utils.trace_graph(model, inputs, optimizer, pef=args.pef, mapping=args.mapping)
        train(args, model, optimizer)


if __name__ == '__main__':
    main()

You can find the full example here.