Using LayerNorm instead of BatchNorm
The SambaNova hardware architecture takes full advantage of pipeline parallelism, and pipelining on the batch dimension of a tensor often produces the best performance. BatchNorm is a popular way to improve training, but combining it with this kind of parallelization requires resynchronization after each normalization operation: samples must be gathered into a batch before they can be normalized at each stage and passed to the next.
To avoid this need for synchronization, SambaNova recommends using LayerNorm in most settings.
The structure of your model and data affects whether BatchNorm or LayerNorm gives you the best results. Both approaches are supported.
You can read more about various normalization methods here: https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8.
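The difference in synchronization needs can be illustrated with a minimal plain-Python sketch. It is not SambaFlow or PyTorch code: the helper names batch_norm and layer_norm are hypothetical, and the learnable scale/shift parameters and running statistics of the real layers are left out. The point it shows is that BatchNorm statistics depend on every sample in the batch, while LayerNorm statistics depend only on the sample being normalized.

```python
from statistics import fmean, pvariance


def batch_norm(batch, eps=1e-5):
    """Normalize each feature with statistics taken across the whole batch."""
    n_features = len(batch[0])
    columns = [[sample[j] for sample in batch] for j in range(n_features)]
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return [
        [(x - means[j]) / (variances[j] + eps) ** 0.5 for j, x in enumerate(sample)]
        for sample in batch
    ]


def layer_norm(batch, eps=1e-5):
    """Normalize each sample with statistics taken from that sample alone."""
    out = []
    for sample in batch:
        mean = fmean(sample)
        var = pvariance(sample)
        out.append([(x - mean) / (var + eps) ** 0.5 for x in sample])
    return out


batch = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
alone = [batch[0]]

# LayerNorm: the first sample is normalized identically with or without its batch.
print(layer_norm(batch)[0] == layer_norm(alone)[0])
# BatchNorm: the result for the first sample changes when the batch changes.
print(batch_norm(batch)[0] == batch_norm(alone)[0])
```

Because a LayerNorm result never depends on the other samples, each sample can move through the pipeline without waiting for the rest of the batch.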
Customize the model code
This tutorial uses a simple CNN model from a popular PyTorch tutorial. The model solves the classic machine learning problem of recognizing handwritten digits from the MNIST dataset.
It’s a simple 2-layer CNN that uses Conv2d, BatchNorm2d, and MaxPool2d.
The output layer is a fully connected one.
Original code (BatchNorm)
Here is the original model code:
# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
Revised code (LayerNorm)
As an alternative, you can revise the code to use LayerNorm. The following code fragment replaces BatchNorm2d with LayerNorm and includes a few other changes. Here’s the revised code with comments below.
class ConvNet(nn.Module):
    def __init__(self, num_classes=10, input_shape=[28, 28]):  # (1)
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.LayerNorm([16] + input_shape),  # (2)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        input_shape = [input_shape[0] // 2, input_shape[1] // 2]  # (3)
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.LayerNorm([32] + input_shape),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)
        self.criterion = nn.CrossEntropyLoss()  # (4)

    def forward(self, x, labels):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        loss = self.criterion(out, labels)
        return loss, out  # (5)
(1) LayerNorm requires the normalized shape (normalized_shape in the PyTorch documentation). That shape is derived from the input shape, so the input shape is passed as a constructor argument.
(2) Prepend the number of features that was passed to BatchNorm2d() to input_shape, and use the resulting list as the argument for LayerNorm().
(3) For the second layer, reduce input_shape. Because the previous MaxPool2d call uses stride=2, divide each dimension of input_shape by 2. If you use a different stride in your model, divide input_shape by that value.
(4) We recommend adding the loss function to the model’s class so that it is calculated on the RDU. Some loss functions might not be supported yet in SambaFlow; in that case you can calculate the loss on the host.
(5) The forward() function takes the labels as a second argument, computes the loss inside the model, and returns both the loss and the output.
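The shape bookkeeping described in the notes above can be checked with a small plain-Python sketch. The helper names (conv2d_out, max_pool2d_out, layer_norm_shapes) are hypothetical and not part of PyTorch or SambaFlow; the formulas are the standard output-size formulas for Conv2d and MaxPool2d with dilation 1, and the kernel, stride, and padding values are the ones used in the model above.

```python
def conv2d_out(size, kernel_size, stride=1, padding=0):
    """Length of one spatial dimension after Conv2d (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def max_pool2d_out(size, kernel_size, stride):
    """Length of one spatial dimension after MaxPool2d (no padding, dilation 1)."""
    return (size - kernel_size) // stride + 1


def layer_norm_shapes(input_shape, channels=(16, 32)):
    """Return the LayerNorm shape for each conv block and the fc input size."""
    h, w = input_shape
    shapes = []
    for c in channels:
        # Conv2d(kernel_size=5, stride=1, padding=2) keeps the spatial size.
        h, w = conv2d_out(h, 5, 1, 2), conv2d_out(w, 5, 1, 2)
        shapes.append([c, h, w])  # LayerNorm sees the conv output
        # MaxPool2d(kernel_size=2, stride=2) halves the spatial size.
        h, w = max_pool2d_out(h, 2, 2), max_pool2d_out(w, 2, 2)
    return shapes, channels[-1] * h * w


shapes, fc_inputs = layer_norm_shapes([28, 28])
print(shapes)     # [[16, 28, 28], [32, 14, 14]]
print(fc_inputs)  # 1568, which equals 7*7*32
```

For a 2x2 pool with stride 2, the general formula gives the same result as the integer division by 2 used in the model. With other kernel sizes, strides, or padding values, computing the shape from the formula is the safer choice.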
Compile the model
Before you can run the model on an RDU, you have to compile it.
You can use the samba.session.compile() service function. You pass your model to that function and it produces a PEF file, a binary file that will be uploaded to the RDU.
Here’s how you use the compile() function.
args = parse_app_args(dev_mode=True,
                      common_parser_fn=add_common_args)
utils.set_seed(256)
model = ConvNet(args.num_classes)
samba.from_torch_(model)
inputs = get_inputs(args)
optimizer = samba.optim.AdamW(model.parameters(), lr=args.lr) if not args.inference else None
if args.command == "compile":
    samba.session.compile(model,
                          inputs,
                          optimizer,
                          name='cnn_mnist',
                          app_dir=utils.get_file_dir(__file__),
                          squeeze_bs_dim=True,
                          config_dict=vars(args),
                          pef_metadata=get_pefmeta(args, model))
Prepare the datasets
Data preparation is different in the original tutorial and in our revision for LayerNorm.
Original data preparation (BatchNorm)
This is how dataset preparation is done in the original tutorial:
# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)
test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False,
                                          transform=transforms.ToTensor())
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
Revised data preparation (LayerNorm)
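To use the loaders with LayerNorm in SambaFlow, you have to add one parameter to the DataLoader calls: drop_last=True.
The parameter drops the final batch when it is smaller than the others, which ensures that every batch has the same batch size. See the PyTorch DataLoader documentation for details.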
Here is the code with the additional parameter:
# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)
test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False,
                                          transform=transforms.ToTensor())
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=args.batch_size,
                                           shuffle=True,
                                           num_workers=7,
                                           drop_last=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=args.batch_size,
                                          shuffle=False,
                                          num_workers=7,
                                          drop_last=True)
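To see what drop_last=True changes, the following plain-Python sketch lists the batch sizes a loader would produce. The helper batch_sizes is hypothetical and only mimics the batching arithmetic; it does not use DataLoader. The batch size of 64 is an assumed example value, while 60000 and 10000 are the sizes of the MNIST training and test sets.

```python
def batch_sizes(num_samples, batch_size, drop_last):
    """Return the size of every batch a loader would yield."""
    full_batches, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder and not drop_last:
        sizes.append(remainder)  # the smaller, final batch
    return sizes


# MNIST training set, assumed batch size of 64
kept = batch_sizes(60000, 64, drop_last=False)
dropped = batch_sizes(60000, 64, drop_last=True)
print(len(kept), kept[-1])        # 938 32
print(len(dropped), dropped[-1])  # 937 64

# MNIST test set: some images are skipped when the batch size does not divide 10000
print(sum(batch_sizes(10000, 64, drop_last=True)))  # 9984
```

Note the consequence for the test loader: unless the batch size divides the test set evenly, a few test images are never evaluated, so the accuracy is computed over slightly fewer than 10000 images.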
Train the model
To train the model on an RDU, you have to convert it from using native PyTorch Tensors to SambaTensors.
Original training code (BatchNorm)
Here’s the original code used to train the model:
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Revised training code (LayerNorm)
And here is the revised code, which converts the inputs to SambaTensors and converts the results back to PyTorch Tensors (discussed below).
# Train the model
total_step = len(train_loader)
hyperparam_dict = {"lr": args.lr}
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = samba.from_torch(images, name='image', batch_dim=0)  # (1)
        labels = samba.from_torch(labels, name='label', batch_dim=0)
        loss, outputs = samba.session.run(input_tensors=[images, labels],
                                          output_tensors=model.output_tensors,
                                          hyperparam_dict=hyperparam_dict,
                                          data_parallel=args.data_parallel,
                                          reduce_on_rdu=args.reduce_on_rdu)
        loss, outputs = samba.to_torch(loss), samba.to_torch(outputs)  # (2)
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
(1) Convert images and labels from PyTorch Tensors to SambaTensors.
(2) Convert the output and loss SambaTensors back to PyTorch Tensors.
When using an RDU, all three steps (forward, backward, and optimizer) are computed in a single samba.session.run() call.
Test the model
The code for testing the model on the RDU is similar to the training code.
Because we only need to run the forward section, we can pass that to the runtime as a parameter. Here’s the code:
# Test the model
model.eval()  # eval mode (kept from the original tutorial; LayerNorm has no running statistics to switch to)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = samba.from_torch(images, name='image', batch_dim=0)
        labels = samba.from_torch(labels, name='label', batch_dim=0)
        loss, outputs = samba.session.run(input_tensors=[images, labels],
                                          section_types=["fwd"],  # (1) run only the forward section
                                          output_tensors=model.output_tensors,
                                          data_parallel=args.data_parallel,
                                          reduce_on_rdu=args.reduce_on_rdu)
        outputs = samba.to_torch(outputs)
        labels = samba.to_torch(labels)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    # Report the number of images actually evaluated; with drop_last=True it can be below 10000
    print('Test Accuracy of the model on the {} test images: {} %'.format(total, 100 * correct / total))
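The accuracy bookkeeping in the loop above can be sketched in plain Python without any tensors. The helpers argmax and count_correct are hypothetical stand-ins for the torch.max(outputs, 1) and (predicted == labels).sum() steps, and the scores and labels are made-up example values.

```python
def argmax(row):
    """Index of the largest score in one row of model outputs."""
    return max(range(len(row)), key=row.__getitem__)


def count_correct(outputs, labels):
    """Count samples whose highest-scoring class matches the label."""
    predicted = [argmax(row) for row in outputs]
    return sum(p == y for p, y in zip(predicted, labels))


# Three samples with three class scores each (made-up values)
outputs = [[0.1, 2.0, -1.0],
           [1.5, 0.2, 0.3],
           [0.0, 0.1, 0.9]]
labels = [1, 0, 1]

correct = count_correct(outputs, labels)
total = len(labels)
print('Accuracy on {} samples: {:.1f} %'.format(total, 100 * correct / total))
```

In the real loop, correct and total accumulate across all batches before the percentage is computed.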
Main function
The main function can use command-line arguments such as compile and run.
- In the first pass, you compile the model and produce a PEF file that will be placed on the RDU. See (1) below.
- In the second pass, you use the run command and specify the PEF file that you want to use with the model. See (2) below.
Here is the code for the main() function:
def main():
    args = parse_app_args(dev_mode=True,
                          common_parser_fn=add_common_args)
    utils.set_seed(256)
    model = ConvNet(args.num_classes)
    samba.from_torch_(model)
    inputs = get_inputs(args)
    optimizer = samba.optim.AdamW(model.parameters(), lr=args.lr) if not args.inference else None
    if args.command == "compile":  # (1)
        samba.session.compile(model,
                              inputs,
                              optimizer,
                              name='cnn_mnist',
                              app_dir=utils.get_file_dir(__file__),
                              squeeze_bs_dim=True,
                              config_dict=vars(args),
                              pef_metadata=get_pefmeta(args, model))
    elif args.command == "run":  # (2)
        # Run compiled model
        utils.trace_graph(model, inputs, optimizer, pef=args.pef, mapping=args.mapping)
        train(args, model, optimizer)

if __name__ == '__main__':
    main()
You can find the full example here.
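The two-pass flow (compile first, then run with the resulting PEF file) can be illustrated with a stand-alone sketch built on Python's argparse. This is not the SambaFlow parse_app_args() parser: the flag names, the default batch size, and the PEF path in the usage lines are assumptions made for illustration, and the dispatch function only reports which branch would execute.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the parser that parse_app_args() provides."""
    parser = argparse.ArgumentParser(prog="cnn_mnist")
    commands = parser.add_subparsers(dest="command", required=True)

    compile_cmd = commands.add_parser("compile", help="produce a PEF file")
    compile_cmd.add_argument("--batch-size", type=int, default=32)

    run_cmd = commands.add_parser("run", help="train using a compiled PEF file")
    run_cmd.add_argument("--pef", required=True, help="path to the PEF file")
    return parser


def dispatch(argv):
    """Mirror the if/elif structure of main() without touching an RDU."""
    args = build_parser().parse_args(argv)
    if args.command == "compile":
        return "compile with batch size {}".format(args.batch_size)
    return "run with PEF {}".format(args.pef)


# First pass: compile. Second pass: run with the PEF file (hypothetical path).
print(dispatch(["compile", "--batch-size", "64"]))
print(dispatch(["run", "--pef", "out/cnn_mnist/cnn_mnist.pef"]))
```

Keeping compilation and execution behind separate commands means the model only has to be compiled once and can then be run repeatedly against the same PEF file.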