Best practices

General best practices

Ensure that data is properly formatted for the model you’re working with. Data issues are the most common cause of problems with compiling or training a model. We have a public GitHub repository with two scripts for pretraining data creation, pipeline.py and data_prep.py. However, you might have to take additional steps to make sure that your data works for the model you’re trying to compile and run. General data preparation for AI is beyond the scope of this documentation.
Our architecture works natively in BF16, so BF16 supports the greatest number of operators. To avoid extra round trips to the host:
- Use BF16 in your code where possible.
- Use BF16 for checkpoints.
- Save preprocessed datasets in BF16. (Most NLP datasets use integers.)
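The BF16 recommendations above can be sketched in plain PyTorch. This is a minimal illustration, not SambaFlow-specific code: the nn.Linear model and the in-memory buffer are hypothetical stand-ins for your own model and checkpoint file.

```python
import io

import torch

# Hypothetical stand-in model; any torch.nn.Module is cast the same way.
model = torch.nn.Linear(4, 2)

# Cast all floating-point parameters and buffers to BF16.
model = model.to(torch.bfloat16)

# Cast floating-point inputs to BF16 as well. Integer data such as
# token IDs should keep its integer dtype.
x = torch.randn(3, 4).to(torch.bfloat16)

# Save the checkpoint in BF16. An in-memory buffer stands in for a file path.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Reload and confirm that the stored tensors are still BF16.
buffer.seek(0)
state = torch.load(buffer)
checkpoint_dtypes = {name: tensor.dtype for name, tensor in state.items()}
```

Checking checkpoint_dtypes after a reload is a quick way to confirm that nothing was silently saved in FP32.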
SambaFlow supports a limited set of operators. For unsupported operators, you have these options:
- Use a workaround (for example, instead of torch.repeat(), first copy the tensor and then use expand).
- Use parallel patterns. See Compose complex operations with parallel patterns.
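As an illustration of the repeat workaround mentioned above, the following plain PyTorch sketch shows one way to read "copy, then expand". The exact form that compiles best on your system is an assumption here. Note that expand can only broadcast dimensions of size 1.

```python
import torch

# Column vector with a size-1 dimension that can be broadcast.
x = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1)

# Original form: repeat tiles the data 4 times along dimension 1.
repeated = x.repeat(1, 4)  # shape (3, 4)

# Workaround: copy the tensor first, then expand the size-1 dimension.
# expand returns a broadcast view instead of materializing the tiles.
expanded = x.clone().expand(3, 4)  # shape (3, 4)

# Both forms hold the same values.
same_values = torch.equal(repeated, expanded)
```

If the dimension you need to tile is larger than 1, expand alone does not apply, and you may need to reshape first or use parallel patterns instead.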

Use the appropriate log level. See Log management arguments.
Ensure that the model compile and run phases use the same definition of the model. Otherwise, you’ll see missing or mismatched tensor errors when you run the model.
Recompilation is only needed if you change the model definition or tensor input shape.
See the following documents for more details about the compiler:

Experiment with batch sizes to determine the most efficient size. Models that include a --grad-accummulation-steps argument support changing batch size in multiples of a base batch size (--b-size argument).
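The batch-size relationship can be sketched as simple arithmetic. This assumes the usual gradient accumulation semantics, where the effective batch size equals the base batch size multiplied by the number of accumulation steps. The function name is hypothetical, so verify the behavior against your model's arguments.

```python
def accumulation_steps(target_batch_size: int, base_batch_size: int) -> int:
    """Return the accumulation steps needed to reach target_batch_size.

    Assumes effective batch size = base batch size * accumulation steps.
    """
    if target_batch_size <= 0 or base_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    steps, remainder = divmod(target_batch_size, base_batch_size)
    if remainder:
        # Only whole multiples of the base batch size are reachable.
        raise ValueError(
            f"{target_batch_size} is not a multiple of {base_batch_size}"
        )
    return steps


# Example: a base batch size of 16 reaches an effective size of 64 in 4 steps.
steps_for_64 = accumulation_steps(64, 16)
```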
Experiment with --num-workers to determine how to achieve the lowest host CPU overhead for dataloaders.
If you have noticeable transfer time between host and RDU, try including --tensormem ddr.
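One way to structure the experiments suggested above is a small timing sweep. This is a generic sketch: pick_fastest and run_once are hypothetical names, and wall-clock time is used only as a rough proxy for overhead. In practice, run_once would launch a short run of your model with the candidate setting, such as a --num-workers value.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def pick_fastest(
    candidates: Iterable[int],
    run_once: Callable[[int], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, Dict[int, float]]:
    """Time run_once for each candidate and return the fastest one.

    Returns (best_candidate, {candidate: elapsed_seconds}).
    """
    timings: Dict[int, float] = {}
    for value in candidates:
        start = clock()
        run_once(value)  # e.g. a short run with this setting
        timings[value] = clock() - start
    if not timings:
        raise ValueError("no candidates given")
    best = min(timings, key=timings.get)
    return best, timings
```

Run each candidate for the same number of steps, and repeat the sweep a few times, because a single measurement can be noisy.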

To examine where your model spends time and potentially improve performance:

Use the SambaTune reports or GUI. See SambaTune Documentation.
Use the Exporter for Prometheus, introduced in Release 1.19. See the Runtime release notes for additional information.
Run systemctl status snd to check the SambaNova daemon status if you see unexpected errors. See Manage SND.
Use SambaNova Fault Management (snfadm) at /opt/sambaflow/bin/snfadm for basic checks. For example:
- snfadm -l inventory | less
- snfadm -l fault
- snfadm -l error
Use snconfig, at /opt/sambaflow/bin/snconfig, for configuration information. Run snconfig --help for details on arguments.
Use sntilestat to learn about the status and utilization of each tile within each RDU. See Use sntilestat for performance analysis.
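The basic checks listed above can be collected with a small script. This is a hedged sketch that uses only commands named in this section. The helper name run_diagnostics and the 30-second timeout are assumptions. Commands that are not installed on the host are skipped.

```python
import shutil
import subprocess

# Commands taken from the checks described above.
DIAGNOSTIC_COMMANDS = [
    ["systemctl", "status", "snd"],
    ["/opt/sambaflow/bin/snfadm", "-l", "inventory"],
    ["/opt/sambaflow/bin/snfadm", "-l", "fault"],
    ["/opt/sambaflow/bin/snfadm", "-l", "error"],
]


def run_diagnostics(commands=DIAGNOSTIC_COMMANDS, timeout=30):
    """Run each command and return {command_line: (returncode, stdout)}.

    A value of None means the executable was not found or the command
    timed out.
    """
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[key] = None  # executable not available on this host
            continue
        try:
            completed = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout, check=False
            )
        except subprocess.TimeoutExpired:
            results[key] = None
            continue
        results[key] = (completed.returncode, completed.stdout)
    return results
```

Calling run_diagnostics() and printing the returned dictionary gives a single snapshot that you can attach to a support request.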