Best practices
General best practices
- Ensure that data is properly formatted for the model you’re working with. Data issues are the most common cause of problems when compiling or training a model. We have a public GitHub repository with two scripts for pretraining data creation, pipeline.py and data_prep.py. However, you might have to take additional steps to make sure that your data works for the model you’re trying to compile and run. General data preparation for AI is beyond the scope of this documentation.
- Our architecture works natively in BF16, so BF16 supports the greatest number of operators. To avoid extra round trips to the host:
  - Use BF16 in your code where possible.
  - Use BF16 for checkpoints.
  - Save preprocessed datasets in BF16. (Most NLP datasets use integers.)
- SambaFlow supports a limited set of operators. For unsupported operators, you have these options:
  - Use a workaround (for example, instead of torch.repeat(), first copy the tensor and then expand it).
  - Use parallel patterns. See Compose complex operations with parallel patterns.
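The BF16 and operator-workaround advice above can be sketched in plain PyTorch. This is a minimal illustration on CPU, not SambaFlow-specific code: the model, tensor shapes, and variable names are assumptions chosen for the example, and the copy-then-expand pattern shown is one reading of the workaround, which only applies when the repeated dimension has size 1.

```python
import io

import torch

# --- BF16 for model weights and checkpoints ------------------------------
# A small stand-in model (hypothetical); cast its parameters to BF16.
model = torch.nn.Linear(4, 2).to(torch.bfloat16)

# Save the BF16 state dict. An in-memory buffer keeps the sketch
# self-contained; a real checkpoint would go to a file path instead.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
state = torch.load(buffer)

# --- Workaround for repeat: copy the tensor, then expand -----------------
# A (1, 4) row that should appear 3 times along dimension 0.
row = torch.arange(4, dtype=torch.float32).reshape(1, 4)

# Reference result using Tensor.repeat (the operator being avoided).
repeated = row.repeat(3, 1)

# Copy first, then expand the size-1 dimension. expand() returns a view
# and only works on dimensions of size 1.
expanded = row.clone().expand(3, 4)

same_values = torch.equal(repeated, expanded)
```

Both paths produce the same values here. Whether a given workaround compiles for your model is something to confirm against the list of supported operators.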
Compilation best practices
- Use the appropriate log level. See Log management arguments.
- Ensure that the model compile and run phases use the same definition of the model. Otherwise, you’ll see missing or mismatched tensor errors when you run the model.
- Recompilation is only needed if you change the model definition or tensor input shape.
- See the following documents for more details about the compiler:
Running your model
- Experiment with batch sizes to determine the most efficient size. Models that include a --grad-accummulation-steps argument support changing batch size in multiples of a base batch size (--b-size argument).
- Experiment with --num-workers to determine how to achieve the lowest host CPU overhead for dataloaders.
- If you have noticeable transfer time between host and RDU, try including --tensormem ddr.
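The relationship between the base batch size and gradient accumulation can be sketched as simple arithmetic. The sketch assumes the usual convention that the effective batch size is the base batch size multiplied by the number of accumulation steps; the function names are hypothetical helpers for planning an experiment, not part of SambaFlow.

```python
def effective_batch_size(b_size: int, grad_accumulation_steps: int) -> int:
    """Effective batch size, assuming base size times accumulation steps."""
    if b_size < 1 or grad_accumulation_steps < 1:
        raise ValueError("both values must be positive integers")
    return b_size * grad_accumulation_steps


def candidate_batch_sizes(b_size: int, max_steps: int) -> list[int]:
    """List the effective batch sizes reachable with 1..max_steps steps."""
    return [effective_batch_size(b_size, s) for s in range(1, max_steps + 1)]


def steps_for_target(b_size: int, target: int) -> int:
    """Accumulation steps needed to reach a target effective batch size."""
    if target < b_size or target % b_size != 0:
        raise ValueError("target must be a positive multiple of the base size")
    return target // b_size
```

For example, with a base batch size of 32, `candidate_batch_sizes(32, 4)` lists the sizes to try, and `steps_for_target(32, 128)` gives the accumulation steps for a target of 128.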
Tools
To examine where your model spends time and potentially improve performance:
- Use the SambaTune reports or GUI. See SambaTune Documentation.
- Use the Exporter for Prometheus, introduced in Release 1.19. See the Runtime release notes for additional information.
- Run systemctl status snd to check the SambaNova daemon status if you see unexpected errors. See Manage SND.
- Use SambaNova Fault Management (snfadm) at /opt/sambaflow/bin/snfadm for basic checks. For example:
  - snfadm -l inventory | less
  - snfadm -l fault
  - snfadm -l error
- Use snconfig, at /opt/sambaflow/bin/snconfig, for configuration information. Run snconfig --help for details on arguments.
- Use sntilestat to learn about the status and utilization of each tile within each RDU. See Use sntilestat for performance analysis.
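The daemon and fault checks listed above can be collected into a small script. This is a sketch under assumptions: `CHECKS`, `run_check`, and `run_all` are hypothetical names, the commands are the ones quoted above (without the `| less` pager, since output is captured), and a missing executable is reported rather than treated as an error so the script also runs on a machine without the SambaNova tools.

```python
import shutil
import subprocess

SNFADM = "/opt/sambaflow/bin/snfadm"

# Commands taken from the list above, as argument lists.
CHECKS = [
    ["systemctl", "status", "snd"],
    [SNFADM, "-l", "inventory"],
    [SNFADM, "-l", "fault"],
    [SNFADM, "-l", "error"],
]


def run_check(cmd, timeout=30):
    """Run one command; return (command string, exit code or None, output)."""
    label = " ".join(cmd)
    if shutil.which(cmd[0]) is None:
        return label, None, "executable not found"
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return label, None, "timed out"
    except OSError as exc:
        return label, None, f"could not run: {exc}"
    return label, proc.returncode, proc.stdout + proc.stderr


def run_all(checks):
    """Run every check in order and print a short summary per command."""
    for cmd in checks:
        label, code, output = run_check(cmd)
        status = "skipped" if code is None else f"exit {code}"
        print(f"[{status}] {label}")
        print(output.strip())
```

Calling `run_all(CHECKS)` on an RDU host prints each command with its exit code and captured output. Nothing runs at import time, so the checks can be extended or filtered before use.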