Best practices

The following guidelines from our Customer Engineering department can help you run your model more effectively on RDU.

General best practices

  • Ensure that data is properly formatted for the model you’re working with. Data issues are the most common cause of problems when compiling or training a model.

    • We have a public GitHub repository with two scripts for pretraining data creation, pipeline.py and data_prep.py.

    • You might have to take additional steps to make sure that your data work for the model you’re trying to compile and run. General data preparation for AI is beyond the scope of this documentation.

  • Our architecture works natively in BF16, so BF16 supports the greatest number of operators. To avoid extra round trips to the host:

    • Use BF16 in your code where possible.

    • Use BF16 for checkpoints.

    • Save preprocessed datasets in BF16. (Most NLP datasets use integers.)

  • SambaFlow supports a limited set of operators. For unsupported operators, you have these options:

Compilation best practices

Running your model

  • Experiment with batch sizes to determine the most efficient size. Models that include a --grad-accummulation-steps argument support changing batch size in multiples of a base batch size (--b-size argument).

  • Experiment with --num-workers to determine how to achieve the lowest host CPU overhead for dataloaders.

  • If you have noticeable transfer time between host and RDU, try including --tensormem ddr.
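The batch-size and worker-count experiments above can be organized as a small sweep. The following Python sketch is illustrative only. It assumes the common gradient-accumulation convention in which the effective batch size equals the base batch size multiplied by the number of accumulation steps. The name run_once is a hypothetical callable that you would replace with code that launches one short run of your model. None of these helpers is part of SambaFlow.

```python
import time
from typing import Callable, Dict, Iterable


def candidate_batch_sizes(base_batch_size: int, max_steps: int) -> Dict[int, int]:
    """Map each effective batch size to the accumulation steps that produce it.

    Assumption: effective batch size = base batch size * accumulation steps.
    """
    if base_batch_size < 1 or max_steps < 1:
        raise ValueError("base_batch_size and max_steps must be positive")
    return {base_batch_size * steps: steps for steps in range(1, max_steps + 1)}


def time_candidates(values: Iterable[int],
                    run_once: Callable[[int], None]) -> Dict[int, float]:
    """Time one call of run_once (a hypothetical launcher) per candidate value.

    For comparable timings, run_once should process the same number of
    samples for every candidate.
    """
    timings: Dict[int, float] = {}
    for value in values:
        start = time.perf_counter()
        run_once(value)
        timings[value] = time.perf_counter() - start
    return timings


def fastest(timings: Dict[int, float]) -> int:
    """Return the candidate with the shortest wall-clock time."""
    return min(timings, key=timings.get)
```

The same time_candidates helper can be reused for a --num-workers sweep by passing worker counts instead of batch sizes. Repeating each measurement a few times and discarding the first run usually gives steadier numbers.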

Tools

To examine where your model spends time and potentially improve performance:

  • Use the SambaTune reports or GUI. See the SambaTune Documentation.

  • Use the Exporter for Prometheus, introduced in Release 1.19. See the Runtime release notes for additional information.

  • Run systemctl status snd to check the SambaNova daemon status if you see unexpected errors. See Manage SND.

  • Use SambaNova Fault Management (snfadm) at /opt/sambaflow/bin/snfadm for basic checks. For example:

    • snfadm -l inventory | less

    • snfadm -l fault

    • snfadm -l error

  • Use snconfig, at /opt/sambaflow/bin/snconfig, for configuration information. Run snconfig --help for details on arguments.

  • Use sntilestat to learn about the status and utilization of each tile within each RDU. See Use sntilestat for performance analysis.
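When you troubleshoot, it can be convenient to capture the basic checks listed above in one pass. The following Python sketch is one possible way to do that, not an official tool. It runs only the commands named in this section (systemctl status snd and the three snfadm -l queries). The names CHECKS and run_checks are hypothetical helpers introduced for illustration, and the 30-second timeout is an arbitrary assumption.

```python
import subprocess
from typing import Dict, List

SNFADM = "/opt/sambaflow/bin/snfadm"

# Commands taken from the list above; the dictionary keys are illustrative labels.
CHECKS: Dict[str, List[str]] = {
    "snd_status": ["systemctl", "status", "snd"],
    "inventory": [SNFADM, "-l", "inventory"],
    "faults": [SNFADM, "-l", "fault"],
    "errors": [SNFADM, "-l", "error"],
}


def run_checks(checks: Dict[str, List[str]] = CHECKS,
               timeout_s: float = 30.0) -> Dict[str, str]:
    """Run each command and return its combined stdout and stderr by label."""
    results: Dict[str, str] = {}
    for name, argv in checks.items():
        try:
            # check=False: systemctl status exits non-zero for a stopped
            # service, and that output is still worth collecting.
            proc = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout_s, check=False)
            results[name] = proc.stdout + proc.stderr
        except FileNotFoundError:
            results[name] = f"not found: {argv[0]}"
        except subprocess.TimeoutExpired:
            results[name] = f"timed out after {timeout_s} s"
    return results
```

Calling run_checks() on an RDU host returns a dictionary that you can print or attach to a support request. On a machine without the SambaNova tools, each entry reports that the command was not found.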