Use sntilestat for performance analysis
sntilestat utility helps you learn about the status and utilization of each tile within each Reconfigurable Dataflow Unit (RDU). This documentation page explains how to find information, looks at an example of using
sntilestat, and discusses briefly how to use
sntilestat CSV output to create simple visualizations with Excel.
You need a direct connection to the SambaNova host system to run
As a first step to learn about
sntilestat, review the manpage, which includes details about the arguments and some example commands.
Log in to your SambaNova system.
man sntilestatand examine the output.
This section steps through an example.
We start with an environment in which two models are running:
One very large model, running training and generating multiple checkpoints.
One very small model (logreg), running training.
We run the following command:
$ /opt/sambaflow/bin/sntilestat --skip-idle -i 1 -T --count 10
We see the following output:
Let’s have a look at the command and the output.
The command instructs
sntilestatwith these arguments:
Argument used in command Description
Show only active RDUs
Use an interval of 1 second (default is 5 seconds).
Display a timestamp for each interval.
Display 10 intervals (default is 1 interval).
Look at the manpage for additional options. For example, it might make sense to filter by user or PID.
In the output, we see the following information in the different columns:
If the PID, USER, and COMMAND columns show an active process, assume that this column shows the percentage that runs on the host.
Percentage of the tile that is currently running the job that was started with the command under COMMAND.
For the first job, the large model which runs on two RDUs, we see that all the processing is happening on the RDU (%exec is 100)
For the second job, the tiles show a low percentage of %exec, a high percentage of
%idle, and noticable percentages of
%aload. We’ll discuss under XX.
Percentage of time spent loading bit files from the PEF into the RDU for each section of the model (a more precise name might be section load).
Percentage of time spent loading per-section arguments into the RDU. As with %pload, a high percentage here suggests that the processing requirement for the model is low.
Housekeeping states. These columns are normally very low or 0.0.
OS process associated with active tile. It often makes more sense to filter by user than by PID.
OS user associated with active tile.
Command associated with active tile. In interactive mode, the command is truncated to fit your terminal window. If you save to CSV or JSON, the full command is included.
Instead of looking at the
sntilestat output in stdout, you can generate a CSV or a JSON file.
--json argument to generate a file you can feed into other tools. The JSON output consists of multiple JSON objects, one per line.
The first line is a header object that includes the platform name and any settings that are useful in interpreting the measured data.
Each subsequent line, one per iteration, consists of a JSON object that contains the measured data. + JSON output field order is undefined. New fields may be added in the future.
--csv argument to generate a CSV file that you can then use with an Excel spreadsheet. You can then examine the file or do visualizations if your spreadsheet program supports it. Here’s a screenshot of a visualization that was generated in Microsoft Excel from an
sntilestat CSV .
The engineer created the visualization for the CSV file as follows:
Checked where most time is spent (
Showed only Tile0 in the visualization because the different tiles behave very similarly for this model.
The visualization makes it possible to look for patterns. The repeated dips likely mean the host is doing something.
Could be data loader issues (especially with large vision models)
Could be output tensor data collection issues
Experimentation can then show which part of the model was responsible for the dip.
The SambaNova engineering team recommends you follow these best practices:
sntilestatonly if you have direct access to the host.
The most important information that
sntilestatreturns is the balance of RDU (%exec) and host (%idle) execution, and possibly data loader and output tensor collection issues.
Keep in mind that reported percentages are approximations, based on statistical sampling of tile states over time.
If you see high %idle and low %exec, you either have a host-side bottleneck, or the model has very little on-RDU work to do per iteration (resulting in high %pload and/or %aload in addition to low %exec).
Host-side bottlenecks are often related to data loading. If you suspect that data loading is a bottleneck, check for overloaded network or storage, or for too few data loader threads. An indication of the latter could be one or a few host CPUs becoming very busy during times when %idle increases on the RDU. The
htoptool can be useful here. Many models have an option to enable asynchronous data loaders and to control the number of associated worker threads, often with a
--num-workersor similar option. The optimal number of asynchronous worker threads is very model dependent, but is normally in the range 0 to 16.
If the model has very little on-RDU work to do per iteration, consider increasing the batch size to generate more on-RDU work per iteration. If this is not possible, consider reducing the number of tiles allocated to the model. By default, when a model is compiled, all four tiles are allocated for the model from a single RDU. If you see low %exec and high %pload or %aload, consider compiling the model to use only one or two tiles from the RDU, using the
--num-tilescompiler option. Running on fewer tiles frees up the other tiles for other processes, and reduces the overhead associated with program load (%pload) and argument load (%aload).