Use sntilestat for performance analysis

The sntilestat utility helps you learn about the status and utilization of each tile within each Reconfigurable Dataflow Unit (RDU). This documentation page explains how to find information, looks at an example of using sntilestat, and discusses briefly how to use sntilestat CSV output to create simple visualizations with Excel.

You need a direct connection to the SambaNova host system to run sntilestat effectively. If you don’t have direct access (for example, you’re targeting Slurm), the output to stdout isn’t useful.

Examine the sntilestat manpage

As a first step to learn about sntilestat, review the manpage, which includes details about the arguments and some example commands.

Log in to your SambaNova system.
Run man sntilestat and examine the output.

Example

This section steps through an example.

We start with an environment in which two models are running:

One very large model, running training and generating multiple checkpoints.
One very small model (logreg), running training.

We run the following command:

$ /opt/sambaflow/bin/sntilestat --skip-idle -i 1 -T --count 10

We see the following output:

Sntilestat example with two very different models running

Let’s have a look at the command and the output.

The command instructs sntilestat with these arguments:

Argument used in command	Description
--skip-idle	Show only active RDUs.
-i 1	Use an interval of 1 second (default is 5 seconds).
-T	Display a timestamp for each interval.
--count 10	Display 10 intervals (default is 1 interval).

Look at the manpage for additional options. For example, it might make sense to filter by user or PID.
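If you want to launch the example from a script, the following minimal Python sketch assembles the same command line from the arguments in the table above. The function names `build_sntilestat_cmd` and `run_sntilestat` are hypothetical helpers, not part of any SambaNova tooling, and only the flags shown on this page are used.

```python
import subprocess

# Path taken from the example command on this page.
SNTILESTAT = "/opt/sambaflow/bin/sntilestat"


def build_sntilestat_cmd(interval=1, count=10, skip_idle=True, timestamps=True):
    """Return the argv list for an sntilestat invocation (hypothetical helper)."""
    cmd = [SNTILESTAT]
    if skip_idle:
        cmd.append("--skip-idle")    # show only active RDUs
    cmd += ["-i", str(interval)]     # interval in seconds
    if timestamps:
        cmd.append("-T")             # display a timestamp for each interval
    cmd += ["--count", str(count)]   # number of intervals to display
    return cmd


def run_sntilestat(cmd):
    """Run the command on the SambaNova host and return its stdout as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `run_sntilestat(build_sntilestat_cmd())` only makes sense on a host where sntilestat is installed and you have direct access.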

In the output, we see the following information in the different columns:

Column	Description
`%idle`	If the PID, USER, and COMMAND columns show an active process, assume that this column shows the percentage of time that the job runs on the host rather than on the tile.
`%exec`	Percentage of time that the tile spends running the job that was started with the command under COMMAND. For the first job, the large model that runs on two RDUs, all the processing happens on the RDU (`%exec` is 100). For the second job, the tiles show a low `%exec`, a high `%idle`, and noticeable percentages of `%pload` and `%aload`. We’ll discuss under XX.
`%pload` (program load)	Percentage of time spent loading bit files from the PEF into the RDU for each section of the model (a more precise name might be section load).
`%aload` (argument load)	Percentage of time spent loading per-section arguments into the RDU. As with %pload, a high percentage here suggests that the processing requirement for the model is low.
`%chkpt` and `%quiesce`	Housekeeping states. These columns are normally very low or 0.0.
PID	OS process associated with active tile. It often makes more sense to filter by user than by PID.
USER	OS user associated with active tile.
COMMAND	Command associated with active tile. In interactive mode, the command is truncated to fit your terminal window. If you save to CSV or JSON, the full command is included.

Other ways of viewing sntilestat output

Instead of looking at the sntilestat output in stdout, you can generate a CSV or a JSON file.

JSON output

Use the --json argument to generate a file you can feed into other tools. The JSON output consists of multiple JSON objects, one per line.

The first line is a header object that includes the platform name and any settings that are useful in interpreting the measured data.
Each subsequent line, one per iteration, consists of a JSON object that contains the measured data.

JSON output field order is undefined. New fields may be added in the future.
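As an illustration of the one-object-per-line layout described above, here is a minimal Python sketch that separates the header object from the per-iteration objects. The function name `parse_sntilestat_json` is hypothetical, and the sketch deliberately makes no assumption about field names because this page does not list them.

```python
import json


def parse_sntilestat_json(lines):
    """Split line-delimited JSON into (header, samples).

    `lines` is any iterable of strings, for example an open file.
    The first non-blank line is treated as the header object and
    every later non-blank line as one iteration of measured data.
    """
    header = None
    samples = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        obj = json.loads(raw)
        if header is None:
            header = obj
        else:
            samples.append(obj)
    return header, samples
```

Because field order is undefined and new fields may appear, downstream code should look up values by key (for example with `dict.get`) and ignore keys it does not recognize.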

CSV output

Use the --csv argument to generate a CSV file that you can open in a spreadsheet program such as Microsoft Excel. You can then examine the file or, if your spreadsheet program supports it, create visualizations. Here’s a screenshot of a visualization that was generated in Microsoft Excel from an sntilestat CSV file.

The engineer created the visualization for the CSV file as follows:

Checked where most time is spent (%exec)
Showed only Tile0 in the visualization because the different tiles behave very similarly for this model.

The visualization makes it possible to look for patterns. The repeated dips in %exec likely mean that the host is doing work while the tile waits. Possible causes include:

Data loader issues (especially with large vision models)
Output tensor data collection issues

Experimentation can then show which part of the model was responsible for the dip.
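If you prefer to look for the dips programmatically rather than by eye, the following Python sketch finds runs of consecutive samples whose %exec value falls below a threshold. The function name `find_dips` and the default threshold of 50 are assumptions chosen for illustration. Extracting the %exec series for a tile from the CSV file is left out because the CSV column layout is not described on this page.

```python
def find_dips(values, threshold=50.0):
    """Return inclusive (start, end) index pairs for each run below threshold.

    `values` is a sequence of %exec samples for one tile, in time order.
    The threshold is an illustrative assumption, not a recommended value.
    """
    dips = []
    start = None
    for i, v in enumerate(values):
        if v < threshold:
            if start is None:
                start = i            # a new dip begins here
        elif start is not None:
            dips.append((start, i - 1))  # the dip ended on the previous sample
            start = None
    if start is not None:
        dips.append((start, len(values) - 1))  # dip runs to the last sample
    return dips
```

Evenly spaced dips point to something that happens once per iteration or once per fixed number of iterations, which narrows down what to experiment with.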

Best practices

The SambaNova engineering team recommends you follow these best practices:

Run sntilestat only if you have direct access to the host.
The most important information that sntilestat returns is the balance of RDU (%exec) and host (%idle) execution. The output can also hint at data loader and output tensor collection issues.
Keep in mind that reported percentages are approximations, based on statistical sampling of tile states over time.
If you see high %idle and low %exec, you either have a host-side bottleneck, or the model has very little on-RDU work to do per iteration (resulting in high %pload and/or %aload in addition to low %exec).
- Host-side bottlenecks are often related to data loading. If you suspect that data loading is a bottleneck, check for overloaded network or storage, or for too few data loader threads. An indication of the latter could be one or a few host CPUs becoming very busy during times when %idle increases on the RDU. The htop tool can be useful here. Many models have an option to enable asynchronous data loaders and to control the number of associated worker threads, often with a --num-workers or similar option. The optimal number of asynchronous worker threads is very model dependent, but is normally in the range 0 to 16.
- If the model has very little on-RDU work to do per iteration, consider increasing the batch size to generate more on-RDU work per iteration. If this is not possible, consider reducing the number of tiles allocated to the model. By default, when a model is compiled, all four tiles are allocated for the model from a single RDU. If you see low %exec and high %pload or %aload, consider compiling the model to use only one or two tiles from the RDU, using the --num-tiles compiler option. Running on fewer tiles frees up the other tiles for other processes, and reduces the overhead associated with program load (%pload) and argument load (%aload).
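The decision logic in the last best practice can be sketched as a small Python function. This is a rough triage aid under stated assumptions, not an official diagnostic: the function name `triage`, the returned labels, and all three thresholds are hypothetical, and because the reported percentages are approximations the result should be treated as a hint only.

```python
def triage(idle, exec_, pload, aload,
           low_exec=50.0, high_idle=30.0, high_load=10.0):
    """Classify one tile's percentages; every threshold is an assumption."""
    if exec_ >= low_exec:
        return "rdu-bound"           # most time is spent executing on the RDU
    if pload >= high_load or aload >= high_load:
        # Low %exec with high %pload or %aload: little on-RDU work per
        # iteration. Consider a larger batch size or fewer tiles.
        return "little-rdu-work"
    if idle >= high_idle:
        # Low %exec, high %idle, no load overhead: suspect the host side,
        # for example data loading.
        return "host-bottleneck"
    return "inconclusive"
```

For example, `triage(idle=85, exec_=15, pload=0, aload=0)` returns `"host-bottleneck"`, while `triage(idle=60, exec_=10, pload=15, aload=15)` returns `"little-rdu-work"`.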