Get started with SambaNova Runtime

The SambaNova Runtime is responsible for all communication with SambaNova hardware. This includes hardware initialization, error handling, resource management, and interfacing with userspace processes requesting hardware resources.

Parts of Runtime

SambaNova Runtime has the following components:

  • The RDU Driver interfaces with the kernel and the hardware.

  • The SambaNova Daemon (SND) is a systemd service that handles errors and hardware initialization.

  • The Runtime libraries provide APIs used by developers and other SambaNova products that enable them to run applications on the hardware.

All three components are in the sambaflow package.

Prepare your environment

To prepare your environment, ensure that the sambaflow package is installed and that the SambaNova Daemon is running, as follows.

Runtime is installed as part of the sambaflow package and explicit installation is not usually necessary or appropriate. The installer places all files in their correct locations and starts the SND service.

Check your SambaFlow installation

To run this example and any of the tutorial examples, you must have the sambaflow package installed.

  1. To check if the package is installed, run this command:

    • For Ubuntu Linux

      $ dpkg -s sambaflow
    • For Red Hat Enterprise Linux

      $ rpm -qi sambaflow
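  2. Examine the output. On Ubuntu, look for Status: install ok installed (below the Package line). On Red Hat Enterprise Linux, rpm prints the package details (Name, Version, and so on) when the package is installed.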

  3. Ensure that the SambaFlow version that you are running matches the documentation you are using.

  4. If you see a message that sambaflow is not installed, contact your system administrator.
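The check in the steps above can also be scripted. The following is a minimal sketch, not part of the SambaFlow tooling: the function name sambaflow_installed is hypothetical, and it assumes you have already captured the text printed by dpkg -s sambaflow or rpm -qi sambaflow (including stderr, where dpkg reports a missing package).

```python
def sambaflow_installed(query_output: str) -> bool:
    """Interpret captured output of `dpkg -s sambaflow` or `rpm -qi sambaflow`.

    Hypothetical helper for illustration only.
    """
    lines = [line.strip() for line in query_output.splitlines()]

    # Both dpkg and rpm include "is not installed" in their message
    # when the package is missing.
    if any("is not installed" in line for line in lines):
        return False

    # dpkg -s: the Status line reports the install state.
    if "Status: install ok installed" in lines:
        return True

    # rpm -qi: an installed package is described with a "Name : <pkg>" field.
    for line in lines:
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Name" and value.strip() == "sambaflow":
            return True

    return False
```

Passing the text "Package: sambaflow" followed by "Status: install ok installed" returns True, and passing "package sambaflow is not installed" returns False.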

Install SambaFlow explicitly

If the sambaflow package is not installed in your environment, you can install it explicitly.

See the SambaFlow release notes to determine if your OS version is supported.

Install on Ubuntu

Install SambaFlow as a Debian package on Ubuntu Linux using the following command:

$ sudo apt install -y sambaflow

If you are installing SambaFlow for the first time on this host, you must reboot the host to finish configuring the system.

Install on Red Hat Enterprise Linux

Install SambaFlow as an RPM package on Red Hat Enterprise Linux using the following command:

$ sudo dnf install -y sambaflow

After installing the package, reboot the host to finish configuring the system.

Check the SambaNova Daemon status

Before running the examples, make sure the SND (SambaNova Daemon) service is running. You cannot run the examples unless SND is running.

  1. To check if SND is running, run this command:

    $ systemctl status snd
  2. Look for Active: active (running) in the output.

  3. If you don’t see this line, the service is not running or is degraded.
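If you want a script to perform this check before launching an example, one option is systemctl is-active, which prints a single word such as active or inactive. The following is a minimal sketch under that assumption. The function name snd_is_active and its run parameter are hypothetical and exist only so that the command runner can be swapped out in tests.

```python
import subprocess


def snd_is_active(run=subprocess.run) -> bool:
    """Return True when `systemctl is-active snd` prints "active".

    `run` defaults to subprocess.run; it is a parameter only so that a
    stand-in can be supplied on machines without systemd.
    """
    result = run(
        ["systemctl", "is-active", "snd"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code just means "not active"
    )
    return result.stdout.strip() == "active"
```

A launcher script might call snd_is_active() and stop with a clear message when it returns False, rather than letting the example fail later.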

See SambaNova Daemon (SND) for more information.

Runtime component overview

After successful installation of Runtime, you can access the set of tools and logs that are discussed in this documentation.

Some tools and logs are for administrators, while others help developers find the causes of problems during model runs.
Table 1. Runtime components

  • Component: SambaNova daemon (SND)
    Description: The SambaNova daemon (SND) is running on the DataScale SN30-H host module and manages several critical pieces of the SambaNova operation.
    See: SambaNova Daemon (SND)

  • Component: snconfig tool
    Description: The SambaNova Configuration (snconfig) tool displays, queries, configures and manages system resources on a DataScale system.
    See: Run snconfig --help for details.

  • Component: SambaNova Fault Management (SNFADM) tool
    Description: The SambaNova Fault Management (SNFM) framework supports reporting, diagnosing, and analyzing the system error and fault events associated with a DataScale system.
    See: SambaNova fault management (SNFM)

  • Component: SambaNova Slurm plugin
    Description: The Slurm plugin supports using Slurm to manage SambaNova hardware resources.
    See: Set up your SambaNova environment to use Slurm, Use Slurm with SambaNova

  • Component: SambaNova logs
    Description: Several logs are available. You can configure log levels.
    See: Change runtime log levels

SambaNova Daemon (SND)

The SambaNova Daemon (SND) starts automatically when the host boots, or administrators can manage SND manually via systemctl as described below. When SND starts, it initializes the hardware resources and loads the driver. When SND is running, you can run applications on the hardware.

To check on the status of SND, run:

$ systemctl status snd

If the service started up successfully, you should see something similar to this message:

● snd.service - SN Devices Service
   Loaded: loaded (/usr/lib/systemd/system/snd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/snd.service.d
           └─override.conf
   Active: active (running) since Tue 2020-05-19 15:58:22 PDT; 10min ago
  Process: 33365 ExecStart=/usr/sbin/snd.sh start (code=exited, status=0/SUCCESS)
 Main PID: 33370 (setup.sh)
    Tasks: 12
   Memory: 3.8M
   CGroup: /system.slice/snd.service
           ├─33370 /bin/bash ./sbin/setup.sh -s
           ├─33476 /bin/bash ./sbin/setup.sh -s
           ├─33477 logger -p local5.info
           └─33478 /opt/sambaflow/bin/snd

The Active field shows the status of the service. Here are possible values:

Table 2. Runtime status

  • Status: Active: active (running)
    Description: SND has started successfully and is active. You can run models on this machine.

  • Status: Active: activating (start)
    Description: SND is in the process of starting. Periodically check the service status to determine when it has fully started.

  • Status: Active: inactive (dead)
    Description: SND is not active, but it is not in a failed state. It is likely that someone has stopped the service manually. Use sudo systemctl start snd to begin activation.

  • Status: Active: failed
    Description: SND is in a failed state. See Runtime troubleshooting to diagnose potential causes and their fixes.
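To act on the Active field from a script, you can extract it from captured systemctl status snd output and map it to the guidance in Table 2. The following is a minimal sketch. The names parse_active and snd_guidance are hypothetical, and the guidance strings paraphrase the table.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "   Active: active (running) since ..." or "Active: failed".
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\w+)(?:\s+\(([^)]*)\))?", re.MULTILINE)


def parse_active(status_output: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (state, substate) from `systemctl status` text, or None."""
    match = ACTIVE_RE.search(status_output)
    if match is None:
        return None
    return match.group(1), match.group(2)


def snd_guidance(status_output: str) -> str:
    """Map the Active field to the next step described in Table 2."""
    parsed = parse_active(status_output)
    if parsed is None:
        return "No Active line found in the output."
    state, substate = parsed
    if state == "active" and substate == "running":
        return "SND is running. You can run models on this machine."
    if state == "activating":
        return "SND is starting. Check the status again shortly."
    if state == "inactive":
        return "SND is stopped. Run: sudo systemctl start snd"
    if state == "failed":
        return "SND failed. See Runtime troubleshooting."
    return f"State not listed in Table 2: {state} ({substate})"
```

Any state that Table 2 does not list falls through to the last branch instead of being guessed at.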

Manage SND explains how system administrators can manage SND.

snconfig tool

The snconfig (SambaNova Configuration) tool displays, queries, configures, and manages system resources on a DataScale system.

snconfig is part of the sambaflow package and can be found at /opt/sambaflow/bin/snconfig.

To see available options, run snconfig --help.

Runtime logs

Runtime logs go to the following locations:

  • /var/log/sambaflow/runtime/snd.log - SND Logs

  • /var/log/sambaflow/runtime/sn.log - Application Logs

  • dmesg - Kernel logs (from the rdu driver)

All of these logs are shared between all users.

See Change Runtime Log Levels for details on changing verbosity.

SND logs

All SND log messages go to /var/log/sambaflow/runtime/snd.log via the rsyslog service. These messages contain information about hardware initialization, event handling, and error handling. Check this log if the SND service fails or does not start successfully.

The SND log also contains the SambaNova Fault Management (SNFM) log messages. SNFM detects and reports issues with Runtime and the hardware.

Here are the first few lines of an SND log:

$ sudo cat /var/log/sambaflow/runtime/snd.log
[NOTICE][SND][52612]: * NUMBER OF RDUS:                 8
[NOTICE][SND][52612]: *
[NOTICE][SND][52612]: * TARGET:                         Cardinal SN10 RDU
[NOTICE][SND][52612]: * RDU INFO:                       ID: 0 SN: 0x70605130ecd64715
....

Application logs

Application logs are userspace logs generated by the models that are running on the machine (typically Python apps). If a model fails to execute, check here first. The logs contain information about the model, such as resource usage or error events that occur during execution.

Sample application logs:

$ cat /var/log/sambaflow/runtime/sn.log
[NOTICE][RSC][21757]: Orientation: FOUR_TILE
[NOTICE][RSC][21757]: -> Number of Tiles = 4
[NOTICE][RSC][21757]: --> Number of Segments = 1
[NOTICE][RSC][21757]: -----> Segment Size = 6761543552 B
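The sample lines in snd.log and sn.log share a visible layout: a level, a component tag, and a process ID in square brackets, followed by the message. The following sketch splits a line into those fields. The layout is inferred only from the samples above, so treat the pattern as an assumption. LogRecord and parse_log_line are hypothetical names, and lines in a real log may carry additional prefixes that this pattern does not handle.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    level: str      # e.g. "NOTICE"
    component: str  # e.g. "SND" or "RSC"
    pid: int        # process ID shown in the third bracket
    message: str


# Layout inferred from the sample lines: [LEVEL][COMPONENT][PID]: message
LINE_RE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\[(?P<component>\w+)\]\[(?P<pid>\d+)\]: (?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None if the line does not fit the layout."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return LogRecord(
        match["level"], match["component"], int(match["pid"]), match["message"]
    )
```

With records in this form, you can filter a log by component or process ID, for example to isolate the lines written by one model run.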

dmesg(1) logs

dmesg logs are kernel logs that are generated by the RDU driver, which interfaces with the DataScale hardware. These logs typically contain information about low-level hardware events, error handling, or Linux enumeration. Check here first if you suspect a Linux or hardware issue, or if snd.log and sn.log don’t have the information you need.

$ dmesg
[RDU]: Number of RDUs Found: 8
[IOCTL]: Mapped: size 0x40000000  va 0x7f64c0000000 pa 0xfc0000000 da 0xfc0000000 num_pg 0x1
[IOCTL]: Mapped: size 0x40000000  va 0x7f6440000000 pa 0xf80000000 da 0xf80000000 num_pg 0x1
[KRM]: Tile Scrubbing Devmem Address for RDU[0] : 3ff0000000
[KRM]: Tile Scrubbed on RDU: 0 and Tile: 0
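Because dmesg contains messages from the whole kernel, it can help to group the RDU driver lines by their bracketed tag. The following is a minimal sketch that assumes the tag layout shown in the sample above ([RDU], [IOCTL], [KRM]). The function name group_by_tag is hypothetical. The pattern searches anywhere in the line, so the timestamp prefix that dmesg normally prints is tolerated.

```python
import re
from collections import defaultdict
from typing import Dict, List

# An uppercase tag in brackets followed by ": ", as in "[KRM]: Tile Scrubbed ...".
# A numeric timestamp such as "[   12.345678]" does not match the tag pattern.
TAG_RE = re.compile(r"\[(?P<tag>[A-Z]+)\]: (?P<message>.*)$")


def group_by_tag(dmesg_text: str) -> Dict[str, List[str]]:
    """Group tagged dmesg lines by tag, preserving their original order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = TAG_RE.search(line)
        if match is not None:
            groups[match["tag"]].append(match["message"])
    return dict(groups)
```

Lines without a matching tag are skipped, so other kernel subsystems that use a different format are left out of the result.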

Check the status of RDUs

Before you look at the status of RDUs, it’s important to understand the two kinds of state:

  • Functional state. Functional state shows whether the physical hardware is healthy or unhealthy. Do not use an RDU that is functionally unhealthy (has a fault diagnosed against it). Some human interaction is usually necessary.

  • Operational state. Each RDU (but not sub-RDU components like tiles or PCIe links) has an operational state which is tracked by SND and can be retrieved using the snconfig tool’s show Node op-state command. At times, RDUs are operationally unusable, even though no problem with the functional state was found.

Here’s the command to retrieve operational state:

root@sc-labhostg8:/opt/sambaflow/bin# snconfig show Node op-state

Output includes the status for each RDU, for example:

/NODE/XRDU_0/RDU_0 RDU ID: 0 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_0/RDU_1 RDU ID: 1 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_0 RDU ID: 2 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_1 RDU ID: 3 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_0 RDU ID: 4 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_1 RDU ID: 5 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_0 RDU ID: 6 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_1 RDU ID: 7 Status: RDU_STATE_FUNCTIONAL

Each operational state is interpreted as follows:

  • Operational state name: RDU_STATE_UNAVAILABLE
    Interpretation: RDU is not available and won’t be available until some human action is taken.
    Recoverability: Unrecoverable. Some human action is required.

  • Operational state name: RDU_STATE_DEGRADED
    Interpretation: RDU is functional, but with some degradation (may have PCIe links down/slow, incomplete memory, or fewer tiles than expected). Consumers should check fault management and determine whether the RDU’s state is acceptable.
    Recoverability: Potentially recoverable. No human action needed unless the specific degradation is not acceptable for the user’s use case.

  • Operational state name: RDU_STATE_PENDING
    Interpretation: Runtime is currently operating on this RDU and therefore it is unusable. Consumers should poll until it reaches a different state.
    Recoverability: Potentially recoverable. No human action needed until state is transitioned further.

  • Operational state name: RDU_STATE_FUNCTIONAL
    Interpretation: RDU is fully healthy and no further investigation is needed.
    Recoverability: RDU is fully healthy.

Operational state does not take into account allocation status. An RDU that is fully healthy and ready to be used, but is currently in use by an application, still shows up in the RDU_STATE_FUNCTIONAL operational state.
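A script that schedules work across RDUs might parse the snconfig show Node op-state output and set aside any RDU that is not in the RDU_STATE_FUNCTIONAL state. The following is a minimal sketch. The line layout is taken from the sample output above and is an assumption about other releases, and the names parse_op_state and needs_attention are hypothetical.

```python
import re
from typing import Dict

# Layout taken from the sample output:
# /NODE/XRDU_0/RDU_0 RDU ID: 0 Status: RDU_STATE_FUNCTIONAL
RDU_LINE_RE = re.compile(
    r"^(?P<path>/NODE/\S+)\s+RDU ID:\s*(?P<rdu_id>\d+)\s+"
    r"Status:\s*(?P<status>RDU_STATE_\w+)$"
)


def parse_op_state(output: str) -> Dict[int, str]:
    """Map each RDU ID to its operational state name."""
    states: Dict[int, str] = {}
    for line in output.splitlines():
        match = RDU_LINE_RE.match(line.strip())
        if match is not None:
            states[int(match["rdu_id"])] = match["status"]
    return states


def needs_attention(states: Dict[int, str]) -> Dict[int, str]:
    """Return only the RDUs whose state is not RDU_STATE_FUNCTIONAL."""
    return {
        rdu_id: status
        for rdu_id, status in states.items()
        if status != "RDU_STATE_FUNCTIONAL"
    }
```

Because operational state ignores allocation, an empty result from needs_attention means the RDUs are healthy, not that they are free.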