Get started with SambaNova Runtime

The SambaNova Runtime is responsible for all communication with SambaNova hardware. This includes hardware initialization, error handling, resource management, and interfacing with userspace processes requesting hardware resources.

Prepare your environment

To prepare your environment, ensure that the SambaFlow package is installed and that the SambaNova Daemon (SND) is running, as described below.

Runtime is installed as part of the sambaflow package, so explicit installation is usually neither necessary nor appropriate. The installer places all files in their correct locations and starts the SND service.

Check your SambaFlow installation

To run this example and any of the tutorial examples, you must have the sambaflow package installed.

  1. To check if the package is installed, run this command:

    • For Ubuntu Linux

      $ dpkg -s sambaflow
    • For Red Hat Enterprise Linux

      $ rpm -qi sambaflow
  2. Examine the output. On Ubuntu, look for Status: install ok installed (below the Package line). On Red Hat Enterprise Linux, rpm -qi prints the full package details if the package is installed.

  3. Ensure that the SambaFlow version that you are running matches the documentation you are using.

  4. If you see a message that sambaflow is not installed, contact your system administrator.
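
For example, on Ubuntu you can filter the dpkg output to just the fields that these steps check; the version shown here is illustrative:

$ dpkg -s sambaflow | grep -E '^(Status|Version)'
Status: install ok installed
Version: 1.16.x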

Install SambaFlow explicitly

If the sambaflow package is not installed in your environment, you can install it explicitly.

See the SambaFlow release notes to determine if your OS version is supported.
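
If you are not sure which OS version a host runs, /etc/os-release reports it on both Ubuntu and Red Hat Enterprise Linux; the output below is illustrative:

$ grep PRETTY_NAME /etc/os-release
PRETTY_NAME="Ubuntu 22.04 LTS"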

Install on Ubuntu

Install SambaFlow as a Debian package on Ubuntu Linux using the following command:

$ sudo apt install -y sambaflow

If you are installing SambaFlow for the first time on this host, you must reboot the host to finish configuring the system.

Install on Red Hat Enterprise Linux

Install SambaFlow as an RPM package on Red Hat Enterprise Linux using the following command:

$ sudo dnf install -y sambaflow

After installing the package, reboot the host to finish configuring the system.
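
After the reboot, you can confirm that the installed version matches your documentation. A quick check on either distribution:

$ dpkg-query -W -f='${Version}\n' sambaflow   # Ubuntu
$ rpm -q sambaflow                            # Red Hat Enterprise Linux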

Check the SambaNova Daemon (SND) status

The SambaNova Daemon (SND) starts automatically when the host boots; administrators can also manage SND manually via systemctl, as described below. When SND starts, it initializes the hardware resources and loads the driver. When SND is running, you can run applications on the hardware.

To check on the status of SND, run:

$ systemctl status snd
  1. Look for Active: active (running) in the output.

  2. If you don’t see this line, the service is not running or is degraded.
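
For scripted checks, systemctl is-active prints only the state and exits nonzero when the unit is not active; a minimal sketch:

$ systemctl is-active snd
active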

Here are possible values:

Table 1. SND status

  Active: active (running)
      SND has started successfully and is active. You can run models on this machine.

  Active: activating (start)
      SND is in the process of starting. Periodically check the service status to determine when it has fully started.

  Active: inactive (dead)
      SND is not active, but it is not in a failed state. It is likely that someone has stopped the service manually. Use sudo systemctl start snd to begin activation.

  Active: failed
      SND is in a failed state. See Runtime troubleshooting to diagnose potential causes and their fixes.
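
If the service stays in the activating state, you can poll until it settles instead of re-running status by hand; a minimal sketch:

$ while [ "$(systemctl is-active snd)" = "activating" ]; do sleep 5; done; systemctl status snd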

Configure SambaNova Runtime components explains how system administrators can manage SND.

Check Runtime health

If you suspect problems with Runtime, follow these steps to check Runtime health:

  1. Check the SND log to make sure there are no device error messages

    Check /var/log/sambaflow/runtime/snd.log for Cannot open device file messages. If you see that message, you are probably running on an unsupported kernel version. Reinstall SambaFlow.

  2. Check if all the RDUs are enumerated on the PCIe bus

    Check /var/log/sambaflow/runtime/snd.log for No RDUs discovered messages, and run lspci | grep 1e0d to confirm that all RDUs are enumerated (the command lists all RDUs known to the system). See No RDUs on PCIe bus error for troubleshooting information.

  3. Check if SND (SambaNova Daemon) is active

    Run

    $ systemctl status snd

    The Active field should show Active: active (running). See Check the SambaNova Daemon (SND) status for details. If SND isn’t active, administrators can perform remediation tasks; see Manage SND.

  4. To check with snfadm whether all hardware components are in good health, run:

    $ /opt/sambaflow/bin/snfadm -l fault

    If there are no problems, this command returns nothing. If the command lists faulty or degraded components, see SNFM Event Logs.
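
If you check Runtime health regularly, the four steps above can be run back to back; a minimal sketch using only the commands and log messages shown above:

$ sudo grep -E "Cannot open device file|No RDUs discovered" /var/log/sambaflow/runtime/snd.log
$ lspci | grep 1e0d
$ systemctl is-active snd
$ /opt/sambaflow/bin/snfadm -l fault

If the grep finds no matches, is-active prints active, and snfadm reports no faults, Runtime is healthy.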

Examine Runtime logs

Runtime logs go to the following locations:

  • /var/log/sambaflow/runtime/snd.log - SND logs

  • /var/log/sambaflow/runtime/sn.log - application logs

  • dmesg - kernel logs (from the RDU driver)

These logs are shared among all users. Starting with release 1.16, log levels form a hierarchy: each level includes all messages of equal or higher severity. For example, setting the log level to WARNING includes warnings, errors, and critical messages but ignores debug and info messages.

See Change Runtime Log Levels for details on changing verbosity.

SND logs

All SND log messages go to /var/log/sambaflow/runtime/snd.log via the rsyslog service. These messages contain information about hardware initialization, event handling, and error handling. Check this log if the SND service fails or does not start successfully.

The SND log also contains the SambaNova Fault Management system (SNFM) logs. SNFM detects and reports issues with Runtime and the hardware.

Here are the first few lines of an SND log:

$ sudo cat /var/log/sambaflow/runtime/snd.log
[NOTICE][SND][52612]: * NUMBER OF RDUS:                 8
[NOTICE][SND][52612]: *
[NOTICE][SND][52612]: * TARGET:                         Cardinal SN10 RDU
[NOTICE][SND][52612]: * RDU INFO:                       ID: 0 SN: 0x70605130ecd64715
....
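
To watch SND messages as they arrive, for example while the service starts, you can tail the log:

$ sudo tail -f /var/log/sambaflow/runtime/snd.log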

Application logs

Application logs are userspace logs generated by the models that are running on the machine (typically Python apps). If a model fails to execute, check here first. The logs contain information about the model, such as resource usage or error events that occur during execution.

Sample application logs:

$ cat /var/log/sambaflow/runtime/sn.log
[NOTICE][RSC][21757]: Orientation: FOUR_TILE
[NOTICE][RSC][21757]: -> Number of Tiles = 4
[NOTICE][RSC][21757]: --> Number of Segments = 1
[NOTICE][RSC][21757]: -----> Segment Size = 6761543552 B
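
If a model fails, you can pull just the error events out of the shared log. This sketch assumes that error lines carry an [ERROR] tag analogous to the [NOTICE] tag shown above:

$ grep -F '[ERROR]' /var/log/sambaflow/runtime/sn.log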

dmesg(1) logs

dmesg logs are kernel logs that are generated by the RDU driver, which interfaces with the DataScale hardware. These logs typically contain information about low-level hardware events, error handling, or Linux enumeration. Check here first if you suspect a Linux or hardware issue and snd.log and sn.log don’t have the information you need.

$ dmesg
[RDU]: Number of RDUs Found: 8
[IOCTL]: Mapped: size 0x40000000  va 0x7f64c0000000 pa 0xfc0000000 da 0xfc0000000 num_pg 0x1
[IOCTL]: Mapped: size 0x40000000  va 0x7f6440000000 pa 0xf80000000 da 0xf80000000 num_pg 0x1
[KRM]: Tile Scrubbing Devmem Address for RDU[0] : 3ff0000000
[KRM]: Tile Scrubbed on RDU: 0 and Tile: 0
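
To narrow the kernel log to RDU driver messages, filter on the tags shown above:

$ dmesg | grep -E '\[RDU\]|\[KRM\]|\[IOCTL\]'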

Check the status of RDUs

Before you look at the status of RDUs, it’s important to understand the two kinds of state:

  • Functional state. Functional state shows whether the physical hardware is healthy or unhealthy. Do not use an RDU that is functionally unhealthy (one that has a fault diagnosed against it); some human intervention is usually necessary.

  • Operational state. Each RDU (but not sub-RDU components like tiles or PCIe links) has an operational state which is tracked by SND and can be retrieved using the snconfig tool’s show Node op-state command. At times, RDUs are operationally unusable, even though no problem with the functional state was found.

Here’s the command to retrieve operational state:

root@labhost42:/opt/sambaflow/bin# snconfig show Node op-state

Output includes the status for each RDU, for example:

/NODE/XRDU_0/RDU_0 RDU ID: 0 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_0/RDU_1 RDU ID: 1 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_0 RDU ID: 2 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_1 RDU ID: 3 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_0 RDU ID: 4 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_1 RDU ID: 5 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_0 RDU ID: 6 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_1 RDU ID: 7 Status: RDU_STATE_FUNCTIONAL

The following table describes each operational state and whether it is recoverable:

Table 2. RDU operational states

  RDU_STATE_UNAVAILABLE
      Interpretation: RDU is not available and won’t be available until some human action is taken.
      Recoverability: Unrecoverable. Some human action is required.

  RDU_STATE_DEGRADED
      Interpretation: RDU is functional, but with some degradation (PCIe links may be down or slow, memory may be incomplete, or there may be fewer tiles than expected). Consumers should check fault management and determine whether the RDU’s state is acceptable.
      Recoverability: Potentially recoverable. No human action is needed unless the specific degradation is not acceptable for the user’s use case.

  RDU_STATE_PENDING
      Interpretation: Runtime is currently operating on this RDU, so it is unusable. Consumers should poll until it reaches a different state.
      Recoverability: Potentially recoverable. No human action is needed until the state transitions further.

  RDU_STATE_FUNCTIONAL
      Interpretation: RDU is fully healthy and no further investigation is needed.
      Recoverability: None needed; the RDU is fully healthy.

Operational state does not take allocation status into account. An RDU that is fully healthy and ready to use, but currently in use by an application, still shows up in the RDU_STATE_FUNCTIONAL operational state.
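
Because an in-use RDU still reports RDU_STATE_FUNCTIONAL, a quick scripted check is to list only the RDUs that need attention; a minimal sketch, assuming snconfig is in /opt/sambaflow/bin as shown above:

$ /opt/sambaflow/bin/snconfig show Node op-state | grep -v RDU_STATE_FUNCTIONAL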