Get started with SambaNova Runtime
The SambaNova Runtime is responsible for all communication with SambaNova hardware. This includes hardware initialization, error handling, resource management, and interfacing with userspace processes requesting hardware resources.
Prepare your environment
To prepare your environment, ensure that the SambaFlow package is installed and that the SambaNova Daemon (SND) is running, as follows.
Runtime is installed as part of the sambaflow package and explicit installation is not usually necessary or appropriate. The installer places all files in their correct locations and starts the SND service.
Check your SambaFlow installation
To run this example and any of the tutorial examples, you must have the sambaflow
package installed.
-
To check if the package is installed, run this command:
-
For Ubuntu Linux
$ dpkg -s sambaflow
-
For Red Hat Enterprise Linux
$ rpm -qi sambaflow
-
-
Examine the output and look for
Status: install ok installed
(below the Package
line). -
Ensure that the SambaFlow version that you are running matches the documentation you are using.
-
If you see a message that
sambaflow
is not installed, contact your system administrator.
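The check above can be scripted. The following is a minimal sketch, not part of SambaFlow: the function names are hypothetical, and it assumes the `dpkg -s` and `rpm -qi` commands shown above, both of which exit with a non-zero status when the package is not installed.

```python
import shutil
import subprocess


def dpkg_reports_installed(dpkg_output: str) -> bool:
    """Return True if `dpkg -s` output contains the installed status line."""
    return any(
        line.strip() == "Status: install ok installed"
        for line in dpkg_output.splitlines()
    )


def sambaflow_installed() -> bool:
    """Query the system package manager for the sambaflow package."""
    if shutil.which("dpkg"):  # Ubuntu Linux
        result = subprocess.run(
            ["dpkg", "-s", "sambaflow"], capture_output=True, text=True
        )
        return result.returncode == 0 and dpkg_reports_installed(result.stdout)
    if shutil.which("rpm"):  # Red Hat Enterprise Linux
        result = subprocess.run(
            ["rpm", "-qi", "sambaflow"], capture_output=True, text=True
        )
        return result.returncode == 0
    return False  # Neither package manager was found.
```

Calling `sambaflow_installed()` on the host returns a boolean. It does not compare the installed version against the documentation version, so that check remains a manual step.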
Install SambaFlow explicitly
If the sambaflow
package is not installed in your environment, you can install it explicitly.
See the SambaFlow release notes to determine if your OS version is supported.
Check the SambaNova Daemon (SND) status
The SambaNova Daemon (SND) starts automatically when the host boots. Administrators can also manage SND manually via systemctl
as described below.
When SND starts, it initializes the hardware resources and loads the driver.
When SND is running, you can run applications on the hardware.
To check on the status of SND, run:
$ systemctl status snd
-
Look for
Active: active (running)
in the output. -
If you don’t see this line, the service is either not running or degraded.
Here are possible values:
Status | Description |
---|---|
|
SND has started successfully and is active. You can run models on this machine. |
|
SND is in the process of starting. Periodically check the service status to determine when it has fully started. |
|
SND is not active, but it is not in a failed state. It is likely that someone has stopped the service manually. Use |
|
SND is in a failed state. See Runtime troubleshooting to diagnose potential causes and their fixes. |
Configure SambaNova Runtime components explains how system administrators can manage SND.
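For scripted monitoring, the Active line can be extracted from the `systemctl status snd` output. This is a minimal sketch under the assumption that the output contains a standard systemd `Active:` line. The function names are hypothetical and not part of SambaFlow.

```python
import re
from typing import Optional

# Matches e.g. "   Active: active (running) since ..." in systemctl output.
ACTIVE_LINE = re.compile(
    r"^\s*Active:\s*(\S+)(?:\s+\(([^)]*)\))?", re.MULTILINE
)


def parse_active_state(status_output: str) -> Optional[str]:
    """Return the state from the Active line, or None if the line is absent."""
    match = ACTIVE_LINE.search(status_output)
    if match is None:
        return None
    state, sub_state = match.group(1), match.group(2)
    return f"{state} ({sub_state})" if sub_state else state


def snd_is_running(status_output: str) -> bool:
    """True only for the 'active (running)' state described above."""
    return parse_active_state(status_output) == "active (running)"
```

To use it on a host, capture the command output, for example with `subprocess.run(["systemctl", "status", "snd"], capture_output=True, text=True).stdout`, and pass it to `snd_is_running`.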
Check Runtime health
If you suspect problems with Runtime, follow these steps to check Runtime health:
-
Check the snd log to make sure there is no device error message
Check
/var/log/sambaflow/runtime/snd.log
for Cannot open device file messages. If you see that message, you are probably running on an unsupported kernel version. Reinstall SambaFlow. -
Check if all the RDUs are enumerated on the PCIe Bus
Check
/var/log/sambaflow/runtime/snd.log
for No RDUs discovered messages, and run lspci | grep 1e0d
to confirm that all RDUs are enumerated (the command lists all RDUs known to the system). See No RDUs on PCIE bus error for troubleshooting information. -
Check if SND (SambaNova Daemon) is active
Run
$ systemctl status snd
The Active field should show
Active: active (running)
. See Check the SambaNova Daemon (SND) status for details. If SND isn’t active, administrators can perform remediation tasks. See Manage SND. -
To check with SNFADM if all hardware components are in good health, run:
$ /opt/sambaflow/bin/snfadm -l fault
If there are no problems, this command returns nothing. If the command lists faulty or degraded components, see SNFM Event Logs.
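The two log checks in the steps above can be combined into one pass over the log. The sketch below is illustrative only: the function names and the short explanations in `KNOWN_PROBLEMS` are assumptions, and only the two message fragments come from this page.

```python
from typing import Dict, Iterable

# Message fragments named in the health-check steps, with a short hint each.
KNOWN_PROBLEMS = {
    "Cannot open device file": "possibly an unsupported kernel version",
    "No RDUs discovered": "RDUs may not be enumerated on the PCIe bus",
}


def scan_snd_log(lines: Iterable[str]) -> Dict[str, int]:
    """Count how often each known problem fragment appears in the log lines."""
    counts = {fragment: 0 for fragment in KNOWN_PROBLEMS}
    for line in lines:
        for fragment in KNOWN_PROBLEMS:
            if fragment in line:
                counts[fragment] += 1
    return counts


def count_rdu_pci_devices(lspci_output: str) -> int:
    """Count lspci lines containing 1e0d, like `lspci | grep 1e0d`."""
    return sum(1 for line in lspci_output.splitlines() if "1e0d" in line)
```

On a host, `scan_snd_log(open("/var/log/sambaflow/runtime/snd.log"))` returns the counts. Reading the log may require elevated privileges. A non-zero count is a prompt to follow the corresponding troubleshooting link, not a diagnosis.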
Examine Runtime logs
Runtime logs go to the following locations:
-
/var/log/sambaflow/runtime/snd.log
- SND Logs -
/var/log/sambaflow/runtime/sn.log
- Application Logs -
dmesg
- Kernel logs (from the rdu
driver)
These logs are shared between all users. Starting with release 1.16, log levels form a hierarchy where a higher log level includes all the lower levels. For example, setting the log level to WARNING includes warnings, errors, and critical messages but ignores debug and info messages.
See Change Runtime Log Levels for details on changing verbosity.
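The hierarchy can be pictured as a threshold on an ordered list of levels. The sketch below uses only the five level names mentioned in the example above, ordered from least to most severe. This is an assumption for illustration: the actual set of Runtime levels may differ (the sample logs on this page also show NOTICE).

```python
from typing import List

# Assumed ordering, least to most severe, based on the example above.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def levels_included(configured_level: str) -> List[str]:
    """Return the levels that are logged when configured_level is set."""
    return LEVELS[LEVELS.index(configured_level):]


def is_logged(message_level: str, configured_level: str) -> bool:
    """A message is logged if it is at least as severe as the configured level."""
    return LEVELS.index(message_level) >= LEVELS.index(configured_level)
```

With this model, `levels_included("WARNING")` yields WARNING, ERROR, and CRITICAL, which matches the behavior described above.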
SND logs
All SND log messages go to /var/log/sambaflow/runtime/snd.log
via the rsyslog
service.
These messages contain information about hardware initialization, event handling, and error handling. Check this log if the SND service fails or does not start successfully.
The SND log also contains the SambaNova Fault Management system (SNFM) logs. SNFM detects and reports issues with Runtime and the hardware.
Here are the first few lines of an SND log:
$ sudo cat /var/log/sambaflow/runtime/snd.log
[NOTICE][SND][52612]: * NUMBER OF RDUS: 8
[NOTICE][SND][52612]: *
[NOTICE][SND][52612]: * TARGET: Cardinal SN10 RDU
[NOTICE][SND][52612]: * RDU INFO: ID: 0 SN: 0x70605130ecd64715
....
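The sample lines follow a `[LEVEL][COMPONENT][NUMBER]: message` pattern, which makes them easy to split for filtering. The sketch below is inferred from the sample above only. The field names are hypothetical, and treating the third field as a numeric identifier (possibly a process ID) is an assumption.

```python
import re
from typing import NamedTuple, Optional

LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\[(?P<component>[A-Z]+)\]\[(?P<ident>\d+)\]:\s?"
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    component: str
    ident: int  # Numeric field from the log line; its meaning is assumed.
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Split one log line into its fields, or return None if it doesn't match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return LogRecord(
        level=match["level"],
        component=match["component"],
        ident=int(match["ident"]),
        message=match["message"],
    )
```

For example, `parse_log_line("[NOTICE][SND][52612]: * NUMBER OF RDUS: 8")` yields a record whose `message` is `* NUMBER OF RDUS: 8`. Lines that do not follow the pattern return `None`.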
Application logs
Application logs are userspace logs generated by the models that are running on the machine (typically Python apps). If a model fails to execute, check here first. The logs contain information about the model, such as resource usage or error events that occur during execution.
Sample application logs:
$ cat /var/log/sambaflow/runtime/sn.log
[NOTICE][RSC][21757]: Orientation: FOUR_TILE
[NOTICE][RSC][21757]: -> Number of Tiles = 4
[NOTICE][RSC][21757]: --> Number of Segments = 1
[NOTICE][RSC][21757]: -----> Segment Size = 6761543552 B
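Sizes in the application log are reported in bytes, which can be hard to read at a glance. As a small illustration, the sketch below extracts the segment size from a line like the one above and converts it to GiB. The function names are hypothetical, and the line format is inferred from this sample only.

```python
import re
from typing import Optional

SEGMENT_SIZE = re.compile(r"Segment Size = (\d+) B")


def segment_size_bytes(line: str) -> Optional[int]:
    """Return the segment size in bytes, or None if the line has no size."""
    match = SEGMENT_SIZE.search(line)
    return int(match.group(1)) if match else None


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30
```

Applied to the sample, the segment size of 6761543552 B is about 6.3 GiB.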
dmesg(1) logs
dmesg
logs are kernel logs that are generated by the RDU driver, which interfaces with the DataScale hardware. These logs typically contain information about low-level hardware events, error handling, or Linux enumeration.
Check here first if you suspect a Linux or hardware issue, and if snd.log
and sn.log
don’t have the information you need.
$ dmesg
[RDU]: Number of RDUs Found: 8
[IOCTL]: Mapped: size 0x40000000 va 0x7f64c0000000 pa 0xfc0000000 da 0xfc0000000 num_pg 0x1
[IOCTL]: Mapped: size 0x40000000 va 0x7f6440000000 pa 0xf80000000 da 0xf80000000 num_pg 0x1
[KRM]: Tile Scrubbing Devmem Address for RDU[0] : 3ff0000000
[KRM]: Tile Scrubbed on RDU: 0 and Tile: 0
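The `Mapped` lines report their values in hexadecimal. The sketch below turns the name and value pairs on such a line into integers so they can be compared or summed. It is based only on the sample above, the function name is hypothetical, and it makes no claim about what each field means.

```python
import re
from typing import Dict

# A field name followed by a space and a hexadecimal value, e.g. "size 0x40000000".
HEX_FIELD = re.compile(r"(\w+) (0x[0-9a-fA-F]+)")


def parse_mapped_fields(line: str) -> Dict[str, int]:
    """Return the hexadecimal fields of a dmesg line as integers."""
    return {name: int(value, 16) for name, value in HEX_FIELD.findall(line)}
```

For the first `Mapped` line in the sample, the `size` field is 0x40000000, which is 2**30 bytes (1 GiB).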
Check the status of RDUs
Before you look at the status of RDUs, it’s important to understand the two kinds of state:
-
Functional state. Functional state shows whether the physical hardware is healthy or unhealthy. Do not use an RDU that is functionally unhealthy (has a fault diagnosed against it). Some human interaction is usually necessary.
-
Operational state. Each RDU (but not sub-RDU components like tiles or PCIe links) has an operational state which is tracked by SND and can be retrieved using the snconfig tool’s show Node op-state command. At times, RDUs are operationally unusable, even though no problem with the functional state was found.
Here’s the command to retrieve operational state:
root@labhost42:/opt/sambaflow/bin# snconfig show Node op-state
Output includes the status for each RDU, for example:
/NODE/XRDU_0/RDU_0 RDU ID: 0 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_0/RDU_1 RDU ID: 1 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_0 RDU ID: 2 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_1 RDU ID: 3 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_0 RDU ID: 4 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_1 RDU ID: 5 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_0 RDU ID: 6 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_1 RDU ID: 7 Status: RDU_STATE_FUNCTIONAL
Operational State Name | Interpretation | Recoverability |
---|---|---|
RDU_STATE_UNAVAILABLE |
RDU is not available and won’t be available until some human action is taken |
Unrecoverable. Some human action is required. |
RDU_STATE_DEGRADED |
RDU is functional, but with some degradation (may have PCIe links down/slow, incomplete memory, or fewer tiles than expected). Consumers should check fault management and determine whether the RDU’s state is acceptable. |
Potentially recoverable. No human action needed unless the specific degradation is not acceptable for the user’s use case. |
RDU_STATE_PENDING |
Runtime is currently operating on this RDU and therefore it is unusable. Consumers should poll until it reaches a different state. |
Potentially recoverable. No human action needed until state is transitioned further. |
RDU_STATE_FUNCTIONAL |
RDU is fully healthy and no further investigation is needed |
RDU is fully healthy. |
Operational state does not take into account allocation status. An RDU that is fully healthy and ready to be used, but is currently in use by an application, still shows up in the RDU_STATE_FUNCTIONAL operational state.
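To act on the operational states in a script, the `snconfig show Node op-state` output can be parsed into a mapping from RDU ID to state. This is a minimal sketch assuming the output format shown above. The function names are hypothetical, and the advice strings paraphrase the table rather than quote any tool.

```python
import re
from typing import Dict

OP_STATE = re.compile(
    r"(/NODE/\S+)\s+RDU ID:\s*(\d+)\s+Status:\s*(RDU_STATE_[A-Z]+)"
)

# Short paraphrase of the table above; wording is illustrative only.
ADVICE = {
    "RDU_STATE_FUNCTIONAL": "healthy, no further investigation needed",
    "RDU_STATE_PENDING": "poll until the state changes",
    "RDU_STATE_DEGRADED": "check fault management and decide if acceptable",
    "RDU_STATE_UNAVAILABLE": "human action required",
}


def parse_op_state(output: str) -> Dict[int, str]:
    """Map each RDU ID in the command output to its operational state."""
    return {
        int(rdu_id): status
        for _path, rdu_id, status in OP_STATE.findall(output)
    }


def summarize(states: Dict[int, str]) -> Dict[int, str]:
    """Attach the table's guidance to every RDU that is not fully functional."""
    return {
        rdu_id: ADVICE.get(status, "unknown state")
        for rdu_id, status in sorted(states.items())
        if status != "RDU_STATE_FUNCTIONAL"
    }
```

An empty result from `summarize` means every listed RDU reported RDU_STATE_FUNCTIONAL. Because operational state ignores allocation, it does not indicate whether an RDU is currently free.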