Get started with SambaNova Runtime
The SambaNova Runtime is responsible for all communication with SambaNova hardware. This includes hardware initialization, error handling, resource management, and interfacing with userspace processes requesting hardware resources.
Parts of Runtime
SambaNova Runtime has the following components:
- The RDU Driver interfaces with the kernel and the hardware.
- The SambaNova Daemon (SND) is a systemd service that handles errors and hardware initialization.
- Public interfaces include the SNML APIs and several CLI tools.
- The SambaRuntime library provides APIs that developers and other SambaNova products use to run applications on the hardware.
All components are in the sambaflow package. See Runtime component overview.
Prepare your environment
To prepare your environment, ensure that the SambaFlow package is installed and that the SambaNova Daemon is running, as follows.
Runtime is installed as part of the sambaflow package, so explicit installation is not usually necessary or appropriate. The installer places all files in their correct locations and starts the SND service.
Check your SambaFlow installation
To run this example and any of the tutorial examples, you must have the sambaflow package installed.
- To check if the package is installed, run this command:
  - For Ubuntu Linux
    $ dpkg -s sambaflow
  - For Red Hat Enterprise Linux
    $ rpm -qi sambaflow
- Examine the output and look for Status: install ok installed (below the Package line).
- Ensure that the SambaFlow version that you are running matches the documentation you are using.
- If you see a message that sambaflow is not installed, contact your system administrator.
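The installation check above can also be scripted. The following is a minimal sketch, not a SambaNova-provided tool: the function names `dpkg_output_shows_installed` and `sambaflow_installed` are hypothetical, and the sketch assumes only the `dpkg -s` and `rpm -qi` commands shown above, plus the fact that both commands exit with a non-zero status when the package is missing.

```python
import shutil
import subprocess


def dpkg_output_shows_installed(output: str) -> bool:
    """Return True if `dpkg -s` output contains the installed status line."""
    return any(
        line.strip() == "Status: install ok installed"
        for line in output.splitlines()
    )


def sambaflow_installed(package: str = "sambaflow") -> bool:
    """Check the package with whichever package manager exists on this host."""
    if shutil.which("dpkg"):
        result = subprocess.run(
            ["dpkg", "-s", package], capture_output=True, text=True
        )
        return result.returncode == 0 and dpkg_output_shows_installed(result.stdout)
    if shutil.which("rpm"):
        # rpm exits with a non-zero status when the package is not installed.
        result = subprocess.run(
            ["rpm", "-qi", package], capture_output=True, text=True
        )
        return result.returncode == 0
    raise RuntimeError("neither dpkg nor rpm was found on this host")
```

Calling `sambaflow_installed()` on a host returns `True` or `False`; it does not compare the installed version against the documentation, which still needs a manual check.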
Install SambaFlow explicitly
If the sambaflow package is not installed in your environment, you can install it explicitly.
See the SambaFlow release notes to determine if your OS version is supported.
Check the SambaNova Daemon status
Before running the examples, make sure the SambaNova Daemon (SND) service is running. You cannot run the examples unless SND is running.
- To check if SND is running, run this command:
  $ systemctl status snd
- Look for Active: active (running) in the output.
- If you don’t see this line, the service is not running or is degraded.
See SambaNova Daemon (SND) for more information.
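If you want to perform this check from a script, the Active line can be extracted from the command output. This is a minimal sketch under stated assumptions: `parse_active_field` and `snd_is_running` are hypothetical helper names, and the regular expression assumes the usual `systemctl status` layout in which `Active:` starts its own line.

```python
import re
import subprocess
from typing import Optional

# Matches e.g. "   Active: active (running) since ..." on its own line.
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)(?:\s+\(([^)]*)\))?", re.MULTILINE)


def parse_active_field(status_text: str) -> Optional[str]:
    """Return the Active value, such as 'active (running)', or None if absent."""
    match = ACTIVE_RE.search(status_text)
    if match is None:
        return None
    state, sub_state = match.group(1), match.group(2)
    return f"{state} ({sub_state})" if sub_state else state


def snd_is_running() -> bool:
    """Run `systemctl status snd` and report whether SND is active (running)."""
    result = subprocess.run(
        ["systemctl", "status", "snd"], capture_output=True, text=True
    )
    return parse_active_field(result.stdout) == "active (running)"
```

Any value other than `active (running)` means that you should not expect the examples to run.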
Runtime component overview
After successful installation of Runtime, you can access the set of tools and logs that are discussed in this documentation.
Some tools and logs are for administrators, while others help developers find the causes of problems during model runs.
Component | Description | See |
---|---|---|
SambaNova daemon (SND) | The SambaNova daemon (SND) runs on the DataScale SN30-H host module and manages several critical pieces of the SambaNova operation. | |
snconfig tool | The SambaNova Configuration (snconfig) tool displays, queries, configures, and manages system resources on a DataScale system. | Run |
sntilestat tool | Displays the status and utilization of each tile within each Reconfigurable Dataflow Unit (RDU). | Run |
SambaNova Fault Management (SNFADM) tool | The SambaNova Fault Management (SNFM) framework supports reporting, diagnosing, and analyzing the system error and fault events associated with a DataScale system. | |
SambaNova Slurm plugin | The Slurm plugin supports using Slurm to manage SambaNova hardware resources. | |
SambaNova logs | Several logs are available. You can configure log levels. | |
SNML APIs | SambaNova Management Layer (SNML) discusses two APIs that you can use to retrieve information about SambaNova Runtime and to make changes programmatically. | |
SambaNova Daemon (SND)
The SambaNova Daemon (SND) starts automatically when the host boots, or administrators can manage SND manually via systemctl as described below.
When SND starts, it initializes the hardware resources and loads the driver.
When SND is running, you can run applications on the hardware.
To check on the status of SND, run:
$ systemctl status snd
If the service started up successfully, you should see something similar to this message:
● snd.service - SN Devices Service
   Loaded: loaded (/usr/lib/systemd/system/snd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/snd.service.d
           └─override.conf
   Active: active (running) since Tue 2020-05-19 15:58:22 PDT; 10min ago
  Process: 33365 ExecStart=/usr/sbin/snd.sh start (code=exited, status=0/SUCCESS)
 Main PID: 33370 (setup.sh)
    Tasks: 12
   Memory: 3.8M
   CGroup: /system.slice/snd.service
           ├─33370 /bin/bash ./sbin/setup.sh -s
           ├─33476 /bin/bash ./sbin/setup.sh -s
           ├─33477 logger -p local5.info
           └─33478 /opt/sambaflow/bin/snd
The Active field shows the status of the service. Here are possible values:
Status | Description |
---|---|
|
SND has started successfully and is active. You can run models on this machine. |
|
SND is in the process of starting. Periodically check the service status to determine when it has fully started. |
|
SND is not active, but it is not in a failed state. It is likely that someone has stopped the service manually. Use |
|
SND is in a failed state. See Runtime troubleshooting to diagnose potential causes and their fixes. |
Manage SND explains how system administrators can manage SND.
snconfig tool
The snconfig (SambaNova Configuration) tool displays, queries, configures, and manages system resources on a DataScale system.
snconfig is part of the sambaflow package and can be found at /opt/sambaflow/bin/snconfig.
To see available options, run snconfig --help.
sntilestat tool
The sntilestat tool displays the status and utilization of each tile within each Reconfigurable Dataflow Unit (RDU) in each XRDU chassis in the system.
sntilestat is part of the sambaflow package and can be found at /opt/sambaflow/bin/sntilestat.
For an introduction to sntilestat and some examples, run man sntilestat.
Runtime logs
Runtime logs go to the following locations:
- /var/log/sambaflow/runtime/snd.log - SND logs
- /var/log/sambaflow/runtime/sn.log - Application logs
- dmesg - Kernel logs (from the rdu driver)
These logs are shared between all users. Starting with release 1.16, log levels form a hierarchy where a higher log level includes all the lower levels. For example, setting the log level to WARNING would log warnings, errors, and critical messages but ignore debug and info messages.
See Change Runtime Log Levels for details on changing verbosity.
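The hierarchy described above amounts to a threshold filter. The sketch below only illustrates that rule; it is not how Runtime implements or configures logging. The constant `LEVELS_BY_VERBOSITY` and both function names are hypothetical, and the list contains only the five levels named in the text.

```python
from typing import List

# Levels named in the text, ordered from least to most verbose.
# The position of any other level (for example NOTICE) is not assumed here.
LEVELS_BY_VERBOSITY: List[str] = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]


def levels_logged(configured_level: str) -> List[str]:
    """Return every level that is written when configured_level is set."""
    index = LEVELS_BY_VERBOSITY.index(configured_level.upper())
    return LEVELS_BY_VERBOSITY[: index + 1]


def is_logged(message_level: str, configured_level: str) -> bool:
    """Return True if a message at message_level passes the configured level."""
    return message_level.upper() in levels_logged(configured_level)
```

For example, `levels_logged("WARNING")` yields critical, error, and warning messages, and `is_logged("INFO", "WARNING")` is false.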
SND logs
All SND log messages go to /var/log/sambaflow/runtime/snd.log via the rsyslog service.
These messages contain information about hardware initialization, event handling, and error handling. Check this log if the SND service fails or does not start successfully.
The SND log also contains the SambaNova Fault Management (SNFM) system logs. SNFM detects and reports issues with Runtime and the hardware.
Here are the first few lines of an SND log:
$ sudo cat /var/log/sambaflow/runtime/snd.log
[NOTICE][SND][52612]: * NUMBER OF RDUS: 8
[NOTICE][SND][52612]: *
[NOTICE][SND][52612]: * TARGET: Cardinal SN10 RDU
[NOTICE][SND][52612]: * RDU INFO: ID: 0 SN: 0x70605130ecd64715
....
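Log lines in this shape are easy to split into fields for filtering. The following sketch assumes that every line follows the bracketed pattern shown in the sample, with no timestamp prefix. The names `LogRecord` and `parse_log_line` are hypothetical, and the third bracketed number is stored as a generic `ident` because the sample does not say what it represents (it is presumably a process ID).

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as "[NOTICE][SND][52612]: * NUMBER OF RDUS: 8".
LOG_LINE_RE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\[(?P<component>[A-Z]+)\]\[(?P<ident>\d+)\]:\s?(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    component: str
    ident: int
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return LogRecord(
        level=match.group("level"),
        component=match.group("component"),
        ident=int(match.group("ident")),
        message=match.group("message"),
    )
```

The same function also accepts the `[NOTICE][RSC][...]` lines that appear in sn.log, so one helper can serve both files.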
Application logs
Application logs are userspace logs generated by the models that are running on the machine (typically Python apps). If a model fails to execute, check here first. The logs contain information about the model, such as resource usage or error events that occur during execution.
Sample application logs:
$ cat /var/log/sambaflow/runtime/sn.log
[NOTICE][RSC][21757]: Orientation: FOUR_TILE
[NOTICE][RSC][21757]: -> Number of Tiles = 4
[NOTICE][RSC][21757]: --> Number of Segments = 1
[NOTICE][RSC][21757]: -----> Segment Size = 6761543552 B
dmesg(1) logs
dmesg logs are kernel logs that are generated by the RDU driver, which interfaces with the DataScale hardware. These logs typically contain information about low-level hardware events, error handling, or Linux enumeration.
Check here first if you suspect a Linux or hardware issue, and if snd.log and sn.log don’t have the information you need.
$ dmesg
[RDU]: Number of RDUs Found: 8
[IOCTL]: Mapped: size 0x40000000 va 0x7f64c0000000 pa 0xfc0000000 da 0xfc0000000 num_pg 0x1
[IOCTL]: Mapped: size 0x40000000 va 0x7f6440000000 pa 0xf80000000 da 0xf80000000 num_pg 0x1
[KRM]: Tile Scrubbing Devmem Address for RDU[0] : 3ff0000000
[KRM]: Tile Scrubbed on RDU: 0 and Tile: 0
Check the status of RDUs
Before you look at the status of RDUs, it’s important to understand the two kinds of state:
- Functional state. Functional state shows whether the physical hardware is healthy or unhealthy. Do not use an RDU that is functionally unhealthy (has a fault diagnosed against it). Some human interaction is usually necessary.
- Operational state. Each RDU (but not sub-RDU components like tiles or PCIe links) has an operational state, which is tracked by SND and can be retrieved using the snconfig tool’s show Node op-state command. At times, RDUs are operationally unusable even though no problem with the functional state was found.
Here’s the command to retrieve operational state:
root@sc-labhostg8:/opt/sambaflow/bin# snconfig show Node op-state
Output includes the status for each RDU, for example:
/NODE/XRDU_0/RDU_0 RDU ID: 0 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_0/RDU_1 RDU ID: 1 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_0 RDU ID: 2 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_1/RDU_1 RDU ID: 3 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_0 RDU ID: 4 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_2/RDU_1 RDU ID: 5 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_0 RDU ID: 6 Status: RDU_STATE_FUNCTIONAL
/NODE/XRDU_3/RDU_1 RDU ID: 7 Status: RDU_STATE_FUNCTIONAL
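To act on this output in a script, you can collect the status of each RDU into a dictionary. The sketch below is an illustration only: `parse_op_state` and `non_functional_rdus` are hypothetical names, and the regular expression assumes the path, `RDU ID:`, and `Status:` fields appear in the order shown, regardless of how the fields are wrapped across lines.

```python
import re
from typing import Dict

# One entry looks like: /NODE/XRDU_0/RDU_0 RDU ID: 0 Status: RDU_STATE_FUNCTIONAL
ENTRY_RE = re.compile(
    r"(?P<path>/NODE/\S+)\s+RDU ID:\s*(?P<rdu_id>\d+)\s+Status:\s*(?P<status>RDU_STATE_\w+)"
)


def parse_op_state(output: str) -> Dict[int, str]:
    """Map each RDU ID to its operational state name."""
    return {
        int(match.group("rdu_id")): match.group("status")
        for match in ENTRY_RE.finditer(output)
    }


def non_functional_rdus(states: Dict[int, str]) -> Dict[int, str]:
    """Keep only the RDUs whose state is not RDU_STATE_FUNCTIONAL."""
    return {
        rdu_id: status
        for rdu_id, status in states.items()
        if status != "RDU_STATE_FUNCTIONAL"
    }
```

Pass the captured text of `snconfig show Node op-state` to `parse_op_state`; an empty result from `non_functional_rdus` means every listed RDU reported RDU_STATE_FUNCTIONAL.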
Operational State Name | Interpretation | Recoverability |
---|---|---|
RDU_STATE_UNAVAILABLE | RDU is not available and won’t be available until some human action is taken. | Unrecoverable. Some human action is required. |
RDU_STATE_DEGRADED | RDU is functional, but with some degradation (may have PCIe links down/slow, incomplete memory, or fewer tiles than expected). Consumers should check fault management and determine whether the RDU’s state is acceptable. | Potentially recoverable. No human action needed unless the specific degradation is not acceptable for the user’s use case. |
RDU_STATE_PENDING | Runtime is currently operating on this RDU and therefore it is unusable. Consumers should poll until it reaches a different state. | Potentially recoverable. No human action needed until state is transitioned further. |
RDU_STATE_FUNCTIONAL | RDU is fully healthy and no further investigation is needed. | RDU is fully healthy. |
Operational state does not take into account allocation status. An RDU that is fully healthy and ready to be used, but is currently in use by an application, still shows up in the RDU_STATE_FUNCTIONAL operational state.
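The guidance to poll while an RDU is in RDU_STATE_PENDING can be sketched as a small loop. Everything here is an assumption for illustration: `wait_while_pending` is a hypothetical helper, the `get_state` callable stands in for however you retrieve the state (for example, by parsing snconfig output), and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable


def wait_while_pending(
    get_state: Callable[[], str],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state until it leaves RDU_STATE_PENDING, then return the new state."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != "RDU_STATE_PENDING":
            return state
        if clock() >= deadline:
            raise TimeoutError("RDU is still in RDU_STATE_PENDING")
        sleep(interval_s)
```

The returned state still needs interpretation: RDU_STATE_FUNCTIONAL needs no follow-up, RDU_STATE_DEGRADED calls for a look at fault management, and RDU_STATE_UNAVAILABLE requires human action.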