Troubleshoot SambaNova Runtime

Developers and administrators sometimes experience problems when they’re using their SambaNova environment. This page helps you find the cause and resolve the problem.

The page includes:

A table of frequently asked questions with links to answers.
Common errors and how to resolve them.
Troubleshooting tasks, such as resetting RDUs.

Frequently asked questions

Question	See
Where do I find logs?	Runtime logs are at `/var/log/sambaflow/runtime`. See Runtime logs for details.
How do I increase logging verbosity?	See Change log levels.
What are the tools an administrator can use?	Tools include snconfig, SambaNova Fault Management (SNFADM), and Slurm.
How can I programmatically interact with SambaNova Runtime?	The SNML API allows all users to retrieve information, and the SNML admin API supports administrative tasks. SambaNova Management Layer (SNML) discusses the 2 APIs and includes code examples.
Where can I find SNFM logs?	Because SNFM is a part of SND, you can find SNFM log messages in `/var/log/sambaflow/runtime/snd.log`.
When and how do I do an RDU reset?	The Runtime components perform automatic RDU recovery in many situations. At times, however, administrators have to reset RDUs manually. See Reset RDUs.
How do I gracefully shut down and power up my rack?	Under rare circumstances, system administrators might have to power cycle the rack. See Gracefully shutting down the DataScale SN30 rack and Power on process overview in the DataScale system administration document.

Troubleshoot your installation

By default, the SambaNova Runtime package is included in your environment. SambaNova periodically releases new versions. Install the most recent version of the complete SambaFlow package (which includes Runtime) when the package becomes available so that your OS and other Runtime components remain compatible with the SambaNova hardware.

When there are problems with your installation, you might see an error like the following.

Nov  1 17:28:54 host1 python3[3654]:  File "__init__.py", line 87, in init pysnstat.__init__
Nov  1 17:28:54 host1 python3[3654]:  File "pysnstat.py", line 42, in pysnstat.pysnstat.get_platform_module
Nov  1 17:28:54 host1 python3[3654]: RuntimeError: SNStat: unknown platform Unknown Platform

Administrators can follow these steps to resolve the issue:

You need superuser privileges for some of the commands.

Make sure the package is installed.

For an Ubuntu system, run these commands:

apt list --installed | grep sambanova-runtime
apt list --installed | grep sambaflow

Verify that the command returns something like the following:

sambanova-runtime-diag/now 1.14.4-2210171355 amd64 [installed,upgradable to: 1.14.4-2211011246]
sambanova-runtime-mlnx/now 1.14.4-2210171355 amd64 [installed,upgradable to: 1.14.4-2211011246]
sambanova-runtime-scripts/focal,now 1.14.2-2209130703 all [installed,upgradable to: 1.14.4-2211011246]
sambanova-runtime/now 1.14.4-2210171355 amd64 [installed,upgradable to: 1.14.4-2211011246]

For RHEL, run this command:

rpm -qi sambaflow

Verify that the command returns something like the following:

Name        : sambaflow
Version     : 1.14.3
Release     : 8.el8
Architecture: x86_64
Install Date: Mon 14 Nov 2022 05:38:50 PM EST
Group       : SambaFlow
Size        : 0
License     : (c) SambaNova Systems
Signature   : (none)
Source RPM  : sambaflow-1.14.3-8.el8.src.rpm
Build Date  : Sat 05 Nov 2022 02:19:06 PM EDT
Build Host  : sc-c15_DOCKER
Relocations : /opt/sambaflow
Vendor      : SambaNova
Summary     : SambaNova SambaFlow

Make sure your OS/kernel version is supported. See the Runtime Release Notes.
Make sure the system is booted into an appropriate kernel version/OS pair.

Check SND status. You should see something like this:

root@sn-host15:~# systemctl status snd
● snd.service - SN Devices Service
     Loaded: loaded (/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/snd.service.d
             └─override.conf
     Active: active (running) since Mon 2022-10-17 15:12:14 PDT; 2 weeks 2 days ago
   Main PID: 86552 (snd)
      Tasks: 9 (limit: 629145)
     Memory: 54.2M
     CGroup: /system.slice/snd.service
             └─86552 /opt/sambaflow/bin/snd

Make sure the RDU driver is loaded:
```
lsmod | grep rdu
```
For the RDU driver, you might see rdu_sn20, rdu_sn30, or similar for newer hardware architectures.

If SND is running but the driver is not loaded, the wrong runtime package might be installed, and SND might have failed over to standalone mode.

To check whether this is the case, look for messages like the following in /var/log/sambaflow/runtime/snd.log:

Oct 31 21:18:56 host1 [root][snd][1823357]: [NOTICE][SND][meta hostname=host1 tid=1823357] No RDU found.
Oct 31 21:18:56 host1 [root][snd][1823357]: [NOTICE][SNML_SERVER][meta hostname=host1 tid=1823357] SND SNML server started at 127.0.0.1:50053
Oct 31 21:18:57 host1 [root][snd][1823357]: [ERR][SNFM][meta hostname=host1 tid=1823357] Sambaflow package platform (DataScale SN10) does not match physical RDU platform
Oct 31 21:18:57 host1 [root][snd][1823357]: [ERR][SNFM][meta hostname=host1 tid=1823357] Platform name unknown

For example, the message Sambaflow package platform (DataScale SN10) does not match physical RDU platform means that the installed package (built for SN10) does not match the physical platform of the host.
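The check above can also be scripted. The following is a minimal sketch, not an official tool: it scans log lines for the two messages shown in the excerpt above. The function name `diagnose_snd_log` and the shape of the returned dictionary are assumptions made for illustration.

```python
import re

# Message fragments copied from the snd.log excerpt above.
MISMATCH = re.compile(
    r"Sambaflow package platform \((?P<platform>[^)]+)\) "
    r"does not match physical RDU platform"
)
NO_RDU = "No RDU found."


def diagnose_snd_log(lines):
    """Summarize platform-related findings in an iterable of snd.log lines.

    Hypothetical helper: returns whether "No RDU found." was logged and,
    if a platform mismatch was logged, the platform named in the message.
    """
    findings = {"no_rdu_found": False, "package_platform": None}
    for line in lines:
        if NO_RDU in line:
            findings["no_rdu_found"] = True
        match = MISMATCH.search(line)
        if match:
            findings["package_platform"] = match.group("platform")
    return findings
```

To use it, open /var/log/sambaflow/runtime/snd.log (this might require superuser privileges) and pass the file object to `diagnose_snd_log`. A non-empty `package_platform` value suggests that the installed package targets a different platform than the host.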

Developer: Troubleshoot common errors

This section lists errors that developers might encounter and that they can address even if they don’t have administrator or superuser privileges. Errors are shown in the console and included in log files:

When you compile or run a model or perform other management tasks, most errors are shown on the console.
Errors are also added to one of the logs in /var/log/sambaflow/runtime.

PEF Version error

Error

[ERR][LIB]: sn_topology_get_version: PEF Version (Expected = [0.7.0], Actual = [0.7.2])
[ERR][LIB]: sn_create_session: PEF version mismatch. Abort...

Cause

PEF (Processor Executable Format) files are only forward compatible. If you see this message, the PEF version of the installed Runtime (actual) and the version of the compiled PEF (expected) don’t match.

When you install SambaFlow, you now install the complete software stack. In earlier versions of the software, it was possible to install components separately.

Solution

To identify the compatible PEF for the installed runtime, run the following commands on the host.

Red Hat

$ rpm -q --provides sambanova-runtime | grep PEF
PEF = 0.13.0 # This is the compatible PEF version for the installed sambanova-runtime

Ubuntu

$ apt show sambanova-runtime | grep -i pef
Provides: pef (= 0.13.0) # This is the compatible PEF version for the installed sambanova-runtime

If that version does not match the version of your compiled PEF, reinstall the complete SambaFlow package, and then compile and run your application again.
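When you collect errors from many runs, it can help to extract the two version numbers from the error line automatically. The sketch below assumes the line has exactly the format shown in the error output above. The helper names `parse_pef_versions` and `versions_match` are hypothetical.

```python
import re

# Pattern for the version line shown in the PEF version error above.
PEF_LINE = re.compile(
    r"PEF Version \(Expected = \[(?P<expected>[\d.]+)\], "
    r"Actual = \[(?P<actual>[\d.]+)\]\)"
)


def parse_pef_versions(line):
    """Return (expected, actual) version strings, or None if the line has another format."""
    match = PEF_LINE.search(line)
    if match is None:
        return None
    return match.group("expected"), match.group("actual")


def versions_match(line):
    """Return True only when both versions are present in the line and equal."""
    versions = parse_pef_versions(line)
    return versions is not None and versions[0] == versions[1]
```

The comparison is a plain string equality check. It does not attempt to model the forward-compatibility rules, which this page does not spell out.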

Runtime errors

When you run your application, errors are shown as an exception. The /var/log/sambaflow/runtime/sn.log file shows a resource allocation error, and the Python runtime exception backtrace shows ResourceDB Alloc Failure.

This section lists common runtime errors.

Device ISA not supported error

Error

[ERR][DEV][22489]: Unable to allocate resources requested by graph. Reason: Bad address
[ERR][LIB][22489]: Resource Allocation failed: Device ISA not supported
[ERR][LIB][22489]: Unable to allocate requested rsc

Cause

The RDU architecture that the PEF was compiled for does not match the RDU architecture on the system.

Solution

This is a rare error that can occur if your model was compiled on a different version of the hardware.

Recompile the model on the current version of the hardware.

No Tiles Available error

Error

[ERR][DEV][5383]: Unable to allocate resources requested by graph. Reason: Not Enough Resources available
Requested Node(s): 1 {2H} Available RDU(s): 8 {0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1}
Requested Memory Per Node (B): {8256}
Total: Available Memory Blocks Per Node(GB):
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}

[ERR][LIB][5383]: Resource Allocation failed: No Tile(s) Available
[ERR][LIB][5383]: Unable to allocate requested rsc

Cause

This error indicates that there aren’t enough hardware resources to run your application.

Solution

Here’s what you can do:

If you know that others are also running models on this system, the problem might be that their models are using too many of the resources. Wait, and try again later.
Ask your system administrator to check if the system is healthy and to run sntilestat to see the status of the hardware.

No Device Memory Available error

Error

[ERR][DEV][7990]: Unable to allocate resources requested by graph. Reason: Not Enough Resources available
Requested Node(s): 1 {2H} Available RDU(s): 8 {0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf}
Requested Memory Per Node (B): {8256}
Total: Available Memory Blocks Per Node(GB):
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}

[ERR][LIB][7990]: Resource Allocation failed: No Device Memory Available
[ERR][LIB][7990]: Unable to allocate requested rsc

Cause

The device resource allocation failures are accompanied by a summary that displays:

The resources requested by the model
The available resources on the system

Information is available for nodes and memory:

Nodes: The error shows the number of nodes requested by the model and the corresponding orientation. Each available RDU on the system is listed with a bitmask of the tiles available on it. The 4 least significant bits represent the availability of the 4 tiles on the RDU. For example, a bitmask of 0x5 means that tiles 0 and 2 are available.
Memory: The requested memory per node is shown in bytes. Each line in the Available Memory section of the summary corresponds to one RDU. Total is followed by how the memory is partitioned and the memory available for that partition.
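The tile bitmask described above can be decoded with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and numbering the RDUs by their position in the list is an assumption, because the summary itself does not label them.

```python
import re


def available_tiles(bitmask, tiles_per_rdu=4):
    """Return the indices of the tiles whose bit is set in the mask."""
    return [tile for tile in range(tiles_per_rdu) if bitmask & (1 << tile)]


def parse_available_rdus(line):
    """Map each RDU (by list position, an assumption) to its available tiles.

    Expects a line containing 'Available RDU(s): N {0x.., 0x.., ...}'.
    Returns an empty dict if the line does not contain that summary.
    """
    match = re.search(r"Available RDU\(s\): \d+ \{([^}]*)\}", line)
    if match is None:
        return {}
    masks = [int(mask.strip(), 16) for mask in match.group(1).split(",")]
    return {rdu: available_tiles(mask) for rdu, mask in enumerate(masks)}
```

For example, `available_tiles(0x5)` returns `[0, 2]`, which matches the description above. A mask of 0xf means that all 4 tiles are available, and a mask of 0x1 means that only tile 0 is available.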

If you set rdu_log_level=255, the details of allocation options that Runtime tried are available in the kernel logs (use dmesg). See Change kernel log levels.

Solution

Wait for a while and try your command again. How long you have to wait depends on the types of job that are customarily submitted in your environment. Ask the system administrator to determine if other jobs are using all system resources and to verify system health.

Symbol lookup failure error

Error

Log ID initialized to: [user][app][pid] at /var/log/sambaflow/runtime/sn.log
2020-5-17 19:36:57 <hostname>: [ERR][MAP][pid]: Unable to Find Key: xxx
2020-5-17 19:36:57 <hostname>: [ERR][SYM][pid]: Invalid symbol id: -1
2020-5-17 19:36:57 <hostname>: [ERR][LIB][pid]: KeyError: xxx
...
RuntimeError: Unable to set tensors property: Symbol Lookup Failure

Cause

A tensor KeyError occurs if a PEF is compiled with a certain symbol name and the application passes a different symbol name at runtime.

Solution

Check the samba logs to understand the expectation for the tensor name. You can find those logs in out/<pef_name>/<pef_name>.samba.log and on the console.

Check the compiler logs to understand the expectation for the tensor name. The two names should match.

Developer: Use Graph CLI to debug applications

In most cases, developers who use SambaFlow take advantage of Python debugging techniques when they see problems with their application. In rare situations, developers might use Graph CLI to explore. Graph CLI shows what’s happening close to the hardware.

How to run Graph CLI

Run your application with the environment variable SNCLI_SERVER.

SNCLI_SERVER="localhost:50052" python /path/to/app <args>

Launch a CLI client in another window.
```
$ /opt/sambaflow/diag/run_sncli.py
```
Connect the CLI client to a graph.
```
$ connect graph localhost:50052
```

Get graph information.

$ show
$ dump symbol <symbol name> vsnode <ID>

Disconnect the CLI client from the graph.
```
$ disconnect
```

Graph CLI debug commands

Command	Comments
connect	Connect to a remote target
disconnect	Disconnect from a remote target
exit	Exit from the CLI
dump symbol <symbol name> vsnode <ID>	Dump graph symbol contents on a particular vsnode
show	Show graph information, including symbols, arguments, host functions, resources, profile
break section <section_id> state <state_id>	Add a breakpoint at section:state specified
delete section <section_id> state <state_id>	Delete a breakpoint at section:state specified
breakpoints	List all breakpoints
next	Go to the next graph FSM state
continue	Continue graph FSM execution
list_step	List all valid section:state combinations in the current run
?	Context sensitive help
enter / space	Auto-completion
up / down	Move to commands in history
CTRL-C	Delete and abort the current command

Sysadmin: Troubleshoot common errors

Certain errors can be addressed only by system administrators. If you’re a developer and encounter one of these errors, get in touch with your system administrator.

No RDUs enumerated on PCIe bus error

This error indicates that no RDU devices are available. When you see such an error:

Check /var/log/sambaflow/runtime/snd.log for the log message No RDUs discovered.
Run lspci -nnk -d 1e0d: to check if you have any RDUs configured in this system.
- If the system is healthy, you get a non-empty list.
- Otherwise, if there are problems, you get an empty list.
Power cycle the system. See Gracefully shutting down the DataScale SN30 rack and Power on process overview.
If a full power cycle does not fix the problem, contact SambaNova support.
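The lspci check in the steps above can be wrapped in a small script, for example as part of a health check. This sketch relies on the fact that lspci prints each device on an unindented line and indents the detail lines that the -k option adds. The function names, and the `run` parameter that allows substituting a different runner, are hypothetical.

```python
import subprocess


def count_pci_devices(lspci_output):
    """Count device entries in lspci output; detail lines added by -k are indented."""
    return sum(
        1
        for line in lspci_output.splitlines()
        if line.strip() and not line[0].isspace()
    )


def rdus_enumerated(run=subprocess.run):
    """Run the lspci command from the steps above and report whether it lists any device."""
    result = run(
        ["lspci", "-nnk", "-d", "1e0d:"],
        capture_output=True,
        text=True,
        check=False,
    )
    return count_pci_devices(result.stdout) > 0
```

If `rdus_enumerated()` returns False, the list is empty and the remaining steps above apply.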

SambaNova Hugepages errors

An operating system organizes memory into pages. By default, Linux pages are 4 KB in size. However, Linux administrators can define larger pages called Hugepages. These larger memory pages mean the operating system has to manage fewer pages and can access memory faster.

For SambaNova systems, hugepagesize is 1GB.

Both SambaNova Linux kernels have a certain number of Hugepages defined to support better performance.

Hugepages become available after the first reboot after installation.

Errors caused by Hugepages unavailable

If Hugepages are not available, you might see errors like the following:

Host memory error

You might see an error like the following:

[ERR][MEM][13727]: Could not allocate host region
[ERR][LIB][13727]: Unable to initialize host memory
[ERR][LIB][13727]: Unable to create ResourceDB

Hugepages error in SND log

2020-5-15 17:2:1 em-labhostg14: [ERR][MEM][28646]: Hugepages [1GB] are not configured, reboot may be needed after first installation

Solution for Hugepages errors

Ensure that SambaNova Runtime is installed and reboot the system to enable Hugepages.
To check if Hugepages are available, examine /proc/meminfo. Verify that several hugepages (more than 5) are defined and that hugepagesize is 1GB.
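The /proc/meminfo check can be automated. The following sketch parses the standard Linux fields HugePages_Total, HugePages_Free, and Hugepagesize (reported in kB) and applies the thresholds from the step above. The function names and default parameters are assumptions chosen for illustration.

```python
def read_hugepage_info(meminfo_text):
    """Extract HugePages_Total, HugePages_Free, and Hugepagesize (kB) from meminfo text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    info = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            # Values look like "16" or "1048576 kB"; keep the leading number.
            info[key] = int(value.split()[0])
    return info


def hugepages_ok(info, min_pages=6, page_kb=1024 * 1024):
    """Check for more than 5 hugepages, each 1 GB (1048576 kB) in size."""
    return (
        info.get("HugePages_Total", 0) >= min_pages
        and info.get("Hugepagesize") == page_kb
    )
```

To run the check on a host, read the contents of /proc/meminfo into a string, pass it to `read_hugepage_info`, and pass the result to `hugepages_ok`.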

SND troubleshooting

For each version of SambaNova Runtime, only the kernel version listed in the Runtime release notes is supported. If the host is booted into an unsupported Linux kernel, SND might fail.

Look for an RDU driver not found message in the log.
Run uname -r to see the current kernel version.
Boot into a supported combination of Runtime and OS.
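The kernel check in the steps above can be scripted as follows. This is a sketch only: the set of supported kernels is not defined on this page, so the caller has to supply it from the Runtime release notes. The function name and parameters are hypothetical.

```python
import platform


def kernel_is_supported(supported_kernels, running=None):
    """Return True if the running kernel release is in the supported collection.

    supported_kernels: kernel release strings taken from the Runtime release notes.
    running: override for testing; defaults to platform.release(),
             which returns the same string as `uname -r` on Linux.
    """
    running = running or platform.release()
    return running in supported_kernels
```

If the function returns False, boot into a supported combination of Runtime and OS.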

The SambaNova software relies on strict package dependencies. SambaNova strongly recommends that you do not perform a major upgrade or a kernel update to the DataScale SN30-H host module OS without first checking the supported OS, kernel, and package versions listed in this document and in the software release notes. Do not perform any major updates unless SambaNova directs you to do so.

Reset RDUs

Do not cancel or kill the reset program while it is in progress. Doing so could leave the host unable to connect to your DataScale system. If the host becomes unable to connect, you have to power cycle the system. For SN30 systems, see Gracefully shutting down the DataScale SN30 rack and Power on process overview. Similar information is in the documentation for earlier hardware.

If you see a tile-level error, the system might be able to recover from the error or you might have to take action.

Here are the steps for an SN30 system to check if the system is able to recover, and to perform a reset otherwise. The steps are very similar for earlier versions of the hardware.

Check if automatic recovery succeeded:
- If automatic recovery succeeded, you see:
  - No tile or RDU faults in snfadm -l fault output.
  - In kern.log: Automatic recovery of RDU SUCCEEDED or RDU 1: RDU reset sequence complete.
  - In snd.log: RDU :%d: RDU reset sequence complete.
- If automatic recovery failed, you see:
  - The log message RDU %d: Automatic RDU reset has failed. in kern.log.
  - Tile or RDU faults in snfadm -l fault output.
    
    If the automatic recovery fails, continue to the next step.
Run snconfig reset rdu --sanity-check-only to make sure that recovery is possible. If it is not, power cycle the system. For SN30 systems, see Gracefully shutting down the DataScale SN30 rack and Power on process overview. Similar information is in the documentation for earlier hardware.
Run snconfig reset rdu.

snconfig detects which RDUs to reset and resets them.
If the manual RDU reset fails, power cycle the system.
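The first step above, checking whether automatic recovery succeeded, can be partly automated by scanning kern.log or snd.log for the messages listed there. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the patterns are built from the message texts quoted in the steps, with %d replaced by a number.

```python
import re

# Patterns built from the log messages quoted in the steps above.
SUCCESS = re.compile(
    r"Automatic recovery of RDU SUCCEEDED|RDU reset sequence complete"
)
FAILURE = re.compile(r"RDU \d+: Automatic RDU reset has failed")


def recovery_status(log_lines):
    """Return 'succeeded', 'failed', or 'unknown' based on the last matching message."""
    status = "unknown"
    for line in log_lines:
        if FAILURE.search(line):
            status = "failed"
        elif SUCCESS.search(line):
            status = "succeeded"
    return status
```

The result reflects only the log messages. Also check the snfadm -l fault output for remaining tile or RDU faults before you conclude that recovery succeeded.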

It’s critical that you power cycle system components in the correct order. For SN30 systems, see Gracefully shutting down the DataScale SN30 rack and Power on process overview. Similar information is in the documentation for earlier hardware.