Troubleshoot SambaNova Runtime
Developers and administrators sometimes run into problems in their SambaNova environment. This page helps you find the cause and resolve the problem.
The page includes:
-
A table of frequently asked questions with links to answers.
-
Common errors and how to resolve them.
-
Troubleshooting tasks, such as resetting RDUs.
Frequently asked questions
Question | See |
---|---|
Where do I find logs? | Runtime logs are at |
How do I increase logging verbosity? | See Change log levels. |
What are the tools an administrator can use? | Tools include snconfig, SambaNova Fault Management (SNFADM), and Slurm. |
How can I programmatically interact with SambaNova Runtime? | The SNML API allows all users to retrieve information, and the SNML admin API supports administrative tasks. SambaNova Management Layer (SNML) discusses the two APIs and includes code examples. |
Where can I find SNFM logs? | Because SNFM is a part of SND, you can find SNFM log messages in |
When and how do I do an RDU reset? | The Runtime components perform automatic RDU recovery in many situations. However, at times, administrators have to Reset RDUs. |
How do I gracefully shut down and power up my rack? | Under rare circumstances, system administrators might have to power cycle the rack. See Gracefully shutting down the DataScale SN30 rack and Power on process overview in the DataScale system administration document. |
Troubleshoot your installation
By default, the SambaNova Runtime package is included in your environment. SambaNova periodically releases new versions. Install the most recent version of the complete SambaFlow package (which includes Runtime) when the package becomes available so that your OS and other Runtime components remain compatible with the SambaNova hardware.
When there are problems with your installation, you might see an error like the following.
Nov 1 17:28:54 host1 python3[3654]: File "__init__.py", line 87, in init pysnstat.__init__
Nov 1 17:28:54 host1 python3[3654]: File "pysnstat.py", line 42, in pysnstat.pysnstat.get_platform_module
Nov 1 17:28:54 host1 python3[3654]: RuntimeError: SNStat: unknown platform Unknown Platform
Administrators can follow these steps to resolve the issue:
You need superuser privileges for some of the commands.
-
Make sure the package is installed.
-
For an Ubuntu system, run these commands:
apt list --installed | grep sambanova-runtime
apt list --installed | grep sambaflow
Verify that the command returns something like the following:
sambanova-runtime-diag/now 1.14.4-2210171355 amd64 [installed,upgradable to: 1.14.4-2211011246]
sambanova-runtime-mlnx/now 1.14.4-2210171355 amd64 [installed,upgradable to: 1.14.4-2211011246]
sambanova-runtime-scripts/focal,now 1.14.2-2209130703 all [installed,upgradable to: 1.14.4-2211011246]
sambanova-runtime/now 1.14.4-2210171355 amd64 [installed,upgradable to: 1.14.4-2211011246]
-
For RHEL, run this command:
rpm -qi sambaflow
Verify that the command returns something like the following:
Name : sambaflow
Version : 1.14.3
Release : 8.el8
Architecture: x86_64
Install Date: Mon 14 Nov 2022 05:38:50 PM EST
Group : SambaFlow
Size : 0
License : (c) SambaNova Systems
Signature : (none)
Source RPM : sambaflow-1.14.3-8.el8.src.rpm
Build Date : Sat 05 Nov 2022 02:19:06 PM EDT
Build Host : sc-c15_DOCKER
Relocations : /opt/sambaflow
Vendor : SambaNova
Summary : SambaNova SambaFlow
-
-
Make sure your OS/kernel version is supported. See the Runtime Release Notes.
-
Make sure the system is booted into an appropriate kernel version/OS pair.
-
Check SND status. You should see something like this:
root@sn-host15:~# systemctl status snd
● snd.service - SN Devices Service
   Loaded: loaded (/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/snd.service.d
           └─override.conf
   Active: active (running) since Mon 2022-10-17 15:12:14 PDT; 2 weeks 2 days ago
 Main PID: 86552 (snd)
    Tasks: 9 (limit: 629145)
   Memory: 54.2M
   CGroup: /system.slice/snd.service
           └─86552 /opt/sambaflow/bin/snd
-
Make sure the RDU driver is loaded:
lsmod | grep rdu
For the RDU driver, you might see rdu_sn20, rdu_sn30, or similar for newer hardware architectures.
If SND is running but the driver is not loaded, the wrong runtime package might be installed, and SND might have failed over to standalone mode.
-
Check whether the driver failed to load by looking for messages like the following in snd.log:
Oct 31 21:18:56 host1 [root][snd][1823357]: [NOTICE][SND][meta hostname=host1 tid=1823357] No RDU found.
Oct 31 21:18:56 host1 [root][snd][1823357]: [NOTICE][SNML_SERVER][meta hostname=host1 tid=1823357] SND SNML server started at 127.0.0.1:50053
Oct 31 21:18:57 host1 [root][snd][1823357]: [ERR][SNFM][meta hostname=host1 tid=1823357] Sambaflow package platform (DataScale SN10) does not match physical RDU platform
Oct 31 21:18:57 host1 [root][snd][1823357]: [ERR][SNFM][meta hostname=host1 tid=1823357] Platform name unknown
For example, the message Sambaflow package platform (DataScale SN10) does not match physical RDU platform means that the package you have installed (sn10) does not match the physical platform you’re on.
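The log check in the last step can be scripted. The following is a minimal sketch that searches log text for the platform mismatch message shown above and extracts the platform name of the installed package. The function name `find_platform_mismatch` is hypothetical, and the pattern is derived only from the sample message on this page, so treat it as an assumption about the message format.

```python
import re

# Pattern derived from the sample snd.log message on this page (an assumption,
# not a documented log format).
MISMATCH = re.compile(
    r"Sambaflow package platform \((?P<platform>[^)]+)\) "
    r"does not match physical RDU platform"
)


def find_platform_mismatch(log_text):
    """Return the package platform from the first mismatch message, or None."""
    match = MISMATCH.search(log_text)
    return match.group("platform") if match else None


# Sample line copied from the snd.log excerpt above.
sample = (
    "Oct 31 21:18:57 host1 [root][snd][1823357]: [ERR][SNFM]"
    "[meta hostname=host1 tid=1823357] Sambaflow package platform "
    "(DataScale SN10) does not match physical RDU platform"
)
print(find_platform_mismatch(sample))
```

To use the sketch on a real system, read the contents of snd.log into a string and pass it to the function. A return value of None means that no mismatch message was found.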
Developer: Troubleshoot common errors
This section lists errors that developers might encounter and that they can address even if they don’t have administrator or superuser privileges. Errors are shown in the console and included in log files:
-
When you compile or run a model or perform other management tasks, most errors are shown on the console.
-
Errors are also added to one of the logs in /var/log/sambaflow/runtime.
PEF Version error
Error
[ERR][LIB]: sn_topology_get_version: PEF Version (Expected = [0.7.0], Actual = [0.7.2])
[ERR][LIB]: sn_create_session: PEF version mismatch. Abort...
Cause
PEF (Processor Executable Format) files are only forward compatible. If you see this message, the PEF version of the installed Runtime (actual) and the version of the compiled PEF (expected) don’t match.
When you install SambaFlow, you now install the complete software stack. In earlier versions of the software, it was possible to install components separately.
Solution
-
To identify the compatible PEF for the installed runtime, run the following commands on the host.
-
Red Hat
$ rpm -q --provides sambanova-runtime | grep PEF
PEF = 0.13.0 # This is the compatible PEF version for the installed sambanova-runtime
-
Ubuntu
$ apt show sambanova-runtime | grep -i pef
Provides: pef (= 0.13.0) # This is the compatible PEF version for the installed sambanova-runtime
-
-
If that version does not match the sambanova-runtime version, reinstall the whole package and compile and run your application again.
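When you triage many logs, it can help to pull the two version numbers out of the error line automatically. The following is a minimal sketch that parses the `PEF Version (Expected = [...], Actual = [...])` message shown above. The function name `parse_pef_versions` is hypothetical, and the pattern assumes that the message always has the shape of the sample.

```python
import re

# Pattern derived from the sample error line above (an assumption about format).
PEF_LINE = re.compile(
    r"PEF Version \(Expected = \[(?P<expected>[\d.]+)\], "
    r"Actual = \[(?P<actual>[\d.]+)\]\)"
)


def parse_pef_versions(line):
    """Return (expected, actual) version strings, or None if the line differs."""
    match = PEF_LINE.search(line)
    if match is None:
        return None
    return match.group("expected"), match.group("actual")


# Sample line copied from the error shown above.
line = (
    "[ERR][LIB]: sn_topology_get_version: "
    "PEF Version (Expected = [0.7.0], Actual = [0.7.2])"
)
versions = parse_pef_versions(line)
if versions is not None and versions[0] != versions[1]:
    print(f"PEF mismatch: expected {versions[0]}, actual {versions[1]}")
```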
Runtime errors
When you run your application, errors are shown as an exception. The /var/log/sambaflow/runtime/sn.log file shows a resource allocation error, and the Python runtime exception backtrace shows ResourceDB Alloc Failure.
This section lists common runtime errors.
Device ISA not supported error
Error
[ERR][DEV][22489]: Unable to allocate resources requested by graph. Reason: Bad address
[ERR][LIB][22489]: Resource Allocation failed: Device ISA not supported
[ERR][LIB][22489]: Unable to allocate requested rsc
Cause
The RDU architecture that the PEF was compiled for does not match the RDU architecture on the system.
Solution
This is a rare error that can occur if your model was compiled on a different version of the hardware.
Recompile the model on the current version of the hardware.
No Tiles Available error
Error
[ERR][DEV][5383]: Unable to allocate resources requested by graph. Reason: Not Enough Resources available
Requested Node(s): 1 {2H} Available RDU(s): 8 {0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1}
Requested Memory Per Node (B): {8256}
Total: Available Memory Blocks Per Node(GB):
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
383: {95, 95, 95, 95}
[ERR][LIB][5383]: Resource Allocation failed: No Tile(s) Available
[ERR][LIB][5383]: Unable to allocate requested rsc
Cause
This error indicates that there aren’t enough hardware resources to run your application.
Solution
Here’s what you can do:
-
If you know that others are also running models on this system, the problem might be that their models are using too many of the resources. Wait, and try again later.
-
Ask your system administrator to check if the system is healthy and to run sntilestat to see the status of the hardware.
No Device Memory Available error
Error
[ERR][DEV][7990]: Unable to allocate resources requested by graph. Reason: Not Enough Resources available
Requested Node(s): 1 {2H} Available RDU(s): 8 {0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf}
Requested Memory Per Node (B): {8256}
Total: Available Memory Blocks Per Node(GB):
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
0: {0, 0, 0, 0}
[ERR][LIB][7990]: Resource Allocation failed: No Device Memory Available
[ERR][LIB][7990]: Unable to allocate requested rsc
Cause
The device resource allocation failures are accompanied by a summary that displays:
-
The resources requested by the model
-
The available resources on the system
Information is available for nodes and memory:
-
Nodes: The error shows the number of nodes requested by the model and the corresponding orientation. Each available RDU on the system is listed with a bitmask of its available tiles. The 4 least significant bits represent the availability of the 4 tiles on the RDU. For example, a bitmask of 0x5 means that tiles 0 and 2 are available.
-
Memory: The requested memory per node is shown in bytes. Each line in the Available Memory section of the summary corresponds to one RDU. Total is followed by how the memory is partitioned and the memory available for that partition.
If you set rdu_log_level=255, the details of allocation options that Runtime tried are available in the kernel logs (use dmesg). See Change kernel log levels.
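The tile bitmask in the summary can be decoded with a few lines of code. The following is a minimal sketch that extracts the per-RDU masks from the `Available RDU(s)` line and lists the available tiles for each mask, following the rule described above (bit N set means tile N is available). The function names are hypothetical, and the parsing assumes the line format of the samples on this page.

```python
import re

TILES_PER_RDU = 4  # per the description above: 4 tiles per RDU


def available_tiles(mask):
    """Return the tile numbers whose bits are set in the 4 low bits of mask."""
    return [tile for tile in range(TILES_PER_RDU) if mask & (1 << tile)]


def parse_rdu_masks(summary_line):
    """Extract the per-RDU bitmasks from an 'Available RDU(s)' summary line."""
    match = re.search(r"Available RDU\(s\): \d+ \{([^}]*)\}", summary_line)
    if match is None:
        return []
    return [int(tok.strip(), 16) for tok in match.group(1).split(",") if tok.strip()]


# Sample line copied from the No Tiles Available error above.
line = (
    "Requested Node(s): 1 {2H} "
    "Available RDU(s): 8 {0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1}"
)
for rdu, mask in enumerate(parse_rdu_masks(line)):
    print(f"RDU {rdu}: mask {mask:#x}, available tiles {available_tiles(mask)}")
```

For the sample line, every RDU reports the mask 0x1, so only tile 0 is available on each RDU. A mask of 0xf, as in the No Device Memory Available example, means that all four tiles are available.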
Solution
Wait for a while and try your command again. How long you have to wait depends on the types of jobs that are customarily submitted in your environment. Ask the system administrator to determine if other jobs are using all system resources and to verify system health.
Symbol lookup failure error
Error
Log ID initialized to: [user][app][pid] at /var/log/sambaflow/runtime/sn.log
2020-5-17 19:36:57 <hostname>: [ERR][MAP][pid]: Unable to Find Key: xxx
2020-5-17 19:36:57 <hostname>: [ERR][SYM][pid]: Invalid symbol id: -1
2020-5-17 19:36:57 <hostname>: [ERR][LIB][pid]: KeyError: xxx
...
RuntimeError: Unable to set tensors property: Symbol Lookup Failure
Cause
A tensor KeyError occurs if a PEF is compiled with a certain symbol name and, at runtime, the application passes a different symbol name.
Solution
Check the samba logs to see which tensor name is expected. You can find those logs in out/<pef_name>/<pef_name>.samba.log and on the console.
Check the compiler logs to see which tensor name the compiler expects. The two names should match.
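Comparing names by eye is error-prone when symbol names are long. The following is a minimal sketch that extracts the missing key from the `Unable to Find Key` message and suggests the closest names from a list of expected symbol names. The function names and the `compiled_names` list are hypothetical: this page does not describe how to extract symbol names from the logs, so you would have to collect them yourself.

```python
import difflib
import re


def missing_key(log_text):
    """Return the key named in an 'Unable to Find Key' message, or None."""
    match = re.search(r"Unable to Find Key: (\S+)", log_text)
    return match.group(1) if match else None


def suggest_names(key, compiled_names):
    """Return up to three expected names that closely resemble the missing key."""
    return difflib.get_close_matches(key, compiled_names, n=3, cutoff=0.6)


# Illustrative values only: the key and the name list are made up.
log_line = "<hostname>: [ERR][MAP][pid]: Unable to Find Key: input_id"
compiled_names = ["input_ids", "labels"]

key = missing_key(log_line)
if key is not None:
    print(f"Missing key {key!r}; closest expected names: {suggest_names(key, compiled_names)}")
```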
Developer: Use Graph CLI to debug applications
In most cases, developers who use SambaFlow take advantage of Python debugging techniques when they see problems with their application. In rare situations, developers might use Graph CLI to explore. Graph CLI shows what’s happening close to the hardware.
How to run Graph CLI
-
Run your application with the environment variable SNCLI_SERVER.
SNCLI_SERVER="localhost:50052" python /path/to/app <args>
-
Launch a CLI client in another window.
$ /opt/sambaflow/diag/run_sncli.py
-
Connect the CLI client to a graph.
$ connect graph localhost:50052
-
Get graph information.
$ show
$ dump symbol <symbol name> vsnode <ID>
-
Disconnect the CLI client from the graph.
$ disconnect
Graph CLI debug commands
Command | Comments |
---|---|
connect | Connect to a remote target |
disconnect | Disconnect from a remote target |
exit | Exit from the CLI |
dump symbol <symbol name> vsnode <ID> | Dump graph symbol contents on a particular vsnode |
show | Show graph information, including symbols, arguments, host functions, resources, profile |
break section <section_id> state <state_id> | Add a breakpoint at the specified section:state |
delete section <section_id> state <state_id> | Delete a breakpoint at the specified section:state |
breakpoints | List all breakpoints |
next | Go to the next graph FSM state |
continue | Continue graph FSM execution |
list_step | List all valid section:state combinations in the current run |
? | Context sensitive help |
enter / space | Auto-completion |
up / down | Move to commands in history |
CTRL-C | Delete and abort the current command |
Sysadmin: Troubleshoot common errors
Certain errors can be addressed only by system administrators. If you’re a developer and encounter one of these errors, get in touch with your system administrator.
No RDUs enumerated on PCIe bus error
An error indicates that no RDU devices are available. When you see such an error:
-
Check /var/log/sambaflow/runtime/snd.log for the log message No RDUs discovered.
-
Run lspci -nnk -d 1e0d: to check if you have any RDUs configured in this system.
-
If the system is healthy, you get a non-empty list.
-
Otherwise, if there are problems, you get an empty list.
-
Power cycle the system. See Gracefully shutting down the DataScale SN30 rack and Power on process overview.
-
If a full power cycle does not fix the problem, contact SambaNova support.
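The lspci check above can be wrapped in a small script, for example for use in a health check. The following is a minimal sketch that runs the command from the steps above and counts the device entries in its output. The function names are hypothetical. The counting relies on the fact that `lspci -k` prints one unindented line per device, followed by indented detail lines.

```python
import subprocess


def count_rdus(lspci_output):
    """Count device entries: unindented, non-empty lines in lspci -nnk output."""
    return sum(
        1 for line in lspci_output.splitlines() if line and not line[0].isspace()
    )


def run_lspci():
    """Run the lspci command from the steps above and return its stdout."""
    result = subprocess.run(
        ["lspci", "-nnk", "-d", "1e0d:"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# An empty listing corresponds to the unhealthy case described above.
if count_rdus("") == 0:
    print("No RDUs enumerated on the PCIe bus")
```

On a host, call `count_rdus(run_lspci())`. A result of 0 corresponds to the empty list described in the steps above.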
SambaNova Hugepages errors
An operating system organizes memory into pages. By default, Linux pages are 4 KB in size. However, Linux administrators can define larger pages called Hugepages. With larger pages, the operating system has to manage fewer pages and can access memory faster.
For SambaNova systems, hugepagesize is 1GB.
Both SambaNova Linux kernels have a certain number of Hugepages defined to support better performance.
Hugepages become available after the first reboot after installation.
Errors caused by Hugepages unavailable
If Hugepages are not available, you might see errors like the following:
Host memory error
You might see an error like the following:
[ERR][MEM][13727]: Could not allocate host region
[ERR][LIB][13727]: Unable to initialize host memory
[ERR][LIB][13727]: Unable to create ResourceDB
Hugepages error in the SND log
2020-5-15 17:2:1 em-labhostg14: [ERR][MEM][28646]: Hugepages [1GB] are not configured, reboot may be needed after first installation
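To confirm whether Hugepages are configured, you can inspect the standard Linux file /proc/meminfo. The following is a minimal sketch that parses the Hugepages fields from meminfo text. The function names are hypothetical. Note that /proc/meminfo reports counts only for the default Hugepage size, so the sketch assumes that 1GB is the default size on the host.

```python
def parse_hugepages(meminfo_text):
    """Return HugePages_Total, HugePages_Free, and Hugepagesize (kB) as ints."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            # The value looks like "       0" or "    1048576 kB".
            fields[key] = int(value.split()[0])
    return fields


def read_meminfo():
    """Read /proc/meminfo on a Linux host."""
    with open("/proc/meminfo") as handle:
        return handle.read()


# Illustrative text in /proc/meminfo format; the values are made up.
sample = (
    "MemTotal:       1056000000 kB\n"
    "HugePages_Total:       0\n"
    "HugePages_Free:        0\n"
    "Hugepagesize:    1048576 kB\n"
)
info = parse_hugepages(sample)
if info.get("HugePages_Total", 0) == 0:
    print("No Hugepages configured; a reboot may be needed after first installation")
```

On a host, call `parse_hugepages(read_meminfo())`. A Hugepagesize of 1048576 kB corresponds to the 1GB page size mentioned above.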
SND troubleshooting
For each version of SambaNova Runtime we support only the kernel version that is listed in the Runtime release notes. If the host booted into an unsupported Linux kernel, SND might fail.
-
Look for an RDU driver not found message in the log.
-
Run uname -r to see the current kernel version.
-
Boot into a supported combination of Runtime and OS.
SambaNova strongly recommends that you do not perform a major upgrade or a kernel update to the DataScale SN30-H host module OS without referring to the supported OS, kernel, and package versions noted within this document and the software release notes, because the SambaNova software relies on some strict package dependencies. SambaNova recommends that you do not perform any major updates unless you are directed to do so by SambaNova.
Reset RDUs
Do not cancel or kill the reset program while it is in progress. Doing so could leave the host unable to connect to your DataScale system. If the host becomes unable to connect, you have to power cycle the system. For SN30 systems, see Gracefully shutting down the DataScale SN30 rack and Power on process overview. Similar information is in the documentation for earlier hardware.
If you see a tile-level error, the system might be able to recover from the error or you might have to take action.
Here are the steps for an SN30 system to check if the system is able to recover, and to perform a reset otherwise. The steps are very similar for earlier versions of the hardware.
-
Check if automatic recovery succeeded:
-
If automatic recovery succeeded, you see:
-
No tile or RDU faults in snfadm -l fault output.
-
In kern.log: Automatic recovery of RDU SUCCEEDED or RDU 1: RDU reset sequence complete.
-
In snd.log: RDU :%d: RDU reset sequence complete.
-
If automatic recovery failed, you see:
-
The log message RDU %d: Automatic RDU reset has failed. in kern.log.
-
Tile or RDU faults in snfadm -l fault output.
If the automatic recovery fails, continue to the next step.
-
Run snconfig reset rdu --sanity-check-only to make sure that recovery is possible. If it is not, power cycle the system. For SN30 systems, see Gracefully shutting down the DataScale SN30 rack and Power on process overview. Similar information is in the documentation for earlier hardware.
-
Run snconfig reset rdu. snconfig detects which RDUs to reset and resets them.
-
If the manual RDU reset fails, power cycle the system.
It’s critical that you power cycle system components in the correct order. For SN30 systems, see Gracefully shutting down the DataScale SN30 rack and Power on process overview. Similar information is in the documentation for earlier hardware.
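The first step of the procedure, checking kern.log for the outcome of automatic recovery, can be scripted. The following is a minimal sketch that scans log text for the success and failure messages listed above and reports the most recent outcome. The function name `recovery_status` is hypothetical, and the patterns assume that `%d` in the messages stands for an integer RDU number.

```python
import re

# Patterns derived from the kern.log messages listed in the steps above.
SUCCESS = re.compile(
    r"Automatic recovery of RDU SUCCEEDED|RDU \d+: RDU reset sequence complete"
)
FAILURE = re.compile(r"RDU \d+: Automatic RDU reset has failed")


def recovery_status(kern_log_text):
    """Return 'succeeded', 'failed', or 'unknown' based on the last matching line."""
    status = "unknown"
    for line in kern_log_text.splitlines():
        if FAILURE.search(line):
            status = "failed"
        elif SUCCESS.search(line):
            status = "succeeded"
    return status


# Illustrative excerpt: an earlier failure followed by a completed reset.
excerpt = (
    "kernel: RDU 1: Automatic RDU reset has failed.\n"
    "kernel: RDU 1: RDU reset sequence complete\n"
)
print(recovery_status(excerpt))
```

The sketch reports only what the log says. A result of failed or unknown means that you should also check the snfadm -l fault output and continue with the manual steps above.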