Monitor and debug the DataScale SN30 rack

The DataScale® SN30 rack supports standard methods to monitor and triage the system. This page describes tasks you can perform yourself, such as examining log files, and explains how to collect diagnostic information for SambaNova Support.

1. Overview of tools and logs

Several tools and logs can help you resolve problems. Here’s an overview:

Table 1. Monitoring and debugging tools

| Task | Tool | See |
|---|---|---|
| Check the status of the DataScale SN30-2 RDU module | xrdutool | View xrdutool diagnostics and logs |
| Configure SNMP alerts for third-party rack components | SNMP alerts | Set up SNMP alerts |
| Diagnose problems with logs | OS logs, BMC logs, compiler logs, application logs | Viewing system logs |
| Check and manage SND, view SND logs | SND (SambaNova Daemon) | SambaNova daemon (SND) diagnostics |
| Debug model compilation, running models, and third-party components | Misc. tools and logs | Debugging DataScale SN30 issues |

If you cannot resolve the issues yourself, create a support case and include diagnostic materials. See View SN30-H BMC diagnostic information and logs.

2. View xrdutool diagnostics and logs

Use xrdutool and its logs to diagnose a DataScale SN30-2 issue and to collect the information that SambaNova Support needs to triage it. The tool reports the status of the DataScale SN30-2 RDU module on which it is run.

Use the tool to check the overall status of the DataScale SN30-2 RDU module and of the hosted RDUs and memory. Follow these steps to examine the output on the power and fault status of the DataScale SN30-2 board:

  1. Log in to the BMC of the DataScale SN30-2 RDU module that is having problems:

    $ ssh root@<BMC_IP_Address>
    Password: <Enter root password>
  2. Run the xrdutool command:

    root@xrdu:~# xrdutool status
  3. Examine the output, which gives a quick view into the state of the DataScale SN30-2 RDU module along with two RDUs and the RDU controller. The output:

    • Shows whether any faults have been detected.

    • Shows the power state of the DataScale SN30-2 RDU module and of the RDU.

Here’s an example:

Power is on
RDU-C Release Version: 4.4.0
RDU-C BuildDate: 10.17 1654   DesignVer: 69   BoardID: 60
XRDU_0: STATUS
--------------------------------------------------------
SYSTEM :  rdu3    rdu2    rdu1    rdu0    stby    ps      pex0    pex1    sys     p3v3        mss_op_state   mss_log_level
           1       1       1       1       1       1       1       1       1       1               4               1
--------------------------------------------------------
RDU_0/D_0  0935a00001f1d6a4 102007b367359895     RDU_0/D_1  09a6c000012eda24 605007b367359895     ON. Please verify rdu_pwr_status[0] value to determine faults
--------------------------------------------------------
ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
           1       1               1       1               1               1       1       1
PWRGOOD:  vddo    pvpp0   pvpp1   pvdd0    pvdd1  pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1   vddc2   vddc3
           1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
--------------------------------------------------------
RDU_1/D_0  09e9a00001a5dc64 502807b367359895     RDU_1/D_1  08e8200000bedd24 107007b367359895     ON. Please verify rdu_pwr_status[1] value to determine faults
--------------------------------------------------------
ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
           1       1               1       1               1               1       1       1
PWRGOOD:  vddo    pvpp0   pvpp1   pvdd0    pvdd1  pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1   vddc2   vddc3
           1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
--------------------------------------------------------
PEX_0:   fpga_p0v8_pex_pgd2   pg_p1v25_pex   pg_p1v8_pex_pll   fpga_pg_p1v8_pex
               1               1               1               1
--------------------------------------------------------
PEX_1:   fpga_p0v8_pex_pgd2   pg_p1v25_pex   pg_p1v8_pex_pll   fpga_pg_p1v8_pex
               1               1               1               1
--------------------------------------------------------
rduc_pwr_status[0] = 0x7fff
rduc_pwr_status[1] = 0x7fff
pex_pwr_status[0] = 0x7f
pex_pwr_status[1] = 0x7f
power_status_aggregate = 0x7fff
Board Type: 3
NUM_RDUS: 2
NUM_DIE_PER_RDU: 2
NUM_DIES: 4
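
When you compare output across several modules, it can help to extract the status registers programmatically. The following is a minimal Python sketch, not a SambaNova tool. It assumes that the register values in the sample output above (0x7fff and 0x7f) represent a healthy module, which this page does not state explicitly, and the function names are hypothetical.

```python
import re

# Values copied from the sample output above. Treating them as the
# "healthy" baseline is an assumption, not a documented contract.
BASELINE = {
    "rduc_pwr_status[0]": 0x7FFF,
    "rduc_pwr_status[1]": 0x7FFF,
    "pex_pwr_status[0]": 0x7F,
    "pex_pwr_status[1]": 0x7F,
    "power_status_aggregate": 0x7FFF,
}

# Matches lines such as "rduc_pwr_status[0] = 0x7fff".
STATUS_LINE = re.compile(
    r"^\s*([A-Za-z_]+(?:\[\d+\])?)\s*=\s*(0x[0-9a-fA-F]+)\s*$"
)


def parse_status_registers(text):
    """Return {name: int} for every 'name = 0x...' line in the output."""
    registers = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            registers[match.group(1)] = int(match.group(2), 16)
    return registers


def find_deviations(registers, baseline=BASELINE):
    """List (name, observed, expected) for registers that are missing
    or differ from the assumed baseline."""
    deviations = []
    for name, expected in baseline.items():
        observed = registers.get(name)
        if observed != expected:
            deviations.append((name, observed, expected))
    return deviations
```

To use the sketch, save the output of xrdutool status to a file, pass the file contents to parse_status_registers, and review anything that find_deviations returns. A deviation is only a prompt for closer inspection; the meaning of individual bits is not covered here.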

In addition to collecting diagnostic information from the SN30-2 RDU module directly, you can get health status of all the SN30-2 RDU modules in the SN30-8 node by using the SambaNova Fault Management (SNFM) utility that comes pre-installed on the host. See the SambaNova Fault Management (SNFM) User Guide in the SambaNova Runtime documentation in the SambaNova documentation portal (https://docs.sambanova.ai).

For details on diagnosing a DataScale SN30-2 RDU module’s BMC and on collecting the required diagnostic and log material, see KB article #1024, "DataScale SN30-2 Diagnostic Collection", in the SambaNova Support portal.

3. Set up SNMP alerts

To configure SNMP alerts for non-SambaNova components in the DataScale SN30 rack, see the vendor-specific documentation.

4. Viewing system logs

You can use the following log files to identify and resolve issues with the system or an application:

  • OS logs

  • BMC logs

  • SambaNova compiler logs

  • Application logs

4.1. OS logs

SambaNova does not alter the logs or log directories for Red Hat Enterprise Linux or Ubuntu. Most logs are in the /var/log/ directory, and you can use standard log tools such as journalctl to read the system journal.
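
If you query the journal repeatedly, a small helper can keep the invocations consistent. The sketch below only builds a journalctl command line from standard journalctl options (-k, -u, --since, -p, --no-pager). The helper names are hypothetical, and whether the snd unit writes anything useful to the journal is an assumption.

```python
import shlex


def journalctl_command(unit=None, since=None, priority=None, kernel=False):
    """Build a journalctl argument list from standard options."""
    cmd = ["journalctl", "--no-pager"]
    if kernel:
        cmd.append("-k")            # kernel messages only
    if unit:
        cmd += ["-u", unit]         # restrict to one systemd unit
    if since:
        cmd += ["--since", since]   # for example "1 hour ago"
    if priority:
        cmd += ["-p", priority]     # for example "err"
    return cmd


def format_command(cmd):
    """Render the argument list as a copy-and-paste shell command."""
    return shlex.join(cmd)
```

For example, format_command(journalctl_command(unit="snd", since="1 hour ago", priority="err")) produces a command that you can run with sudo on the host.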

4.3. SambaNova compiler logs

Additional compiler logs are written to the output directory that the user specified when the models were compiled. These logs are fairly low level, and SambaNova Support requests them to troubleshoot issues. For details, see Collect diagnostic materials for SambaNova Support.

You can use different compiler log verbosity settings to debug issues. See the SambaFlow Runtime document for details.

4.4. Runtime logs

The following log files related to SambaNova are in the /var/log/sambaflow/runtime/ directory:

sn.log

Logs related to SambaNova graph operations. Events received by the graph process and graph-specific events (including errors) that are not logged to snd.log.

snd.log

SambaNova daemon (SND) system logs. Summary of RDU resources and hardware error events.

Additional log events such as kernel logs (from the RDU driver module) go to dmesg(1).
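
A quick first pass over these files is to pull out the lines that mention a problem. The following Python sketch does that. The log line format is not documented on this page, so the keyword pattern (error, fault, fail) is an assumption, and the function names are hypothetical.

```python
import re
from pathlib import Path

LOG_DIR = Path("/var/log/sambaflow/runtime")  # directory named in this section
# Assumed keywords; adjust them to what the logs actually contain.
KEYWORDS = re.compile(r"error|fault|fail", re.IGNORECASE)


def matching_lines(lines, pattern=KEYWORDS):
    """Yield (line_number, line) for each line that matches the pattern."""
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


def scan_log(name, log_dir=LOG_DIR):
    """Scan one log file, for example 'snd.log', and return the matches."""
    path = Path(log_dir) / name
    with path.open(errors="replace") as handle:
        return list(matching_lines(handle))
```

Calling scan_log("snd.log") or scan_log("sn.log") on the host returns the matching lines with their line numbers, which you can then read in context. Reading the files might require elevated privileges.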

You can use different log verbosity settings to get more logging details for the SambaFlow™ Runtime and other SambaFlow components. See "Changing Runtime Log Levels" in the SambaNova Runtime Guide.

5. SambaNova daemon (SND) diagnostics

The SambaNova daemon (SND) runs on the DataScale SN30-H host module and manages several critical pieces of the SambaNova operation. The SND is responsible for:

  • Loading and unloading the RDU drivers

  • Initializing RDU system resources

  • Managing hardware faults for the RDU system

  • Enabling the debugging of the RDU system’s hardware resources

The SND is required to run graphs and models because:

  • The SND handles the RDU drivers and the initialization of RDU resources.

  • The SND is aware of issues with RDU resources and can avoid problematic resources.

The SND starts automatically:

  • When the DataScale SN30-H OS boots. The SND then starts the discovery and initialization of the RDUs, which is why it is important to power on the DataScale SN30-2 RDU modules before powering on the SN30-H host module.

  • When the SambaFlow package is installed. In this case, the SND waits a few minutes after the installation for the RDU system discovery and initialization processes to complete.

5.1. Check SND status

To check the status of the SND, run the systemctl status snd command. Below is sample output showing what the command might return:

$ sudo systemctl status snd
● snd.service - SN Devices Service
     Loaded: loaded (/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/snd.service.d
             └─override.conf
     Active: active (running) since Wed 2022-10-19 07:10:10 PDT; 3h 24min ago
   Main PID: 5263 (snd)
      Tasks: 10 (limit: 629145)
     Memory: 164.9M
     CGroup: /system.slice/snd.service
             └─5263 /opt/sambaflow/bin/snd
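
For scripted health checks, you can extract the state from the Active: line of this output. The Python sketch below parses text in the format shown above; the function name is hypothetical. If you need only the state, the standard systemctl is-active snd command is a simpler alternative.

```python
import re

# Matches, for example, "Active: active (running) since ...".
ACTIVE_LINE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)", re.MULTILINE)


def parse_active_state(status_text):
    """Return (state, sub_state) from `systemctl status` output,
    or None if no Active: line is found."""
    match = ACTIVE_LINE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With the sample output above, the function returns the pair ("active", "running"). Any other result suggests that the SND needs attention before you run graphs or models.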

5.2. Start, stop, and restart SND

You can start, stop, and restart the SND with the following commands:

To start the SND:

$ sudo systemctl start snd

To stop the SND:

$ sudo systemctl stop snd

To restart the SND:

$ sudo systemctl restart snd
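
After a start or restart, the SND needs time to discover and initialize the RDUs, so a script should wait before it uses the system. The following Python sketch polls systemctl is-active until the unit reports active. The 300-second timeout and 5-second interval are assumptions, and the function names are hypothetical. An active unit does not by itself prove that RDU initialization has finished, so also check snd.log.

```python
import subprocess
import time


def is_active(unit="snd", run=subprocess.run):
    """Return True if `systemctl is-active` reports the unit as active."""
    result = run(["systemctl", "is-active", unit],
                 capture_output=True, text=True)
    return result.stdout.strip() == "active"


def wait_until_active(unit="snd", timeout=300.0, interval=5.0,
                      check=is_active, sleep=time.sleep):
    """Poll until the unit is active or the timeout elapses.

    Returns True if the unit became active, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check(unit):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A typical use is to run sudo systemctl restart snd and then call wait_until_active() before launching a model. The check and sleep parameters exist so that the polling logic can be exercised without a real systemd.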

5.3. Use SND for debugging

The SND CLI provides physical visibility into the entire DataScale SN30-8 system. This allows complete access to the RDU system for debugging, triage, and validation efforts.

The SND also responds to error events that occur on the RDU and on the entire DataScale SN30-2 RDU module.

All logs from the SND are written to /var/log/sambaflow/runtime/snd.log. This log provides a summary of the RDU resources available to the system and includes any hardware error events that occur. The information is useful for diagnosing and resolving hardware issues.

6. Debugging DataScale SN30 issues

Troubleshooting might require that you debug issues with the following DataScale SN30 rack components:

  • Compilation of models

  • Running of models

  • Third-party components

7. Debug model compilation

For problems that occur while compiling models, run the following command and examine the logs that are generated in the user-specified output directory:

$ python <model_script.py> compile --output-folder=<output_directory>
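
When you reproduce a compilation failure for a support case, it is useful to keep the console output together with the generated logs. The Python sketch below wraps the compile command shown above and saves combined stdout and stderr in the output directory. The file name compile_console.log and the function names are hypothetical; only the compile subcommand and the --output-folder option come from this page.

```python
import subprocess
import sys
from pathlib import Path


def compile_command(model_script, output_dir, python=sys.executable):
    """Build the compile invocation shown above as an argument list."""
    return [python, str(model_script), "compile",
            f"--output-folder={output_dir}"]


def run_compile(model_script, output_dir, console_log="compile_console.log"):
    """Run the compile step and save stdout and stderr to a file
    in the output directory. Returns (return_code, log_path)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    log_path = out / console_log  # hypothetical file name
    with log_path.open("w") as handle:
        result = subprocess.run(compile_command(model_script, out),
                                stdout=handle, stderr=subprocess.STDOUT)
    return result.returncode, log_path
```

A nonzero return code together with the saved console log gives you a single directory to attach to the support case.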

You can set different levels of logging verbosity when you compile a model. See Collect diagnostic materials for SambaNova Support for best practices when creating a support case.

7.1. Debug running models

For problems that occur while running models, use these resources:

  • The /var/log/sambaflow/runtime/ log files

    These logs provide an initial glance into an issue that is occurring while running a model. If a problem does occur and is reproducible, enable more logging verbosity for SambaFlow Runtime. See the "Changing Runtime Log Levels" section of the SambaNova Runtime Guide for details.

  • The SambaNova Fault Management (SNFM) tool

    The SNFM tool provides a framework to:

      • Monitor, log, and clear various faults associated with a DataScale SN30-2 RDU module

      • Provide corrective actions to recover from these faults

    This capability is built into the SambaNova daemon (SND) and installed as part of SambaFlow. See "SambaNova Fault Management (SNFM) User Guide" in the SambaNova Runtime Guide for details.

7.2. Debug third-party components

For operational issues with the third-party components in the DataScale SN30 rack, see the vendor-specific documentation. For issues that require additional support or for questions related to troubleshooting, open a support case through SambaNova Support. See KB article #1017, "SambaNova Systems Support Best Practices," at https://support.sambanova.ai.

Do not open a case directly with the product vendor.

8. Collect diagnostic materials for SambaNova Support

When you open a support case, provide details on the issue that occurred, along with initial diagnostic materials. To collect diagnostic materials, see the following KB articles in the SambaNova Support portal:

  • DataScale SN30-2 Diagnostic Collection: KB article #1024

  • DataScale SN30-H BMC Diagnostic Collection: KB article #1039

  • DataScale SN30-H (Red Hat Enterprise Linux) Diagnostic Collection: KB article #1039

  • DataScale SN30-H (Ubuntu) Diagnostic Collection: KB article #1039

  • Ethernet Data Switch Diagnostic Collection: KB article #1053

  • Access Switch Diagnostic Collection: KB article #1053

  • Serial Console Server Diagnostic Collection: KB article #1121

  • PDU Diagnostic Collection: KB article #1120

9. View SN30-H BMC diagnostic information and logs

To quickly identify a system’s status and view diagnostic information and logs for the DataScale SN30-H BMC, follow these steps:

  1. Log in to the BMC’s Web UI and view the BMC dashboard.

    [Figure: Diagnostic information]

  2. For details on logs and pending events/deassertions, click the More info link in each box.

  3. To find more logs and reports, click Logs & Reports in the left pane and select a log.

    [Figure: Logs & Reports item]

See KB article #1039, “Diagnostic Data Collection Tool (samba_diag),” in the SambaNova Support portal (https://support.sambanova.ai) for details on:

  • Diagnosing a DataScale SN30-H host module’s BMC

  • Diagnosing the DataScale SN30-H host module in general

  • Collecting the required diagnostic materials and logs