Monitoring the DataScale SN10-8R

The DataScale SN10-8R supports several standard methods to monitor and triage the system:

  • View diagnostic information and logs.

  • Set Simple Network Management Protocol (SNMP) alerts for third-party components.

  • Use SambaNova Daemon (SND).

This section explains how to use the logs and tools.

1. Viewing xrdutool diagnostic information

You can use the xrdutool tool and logs to diagnose a DataScale SN10-2 issue and to collect the information that SambaNova Support needs to triage the issue.

Use the tool to check the overall status of the DataScale SN10-2 RDU module and of its hosted RDUs and memory. Follow these steps to examine the power and fault status that the tool reports for the DataScale SN10-2 board.

  1. Log in to the BMC of the DataScale SN10-2 RDU module that is having problems:

    $ ssh root@<BMC_IP_Address>
    Password: <Enter root password>

  2. Run xrdutool status:

    root@xrdu:~# xrdutool status

    The output provides a quick view into the state of the DataScale SN10-2 RDU module, its two RDUs, and the RDU controller. It identifies whether any faults have been detected, and it shows the power state of the DataScale SN10-2 RDU module and of each RDU.

    Power is on
    2020-08-10 23:29:49,732 DEBUG RDU-C BuildDate: 7.14 1056   DesignVer: 28   BoardID: 28
    2020-08-10 23:29:49,738 DEBUG --------------------------------------------------------
    RDU-C BuildDate: 7.14 1056   DesignVer: 28   BoardID: 28
    RDU-C Release Version: 1.5.5
    XRDU_0: STATUS
    ------------------------------------------------------------------------------------------------------------
    SYSTEM :  chm1    chm0    stby    ps      pex0    pex1    sys     p3v3        mss_op_state   mss_log_level
               1       1       1       1       1       1       1       1               4               1
    ------------------------------------------------------------------------------------------------------------
    RDU_0 <RDU_ID> ON and no current faults detected
    ------------------------------------------------------------------------------------------------------------
    ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
               1       1               1       1               1               1       1       1
    PWRGOOD:  vddo    pvpp0   pvpp1   pvdd    pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1
               1       1       1       1       1       1       1       1       1       1       1       1
    EVENTS :  vddo    pvpp0   pvpp1   pvdd    pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1
               0       0       0       0       0       0       0       0       0       0       0       0
     -VRHOT:  pvddc0  pvddc1  pvddc0  pvddq1          -FAULT: pvddc0  pvddc1  pvddq0  pvddq1  vr_alert
               0       0       0       0                       0       0       0       0       0
     -THERM:  thub0   thub1   therm
               0       0       0
    ------------------------------------------------------------------------------------------------------------
    RDU_1 <RDU_ID> ON and no current faults detected
    ------------------------------------------------------------------------------------------------------------
    ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
               1       1               1       1               1               1       1       1
    PWRGOOD:  vddo    pvpp0   pvpp1   pvdd    pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1
               1       1       1       1       1       1       1       1       1       1       1       1
    EVENTS :  vddo    pvpp0   pvpp1   pvdd    pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1
               0       0       0       0       0       0       0       0       0       0       0       0
     -VRHOT:  pvddc0  pvddc1  pvddc0  pvddq1          -FAULT: pvddc0  pvddc1  pvddq0  pvddq1  vr_alert
               0       0       0       0                       0       0       0       0       1
     -THERM:  thub0   thub1   therm
               0       0       0
    ------------------------------------------------------------------------------------------------------------
    CLEAR EVENTS :: cleared event status register on RDU_0
    CLEAR EVENTS :: cleared event status register on RDU_1
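
If you save the status output to a file, a short script can summarize it before you send it to SambaNova Support. The following Python sketch is an illustration only and is not part of xrdutool or any SambaNova tooling. The function name summarize_status is hypothetical. The sketch assumes that the output layout matches the sample above, and that a PWRGOOD value of 1 and an EVENTS value of 0 indicate a healthy rail, which is inferred from the sample rather than documented. It does not parse the ENABLES, -VRHOT, -FAULT, or -THERM rows.

```python
import re

# Summary line, for example: "RDU_0 <RDU_ID> ON and no current faults detected"
RDU_LINE = re.compile(r"^\s*(RDU_\d+)\s+(.*)$")
# Header row of signal names, for example: "PWRGOOD:  vddo  pvpp0 ..."
HEADER_ROW = re.compile(r"^\s*(PWRGOOD|EVENTS)\s*:\s*(.+)$")


def summarize_status(status_text):
    """Hypothetical helper: summarize per-RDU state from saved status output."""
    lines = status_text.splitlines()
    report = {}
    current = None
    for index, line in enumerate(lines):
        rdu = RDU_LINE.match(line)
        if rdu:
            # Start a new per-RDU record at each summary line.
            current = rdu.group(1)
            report[current] = {
                "no_faults_reported": "no current faults detected" in rdu.group(2),
                "pwrgood_not_1": [],
                "events_not_0": [],
            }
            continue
        header = HEADER_ROW.match(line)
        if not header or current is None or index + 1 >= len(lines):
            continue
        # The values for a header row are on the line that follows it.
        names = header.group(2).split()
        values = lines[index + 1].split()
        if len(names) != len(values):
            continue  # Layout differs from the sample; skip rather than guess.
        for name, value in zip(names, values):
            # Assumption: in the sample, healthy rails show PWRGOOD 1 and EVENTS 0.
            if header.group(1) == "PWRGOOD" and value != "1":
                report[current]["pwrgood_not_1"].append(name)
            elif header.group(1) == "EVENTS" and value != "0":
                report[current]["events_not_0"].append(name)
    return report
```

For example, after you copy the output into a file of your choice, pass the file contents to summarize_status and review any RDU entry whose no_faults_reported value is False or whose lists are not empty. Treat the result as a convenience only; the xrdutool output remains the authoritative record.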

For details on diagnosing a DataScale SN10-2 RDU module’s BMC and collecting the required diagnostic and log material, see KB article #1024, “DataScale SN10-2 Diagnostic Collection,” in the SambaNova Support portal (https://support.sambanova.ai).