Monitoring the DataScale SN10-8R
The DataScale SN10-8R supports several standard methods to monitor and triage the system:
- View diagnostic information and logs.
- Set Simple Network Management Protocol (SNMP) alerts for third-party components.
- Use SambaNova Daemon (SND).
This section explains how to use the logs and tools.
1. Viewing xrdutool diagnostic information
You can use the xrdutool tool and its logs to diagnose a DataScale SN10-2 issue and to collect the information that SambaNova Support needs to triage it.
Use the tool to check the overall status of the DataScale SN10-2 RDU module and of the hosted RDUs and memory. Follow these steps to examine the power and fault status of the DataScale SN10-2 board.
- Log in to the BMC of the DataScale SN10-2 RDU module that is having problems:
  $ ssh root@<BMC_IP_Address>
  Password: <Enter root password>
- Run xrdutool status:
  root@xrdu:~# xrdutool status
The output provides a quick view into the state of the DataScale SN10-2 RDU module, the two RDUs, and the RDU controller. The output identifies whether any faults have been detected, and it shows the power state of the DataScale SN10-2 RDU module and of the RDU.
Power is on
2020-08-10 23:29:49,732 DEBUG RDU-C BuildDate: 7.14 1056 DesignVer: 28 BoardID: 28
2020-08-10 23:29:49,738 DEBUG --------------------------------------------------------
RDU-C BuildDate: 7.14 1056 DesignVer: 28 BoardID: 28
RDU-C Release Version: 1.5.5
XRDU_0: STATUS
------------------------------------------------------------------------------------------------------------
SYSTEM : chm1 chm0 stby ps pex0 pex1 sys p3v3 mss_op_state mss_log_level
1 1 1 1 1 1 1 1 4 1
------------------------------------------------------------------------------------------------------------
RDU_0 <RDU_ID> ON and no current faults detected
------------------------------------------------------------------------------------------------------------
ENABLES: vddo pvpp pvdd pvddq pvtt pavddh pavdd vddc
1 1 1 1 1 1 1 1
PWRGOOD: vddo pvpp0 pvpp1 pvdd pvddq0 pvddq1 pvtt0 pvtt1 pavddh pavdd vddc0 vddc1
1 1 1 1 1 1 1 1 1 1 1 1
EVENTS : vddo pvpp0 pvpp1 pvdd pvddq0 pvddq1 pvtt0 pvtt1 pavddh pavdd vddc0 vddc1
0 0 0 0 0 0 0 0 0 0 0 0
-VRHOT: pvddc0 pvddc1 pvddc0 pvddq1 -FAULT: pvddc0 pvddc1 pvddq0 pvddq1 vr_alert
0 0 0 0 0 0 0 0 0
-THERM: thub0 thub1 therm
0 0 0
------------------------------------------------------------------------------------------------------------
RDU_1 <RDU_ID> ON and no current faults detected
------------------------------------------------------------------------------------------------------------
ENABLES: vddo pvpp pvdd pvddq pvtt pavddh pavdd vddc
1 1 1 1 1 1 1 1
PWRGOOD: vddo pvpp0 pvpp1 pvdd pvddq0 pvddq1 pvtt0 pvtt1 pavddh pavdd vddc0 vddc1
1 1 1 1 1 1 1 1 1 1 1 1
EVENTS : vddo pvpp0 pvpp1 pvdd pvddq0 pvddq1 pvtt0 pvtt1 pavddh pavdd vddc0 vddc1
0 0 0 0 0 0 0 0 0 0 0 0
-VRHOT: pvddc0 pvddc1 pvddc0 pvddq1 -FAULT: pvddc0 pvddc1 pvddq0 pvddq1 vr_alert
0 0 0 0 0 0 0 0 1
-THERM: thub0 thub1 therm
0 0 0
------------------------------------------------------------------------------------------------------------
CLEAR EVENTS :: cleared event status register on RDU_0
CLEAR EVENTS :: cleared event status register on RDU_1
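When reviewing saved status output for several modules, a small filter can point out signals that differ from the healthy sample above. The following is a minimal sketch, not a SambaNova tool: check_rdu_status is a hypothetical helper name, and it assumes, based only on the sample output, that a healthy PWRGOOD value is 1 and a healthy EVENTS value is 0. It reads whitespace-separated tokens, so it does not depend on exact column alignment, and it does not interpret the VRHOT, FAULT, or THERM fields.

```shell
#!/bin/sh
# Hypothetical helper (not part of xrdutool): reads saved `xrdutool status`
# output on stdin and prints every PWRGOOD signal that is not 1 and every
# EVENTS signal that is not 0. Those "healthy" values are an assumption
# inferred from the sample output. Returns 1 if anything was reported.
check_rdu_status() {
    awk '
    # Begin collecting signal names for a PWRGOOD or EVENTS section.
    function start_section(s) { section = s; n = 0; i = 0; state = "names" }

    function handle(tok) {
        if (tok ~ /^RDU_[0-9]+$/) { rdu = tok; state = ""; return }
        if (tok == "PWRGOOD:")    { start_section("PWRGOOD"); return }
        if (tok == "EVENTS:")     { start_section("EVENTS"); return }
        if (tok == "EVENTS")      { pending = 1; return }
        if (pending) {
            # "EVENTS :" opens a section; "CLEAR EVENTS ::" does not.
            pending = 0
            if (tok == ":") { start_section("EVENTS"); return }
        }
        if (state == "names") {
            if (tok !~ /^[0-9]+$/) { name[n++] = tok; return }
            state = "values"   # first number ends the name list
        }
        if (state == "values") {
            if (tok !~ /^[0-9]+$/ || i >= n) { state = ""; return }
            want = (section == "PWRGOOD") ? 1 : 0
            if (tok + 0 != want) {
                print rdu, section, name[i] "=" tok
                bad = 1
            }
            i++
        }
    }

    { for (f = 1; f <= NF; f++) handle($f) }
    END { exit (bad ? 1 : 0) }
    '
}
```

For example, after saving the output to a file, running check_rdu_status < status.log prints one line per unexpected signal, such as RDU_0 PWRGOOD pvpp0=0, and prints nothing when all checked signals match the assumed healthy values.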
For details on diagnosing a DataScale SN10-2 RDU module’s BMC and collecting the required diagnostic and log material, see KB article #1024, “DataScale SN10-2 Diagnostic Collection,” in the SambaNova Support portal (https://support.sambanova.ai).
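If you want to keep a copy of the status output to attach to a support case, the login and status steps can be combined into one non-interactive command. This is a rough sketch under stated assumptions, not the procedure from the KB article: collect_xrdu_status and the SSH_CMD override are hypothetical names, and the sketch assumes that xrdutool is on the PATH of a non-interactive root SSH session on the BMC.

```shell
#!/bin/sh
# Hypothetical helper: save `xrdutool status` output from one BMC to a
# timestamped file and print the file name.
# Usage: collect_xrdu_status <BMC_IP_Address> [output_dir]
# SSH_CMD is an illustrative override hook; it defaults to plain ssh.
collect_xrdu_status() {
    bmc=$1
    out_dir=${2:-.}
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    out_file="$out_dir/xrdutool-status-$bmc-$stamp.log"

    mkdir -p "$out_dir" || return 1

    # Assumption: xrdutool can be run directly as a remote command.
    # ssh prompts for the root password on the terminal, not in the file.
    if ${SSH_CMD:-ssh} "root@$bmc" xrdutool status > "$out_file" 2>&1; then
        printf '%s\n' "$out_file"
    else
        echo "xrdutool status failed on $bmc; see $out_file" >&2
        return 1
    fi
}
```

Calling collect_xrdu_status <BMC_IP_Address> ./diag writes the output under ./diag and prints the path of the new file.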