SambaNova Fault Management (SNFM)
The SambaNova Fault Management (SNFM) framework supports reporting, diagnosing, and analyzing the system error and fault events associated with a DataScale system. The fault management framework can:
- Take automatic recovery action for certain component failures.
- Advise corrective action to recover from faults.
For example:
- RDUs might get reset automatically in certain situations.
- You might see "check component suspect list" or "reseat DIMM" if you run snfadm -l fault and specify the verbose option. See Access fault information.
All SNFM dependencies are installed as part of the sambaflow package. SNFM is part of the SND (SambaNova Daemon) service.
SNFM focuses on events related to the DataScale modules in a DataScale rack and to some host components, such as PCIe connectivity from the host to an XRDU. Covered components include:
- Reconfigurable Dataflow Unit (RDU) chips
- Tiles within an RDU
- Local PCIe connections between RDUs
- RDU device memory
See the following blog articles on our public website for some background:
How to use SNFADM
The user interface to SNFM is the snfadm tool.
- package: sambaflow
- location: /opt/sambaflow/bin/snfadm
- format: Python 3.7 executable
To access SNFM programmatically, you can use the SambaNova Management Layer (SNML).
Command-line options
To see all command-line options for snfadm, run:
$ /opt/sambaflow/bin/snfadm --help
The command returns all arguments.
Required SNFADM arguments
Only -l is required. Use -l to list the different types of events, as discussed in Access inventory information, Access error information, and Access fault information.
Important optional SNFADM arguments
To see a full list of arguments, call /opt/sambaflow/bin/snfadm --help. Here are some important arguments:
- The -a/--all-faults argument includes cleared faults in the output.
- The -v/--verbose argument displays the long-format output of an error or fault entry.
- The --csv argument returns the output in CSV format, which is easy to import into a spreadsheet or feed to an automation script.
- The -u/--uuid argument takes the UUID of a single fault or error and shows information about only that entry.
- The -c/--clear-fault argument marks a fault as cleared. Use -c/--clear-fault together with the -u/--uuid argument to mark a specific fault as cleared. Marking a fault as cleared does not necessarily fix the problem. Other actions are usually necessary to fix the underlying cause.
- The -ca/--clear-all argument clears all faults at once.
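As an illustration of how these arguments combine, here is a minimal Python sketch that assembles read-only snfadm -l queries for use from a script. The function name build_snfadm_list_command and its parameters are hypothetical helpers, not part of SNFM; only the flags listed above are used, and the sketch assumes they can be combined freely.

```python
from typing import List, Optional

SNFADM = "/opt/sambaflow/bin/snfadm"
LIST_TYPES = ("inventory", "error", "fault")


def build_snfadm_list_command(
    list_type: str,
    verbose: bool = False,
    csv_output: bool = False,
    all_faults: bool = False,
    uuid: Optional[str] = None,
) -> List[str]:
    """Return an argv list for a read-only `snfadm -l <type>` query.

    Hypothetical helper: it only assembles flags documented above and
    does not run anything or clear any fault.
    """
    if list_type not in LIST_TYPES:
        raise ValueError(f"unknown list type: {list_type!r}")
    cmd = [SNFADM, "-l", list_type]
    if all_faults:
        cmd.append("--all-faults")  # include cleared faults
    if verbose:
        cmd.append("--verbose")     # long-format output
    if csv_output:
        cmd.append("--csv")         # CSV output for automation
    if uuid is not None:
        cmd += ["--uuid", uuid]     # restrict to one fault or error
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # on machines without snfadm installed.
    print(" ".join(build_snfadm_list_command("fault", verbose=True)))
```

On a DataScale host, the resulting list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) to capture the tool's output.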
SNFADM output
You have several choices for SNFADM output:
- The default output format is short (tabular) format. Short format shows the contents of the inventory, error, and fault events. For the fault events, only active faults are shown by default.
- If you specify -v/--verbose, the display includes details for each event.
- If you specify --csv, you get output in CSV format, which is easy to import into a spreadsheet or feed to an automation script.
Naming of DataScale components in SNFM output
To read SNFADM output, you need to know that each component name is a logical representation of the system topology. Here’s an example of a DIMM module name:
/NODE/XRDU_0/RDU_1/DDRCH_1/DIMM_A0
Each level of the hierarchy represents a component class with an instance number, and the levels are delimited by /.
- The name always starts with a forward slash delimiter (/).
- The host is the root of the hierarchy.
- The last-level component is identified by the full name string.
- A component class name and its instance number are written as <component_class_name>_<instance_number>, for example, XRDU_3 or DIMM_J1.
Here are the component class representations:
Component class | Description
NODE | The SN10-H or SN30-H host
XRDU | The SN10-2 or SN30-2 module. Each module contains 2 RDUs.
RDU | Reconfigurable Dataflow Unit within an XRDU
TILE | Tiles within an RDU
DDRCH | RDU device memory channel
DIMM | Memory module of a DDRCH
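The naming rules above can be sketched in a few lines of Python. The functions parse_component_name and parent_component are hypothetical helpers written for this illustration, not part of SNFM. The sketch assumes that the class name and instance are separated by the first underscore in each level and that a level without an underscore (such as NODE) has no instance.

```python
from typing import List, Optional, Tuple


def parse_component_name(name: str) -> List[Tuple[str, Optional[str]]]:
    """Split a component name into (class, instance) pairs, root first."""
    if not name.startswith("/"):
        raise ValueError("component names start with '/': " + repr(name))
    levels = []
    for segment in name.strip("/").split("/"):
        # Assumption: the first underscore separates class and instance.
        cls, sep, instance = segment.partition("_")
        levels.append((cls, instance if sep else None))
    return levels


def parent_component(name: str) -> Optional[str]:
    """Return the name of the enclosing component, or None for the root."""
    parent = name.rstrip("/").rsplit("/", 1)[0]
    return parent or None


if __name__ == "__main__":
    dimm = "/NODE/XRDU_0/RDU_1/DDRCH_1/DIMM_A0"
    print(parse_component_name(dimm))
    # [('NODE', None), ('XRDU', '0'), ('RDU', '1'), ('DDRCH', '1'), ('DIMM', 'A0')]
    print(parent_component(dimm))
    # /NODE/XRDU_0/RDU_1/DDRCH_1
```

Keeping the instance as a string rather than an integer accommodates names such as DIMM_A0, where the instance is not purely numeric.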
Access inventory information
You access inventory information by calling the following command:
$ /opt/sambaflow/bin/snfadm -l inventory
The output includes information about each component that is expected to be present on a system, based on that system’s configuration (e.g. SN10-8).
The output includes this information:
- Physical State: Present or Absent, indicating whether the component was found on the system during a scan.
- Functional State: Online, Degraded, Offline, Faulted, or NA.
Here’s a partial inventory list of a system where an RDU is in Degraded state because one tile is faulty. The RDU is functional but with reduced capacity. The fault information shows the tile fault details.
Here are the first few lines of sample output:
Platform Name: DataScale SN20-8
Physical Inventory:
Component Name | Serial Number | Inventory State | Functional State
/NODE/XRDU_0/RDU_0 | 60604234E6D94715 | Present | Online
/NODE/XRDU_0/RDU_0/DDRCH_0/DIMM_C0 | 44FEFCED | Present | Online
/NODE/XRDU_0/RDU_0/DDRCH_0/DIMM_C1 | 44FEFC41 | Present | Online
/NODE/XRDU_0/RDU_0/DDRCH_1/DIMM_A0 | 44FEE128 | Present | Online
/NODE/XRDU_0/RDU_0/DDRCH_1/DIMM_A1 | 44FEEEB8 | Present | Online
/NODE/XRDU_0/RDU_0/DDRCH_2/DIMM_B0 | 44FEFDEA | Present | Online
[...]
/NODE/XRDU_0/RDU_0/PCIE_0 | N/A | Present | Online
/NODE/XRDU_0/RDU_0/PCIE_1 | N/A | Present | Online
/NODE/XRDU_0/RDU_0/PCIE_2 | N/A | Present | Online
/NODE/XRDU_0/RDU_0/PCIE_3 | N/A | Present | Online
/NODE/XRDU_0/RDU_0/TILE_0 | N/A | Present | Online
/NODE/XRDU_0/RDU_1 | 30484234E6D94715 | Present | Online
[...]
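A script can scan this short-format listing for components that need attention. The following Python sketch assumes that the table is pipe-delimited with a header row, as in the sample above. The helper names parse_inventory and needs_attention are hypothetical, and treating Absent, Degraded, Offline, and Faulted as the states worth flagging is an assumption of this example.

```python
from typing import Dict, List

# A few lines copied from the sample output above.
SAMPLE = """\
Platform Name: DataScale SN20-8
Physical Inventory:
Component Name | Serial Number | Inventory State | Functional State
/NODE/XRDU_0/RDU_0 | 60604234E6D94715 | Present | Online
/NODE/XRDU_0/RDU_0/PCIE_0 | N/A | Present | Online
/NODE/XRDU_0/RDU_0/TILE_0 | N/A | Present | Online
"""

# Assumption: these functional states are the ones worth flagging.
ATTENTION_STATES = {"Degraded", "Offline", "Faulted"}


def parse_inventory(text: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited rows into dicts keyed by the header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip the platform line, section titles, and [...]
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def needs_attention(rows: List[Dict[str, str]]) -> List[str]:
    """Return names of components that are absent or not fully functional."""
    return [
        row["Component Name"]
        for row in rows
        if row["Inventory State"] == "Absent"
        or row["Functional State"] in ATTENTION_STATES
    ]


if __name__ == "__main__":
    rows = parse_inventory(SAMPLE)
    print(len(rows), "components,", len(needs_attention(rows)), "need attention")
```

For production scripts, the --csv output is likely a more robust input than the tabular format, because it avoids depending on column alignment.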
Access error information
You access error information by calling the following command:
$ /opt/sambaflow/bin/snfadm -l error
The command returns details about the event source. Each error event is tagged with a UUID that the system uses to retrieve detailed information about the event and potentially associate the error event with a fault.
To see information about one error event, pass its UUID with the -u argument, like this:
$ /opt/sambaflow/bin/snfadm -l error -v -u df67f9cb-9c1d-11ed-9939-d5ca835010b5
The output for one error might look like this:
[
Original Timestamp 2023-01-24 11:32:43
Last Timestamp 2023-01-24 11:32:43
Error UUID df67f9cb-9c1d-11ed-9939-d5ca835010b5
Error Count 1
Error Type ETYPE_PCIE_LINK_HEALTH
Component Name /NODE/XRDU_0/RDU_0/PCIE_0
Description ETYPE_PCIE_LINK_HEALTH
Fault UUID 00000000-0000-0000-0000-000000000000
]
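To check from a script whether an error has been diagnosed as a fault, the verbose record can be parsed into a dictionary. The Python sketch below assumes that each line starts with one of the field labels shown in the sample, followed by whitespace and the value. The helper names parse_verbose_record and has_diagnosed_fault are hypothetical.

```python
from typing import Dict, Sequence

# Field labels taken from the sample record above.
ERROR_FIELDS = (
    "Original Timestamp", "Last Timestamp", "Error UUID", "Error Count",
    "Error Type", "Component Name", "Description", "Fault UUID",
)
NIL_UUID = "00000000-0000-0000-0000-000000000000"

SAMPLE = """\
[
Original Timestamp 2023-01-24 11:32:43
Last Timestamp 2023-01-24 11:32:43
Error UUID df67f9cb-9c1d-11ed-9939-d5ca835010b5
Error Count 1
Error Type ETYPE_PCIE_LINK_HEALTH
Component Name /NODE/XRDU_0/RDU_0/PCIE_0
Description ETYPE_PCIE_LINK_HEALTH
Fault UUID 00000000-0000-0000-0000-000000000000
]
"""


def parse_verbose_record(
    text: str, fields: Sequence[str] = ERROR_FIELDS
) -> Dict[str, str]:
    """Map each known field label to the value that follows it."""
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        for label in fields:
            rest = line[len(label):]
            # The label must be followed by whitespace and then the value.
            if line.startswith(label) and rest[:1].isspace():
                record[label] = rest.strip()
                break
    return record


def has_diagnosed_fault(record: Dict[str, str]) -> bool:
    """An all-zero Fault UUID means no fault is associated with the error."""
    return record.get("Fault UUID", NIL_UUID) != NIL_UUID


if __name__ == "__main__":
    record = parse_verbose_record(SAMPLE)
    print(record["Component Name"], has_diagnosed_fault(record))
    # /NODE/XRDU_0/RDU_0/PCIE_0 False
```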
Access fault information
Error events occurring on a system can lead to components being diagnosed as faulty. When that happens, the error event has a corresponding entry for the component in the fault log.
To list all active faults, run this command:
$ /opt/sambaflow/bin/snfadm -l fault
The output might look like this:
Timestamp | Fault UUID | Fault Type | Component Name | Status
2022-09-13 17:10:31 | 7fcce2d9-33a8-11ed-83c1 | FTYPE_PCI_LINK_HEALTH | /NODE/XRDU_2/RDU_1/PCIE_1 | Active
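Each fault is associated with a specific error UUID that can be used to retrieve information about the relevant events (shown in the Error UUID field of the output below). Conversely, if the Fault UUID of an error event is all zeros, no diagnosed fault is associated with that error. To see the details of one fault, pass its UUID with -u and add -v: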
$ /opt/sambaflow/bin/snfadm -l fault -v -u ff467ec5-9e98-11ed-a6b6-84160cc0f680
[
Timestamp 2023-01-27 15:19:07
Fault UUID ff467ec5-9e98-11ed-a6b6-84160cc0f680
Fault Type FTYPE_PCI_LINK_HEALTH
Component Name /NODE/XRDU_0/RDU_0/PCIE_4
Functional State Active
Severity SNFM_FATAL
Serial Number N/A
Description PCIE Link Fault [Suspect List: /NODE/XRDU_0/RDU_0/PCIE_4, /NODE/XRDU_0/QSFP/P7, /NODE/XRDU_1/SW0/PORT_6]
Recovery Action Check fault suspect list
Error UUID ff467ec4-9e98-11ed-a6b6-84160cc0f680
Cleared Timestamp N/A
]
The Functional State field is Active for all faults that require service actions to recover the system or component state to Online.
The Functional State field is Clear after the system has been serviced, that is, after the recovery actions have been performed for the faulty component.
You can use snfadm -c to clear a fault after servicing the faulty component.
Marking a fault as cleared does not fix the problem. Other actions are usually necessary to fix the underlying cause.
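When the recovery action is "Check fault suspect list", the suspects are embedded in the Description field of the verbose fault output. Here is a minimal Python sketch for extracting them. It assumes the bracketed "Suspect List: ..." form shown in the sample above, and the helper name suspect_list is hypothetical.

```python
import re
from typing import List

# Assumption: suspects appear as "[Suspect List: a, b, c]" in Description.
SUSPECT_RE = re.compile(r"\[Suspect List:\s*([^\]]*)\]")


def suspect_list(description: str) -> List[str]:
    """Return the component names in a fault's suspect list, if any."""
    match = SUSPECT_RE.search(description)
    if match is None:
        return []
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


if __name__ == "__main__":
    description = (
        "PCIE Link Fault [Suspect List: /NODE/XRDU_0/RDU_0/PCIE_4, "
        "/NODE/XRDU_0/QSFP/P7, /NODE/XRDU_1/SW0/PORT_6]"
    )
    for component in suspect_list(description):
        print(component)
```

Each returned name follows the component naming hierarchy, so it can be matched against the inventory listing to locate the parts to inspect.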