SambaNova Fault Management (SNFM)

The SambaNova Fault Management (SNFM) framework supports reporting, diagnosing, and analyzing the system error and fault events associated with a DataScale system. The fault management framework can:

  • Take automatic recovery action for certain component failures.

  • Advise corrective action to recover from faults.

For example:

  • RDUs might get reset automatically in certain situations.

  • If you run snfadm -l fault with the verbose option, you might see advice such as "check component suspect list" or "reseat DIMM". See Access fault information.

All SNFM dependencies are installed as part of the sambaflow package. SNFM is part of the SND (SambaNova Daemon) service.

SNFM focuses on events related to the DataScale modules in a DataScale rack and to some host components, such as PCIe connectivity from the host to an XRDU. Covered components include:

  • Reconfigurable Dataflow Unit (RDU) chips

  • Tiles within an RDU

  • Local PCIe connections between RDUs

  • RDU device memory

See the following blog articles on our public website for some background:

How to use SNFADM

The user interface to SNFM is the snfadm tool.

  • package: sambaflow

  • location: /opt/sambaflow/bin/snfadm

  • format: Python 3.7 executable.

To access SNFM programmatically, you can use the SambaNova Management Layer (SNML).

Command-line options

To see all command-line options for snfadm, run:

$ /opt/sambaflow/bin/snfadm --help

The command returns all arguments.

Required SNFADM arguments

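Only -l is required. Use -l to list the different types of events (inventory, error, or fault), as discussed in Access inventory information, Access error information, and Access fault information.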

Important optional SNFADM arguments

To see a full list of arguments, run /opt/sambaflow/bin/snfadm --help. Here are some important arguments:

  • The -a/--all-faults argument includes cleared faults in the output.

  • The -v/--verbose argument displays the long format output of an error or fault entry.

  • The --csv argument returns the output in CSV format, which is easy to import into a spreadsheet and easy to feed to an automation script for processing.

  • The -u/--uuid argument can be set to a single fault’s uuid to show information about only that fault or error.

  • The -c/--clear-fault argument marks a fault as cleared. Use -c/--clear-fault together with the -u/--uuid argument to mark a specific fault as cleared. Marking a fault as cleared does not necessarily fix the problem. Other actions are usually necessary to fix the underlying cause.

  • The -ca/--clear-all argument clears all faults at once.
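The listing arguments above can be combined from a script. The following is a minimal sketch, not part of SNFM: build_list_command is a hypothetical helper that only assembles an argument list from the flags documented here. It assumes that these flags can be combined freely, which this document does not confirm, and it deliberately leaves out the clearing arguments.

```python
from typing import List, Optional

SNFADM = "/opt/sambaflow/bin/snfadm"  # location given in this document
EVENT_TYPES = ("inventory", "error", "fault")  # -l values used in this document


def build_list_command(
    event_type: str,
    verbose: bool = False,
    all_faults: bool = False,
    csv_output: bool = False,
    uuid: Optional[str] = None,
) -> List[str]:
    """Return an argv list for a read-only snfadm listing (hypothetical helper)."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    argv = [SNFADM, "-l", event_type]
    if all_faults:
        argv.append("-a")  # include cleared faults
    if verbose:
        argv.append("-v")  # long-format output
    if csv_output:
        argv.append("--csv")  # CSV output
    if uuid is not None:
        argv += ["-u", uuid]  # restrict output to one fault or error
    return argv


print(" ".join(build_list_command("fault", verbose=True)))
```

On a DataScale host, the returned list could be passed to subprocess.run(argv, capture_output=True, text=True) to capture the output for further processing.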

SNFADM output

You have several choices for SNFADM output:

  • The default output format is short (tabular) format. Short format shows the contents of the inventory, error, and fault events. For the fault events, only active faults are shown by default.

  • If you specify -v/--verbose, the display includes details for each event.

  • If you specify --csv you get output in CSV format, which is easy to import into a spreadsheet and easy to feed to an automation script for processing.

Naming of DataScale components in SNFM output

To understand SNFADM output, you have to understand how a component name is a logical representation of the system topology. Here’s an example of a DIMM module name:

/NODE/XRDU_0/RDU_1/DDRCH_1/DIMM_A0

Each level of the hierarchy represents a component class with an instance number, and the levels are delimited by /.

  • The name always starts with a forward slash delimiter /.

  • The host is the root of the hierarchy.

  • The last-level component is identified by the full name string.

  • We represent a component class name and the instance number as <component_class_name>_<instance_number>, for example, XRDU_3, DIMM_J1.

Here are the component class representations:

NODE

The SN10-H or SN30-H host

XRDU

The SN10-2 or SN30-2 module. Each module contains 2 RDUs.

RDU

Reconfigurable Dataflow Unit within XRDU

TILE

Tiles within an RDU

DDRCH

RDU device memory channel

DIMM

Memory module of a DDRCH
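The naming rules above lend themselves to a small parser. The following sketch is an illustration only: parse_component_name and parent_name are hypothetical helper names, and the code assumes that the class name is everything before the first underscore in a level and that a level without an underscore (such as NODE) has no instance number.

```python
from typing import List, Optional, Tuple


def parse_component_name(name: str) -> List[Tuple[str, Optional[str]]]:
    """Split an SNFM component name into (class, instance) pairs, root first."""
    if not name.startswith("/"):
        raise ValueError("component names start with '/'")
    levels = []
    for level in name.strip("/").split("/"):
        # Split at the first underscore; a level such as NODE has no instance.
        cls, sep, instance = level.partition("_")
        levels.append((cls, instance if sep else None))
    return levels


def parent_name(name: str) -> Optional[str]:
    """Return the name of the enclosing component, or None for the root."""
    head, _, _ = name.rpartition("/")
    return head or None


name = "/NODE/XRDU_0/RDU_1/DDRCH_1/DIMM_A0"
print(parse_component_name(name))
print(parent_name(name))
```

Walking up with parent_name is one way to find, for example, the RDU that owns a given DIMM.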

Access inventory information

You access inventory information by calling the following command:

$ /opt/sambaflow/bin/snfadm -l inventory

The output includes information about each component that is expected to be present on a system, based on that system’s configuration (e.g. SN10-8).

The output includes this information:

  • Physical State (shown as Inventory State in the output). Present or Absent, indicating whether the component was found on the system during a scan.

  • Functional State. Online, Degraded, Offline, Faulted, NA.

For example, an RDU can be in the Degraded state because one of its tiles is faulty. The RDU is functional but with reduced capacity. The fault information shows the tile fault details.

Here are the first few lines of sample output:

Platform Name: DataScale SN20-8

Physical Inventory:
Component Name                        | Serial Number       | Inventory State| Functional State
/NODE/XRDU_0/RDU_0                    | 60604234E6D94715    | Present        | Online
/NODE/XRDU_0/RDU_0/DDRCH_0/DIMM_C0    | 44FEFCED            | Present        | Online
/NODE/XRDU_0/RDU_0/DDRCH_0/DIMM_C1    | 44FEFC41            | Present        | Online
/NODE/XRDU_0/RDU_0/DDRCH_1/DIMM_A0    | 44FEE128            | Present        | Online
/NODE/XRDU_0/RDU_0/DDRCH_1/DIMM_A1    | 44FEEEB8            | Present        | Online
/NODE/XRDU_0/RDU_0/DDRCH_2/DIMM_B0    | 44FEFDEA            | Present        | Online
[...]
/NODE/XRDU_0/RDU_0/PCIE_0             | N/A                 | Present        | Online
/NODE/XRDU_0/RDU_0/PCIE_1             | N/A                 | Present        | Online
/NODE/XRDU_0/RDU_0/PCIE_2             | N/A                 | Present        | Online
/NODE/XRDU_0/RDU_0/PCIE_3             | N/A                 | Present        | Online
/NODE/XRDU_0/RDU_0/TILE_0             | N/A                 | Present        | Online
/NODE/XRDU_0/RDU_1                    | 30484234E6D94715    | Present        | Online
[...]
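For automation, the short (tabular) format can be turned into records. The sketch below is an assumption-laden illustration, not an SNFM API: parse_table is a hypothetical helper, and it assumes that the table is pipe-delimited exactly as in the sample above, with the first delimited line as the header. The embedded sample repeats a few rows from that output.

```python
from typing import Dict, List

SAMPLE = """\
Platform Name: DataScale SN20-8

Physical Inventory:
Component Name                        | Serial Number       | Inventory State| Functional State
/NODE/XRDU_0/RDU_0                    | 60604234E6D94715    | Present        | Online
/NODE/XRDU_0/RDU_0/DDRCH_0/DIMM_C0    | 44FEFCED            | Present        | Online
[...]
/NODE/XRDU_0/RDU_0/PCIE_0             | N/A                 | Present        | Online
"""


def parse_table(text: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited short-format rows into dicts keyed by header."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip titles, blank lines, and elision markers
        cells = [cell.strip() for cell in line.split("|")]
        if not header:
            header = cells  # the first delimited line is the header
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


rows = parse_table(SAMPLE)
# Collect components that need attention (anything not Online).
not_online = [r["Component Name"] for r in rows if r["Functional State"] != "Online"]
print(len(rows), not_online)
```

The same helper should also work on the short-format fault list, under the same layout assumption. Where available, the --csv output is likely the more robust input for scripts.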

Access error information

You access error information by calling the following command:

$ /opt/sambaflow/bin/snfadm -l error

The command returns details about the event source. Each error event is tagged with a UUID that the system uses to retrieve detailed information about the event and potentially associate the error event with a fault.

To see information about one error event, pass its UUID with the -u argument, like this:

$ /opt/sambaflow/bin/snfadm -l error -v -u df67f9cb-9c1d-11ed-9939-d5ca835010b5

The output for one error might look like this:

[
	Original Timestamp  2023-01-24 11:32:43
	Last Timestamp      2023-01-24 11:32:43
	Error UUID          df67f9cb-9c1d-11ed-9939-d5ca835010b5
	Error Count         1
	Error Type          ETYPE_PCIE_LINK_HEALTH
	Component Name      /NODE/XRDU_0/RDU_0/PCIE_0
	Description         ETYPE_PCIE_LINK_HEALTH
	Fault UUID          00000000-0000-0000-0000-000000000000

]
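A script can read a verbose entry like the one above and check whether the error has been associated with a fault. This is a rough sketch with hypothetical helper names (parse_verbose_entry, has_diagnosed_fault). It assumes that each field label is separated from its value by at least two whitespace characters, as in the sample, and that an all-zero Fault UUID means no fault has been diagnosed.

```python
import re
import uuid
from typing import Dict

SAMPLE = """\
[
    Original Timestamp  2023-01-24 11:32:43
    Last Timestamp      2023-01-24 11:32:43
    Error UUID          df67f9cb-9c1d-11ed-9939-d5ca835010b5
    Error Count         1
    Error Type          ETYPE_PCIE_LINK_HEALTH
    Component Name      /NODE/XRDU_0/RDU_0/PCIE_0
    Description         ETYPE_PCIE_LINK_HEALTH
    Fault UUID          00000000-0000-0000-0000-000000000000

]
"""


def parse_verbose_entry(text: str) -> Dict[str, str]:
    """Map field labels to values for one verbose (-v) entry."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("[", "]"):
            continue  # skip brackets and blank lines
        # The label ends at the first run of two or more whitespace characters.
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields


def has_diagnosed_fault(entry: Dict[str, str]) -> bool:
    """Return True if the error entry points at a real (non-zero) fault UUID."""
    return uuid.UUID(entry["Fault UUID"]) != uuid.UUID(int=0)


entry = parse_verbose_entry(SAMPLE)
print(entry["Component Name"], has_diagnosed_fault(entry))
```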

Access fault information

Error events occurring on a system can lead to components being diagnosed as faulty. When that happens, the error event has a corresponding entry for the component in the fault log.

To list all active faults, run this command:

$ /opt/sambaflow/bin/snfadm -l fault

The output might look like this:

Timestamp           | Fault UUID                 | Fault Type            | Component Name                | Status
2022-09-13 17:10:31 | 7fcce2d9-33a8-11ed-83c1    | FTYPE_PCI_LINK_HEALTH | /NODE/XRDU_2/RDU_1/PCIE_1     | Active

Each fault is associated with a specific error UUID that can be used to retrieve information about the relevant events (shown in the Error UUID field of the output below). Conversely, in an error entry, a Fault UUID of all zeros means that no associated diagnosed fault exists.

$ /opt/sambaflow/bin/snfadm -l fault -v -u ff467ec5-9e98-11ed-a6b6-84160cc0f680
[
        Timestamp           2023-01-27 15:19:07
        Fault UUID          ff467ec5-9e98-11ed-a6b6-84160cc0f680
        Fault Type          FTYPE_PCI_LINK_HEALTH
        Component Name      /NODE/XRDU_0/RDU_0/PCIE_4
        Functional State    Active
        Severity            SNFM_FATAL
        Serial Number       N/A
        Description         PCIE Link Fault [Suspect List: /NODE/XRDU_0/RDU_0/PCIE_4, /NODE/XRDU_0/QSFP/P7, /NODE/XRDU_1/SW0/PORT_6]
        Recovery Action     Check fault suspect list
        Error UUID          ff467ec4-9e98-11ed-a6b6-84160cc0f680
        Cleared Timestamp   N/A
]
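When the recovery action is to check the fault suspect list, a script can pull the suspect components out of the Description field. The sketch below is illustrative only: suspect_list is a hypothetical helper, and it assumes the "[Suspect List: a, b, c]" layout seen in the sample output above, which might differ for other fault types.

```python
import re
from typing import List

# Description value copied from the sample fault entry in this section.
DESCRIPTION = (
    "PCIE Link Fault [Suspect List: /NODE/XRDU_0/RDU_0/PCIE_4, "
    "/NODE/XRDU_0/QSFP/P7, /NODE/XRDU_1/SW0/PORT_6]"
)


def suspect_list(description: str) -> List[str]:
    """Return the component names listed in a '[Suspect List: ...]' clause."""
    match = re.search(r"\[Suspect List:\s*([^\]]*)\]", description)
    if match is None:
        return []  # no suspect list in this description
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


for component in suspect_list(DESCRIPTION):
    print(component)
```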

The Functional State field is Active for all faults that require service actions to recover the system or component state to Online. The Functional State field is Clear after the system has been serviced, that is, after the recovery actions have been performed for the faulty component.

You can use snfadm -c together with the -u argument to clear a specific fault after servicing the faulty component.

Marking a fault as cleared does not fix the problem. Other actions are usually necessary to fix the underlying cause.