SNML API reference (all users)

SNML contains APIs that are of interest to users of DataScale systems. Some examples include:

  • Checking the installed runtime version.

  • Checking the operational state of RDUs.

  • Requesting information about available and physically installed resources.

This reference gives details. See SambaNova Management Layer (SNML) for examples, discussion of the SNML gRPC server, and troubleshooting info.

GetSystemPlatformInfo

Retrieve the name of the DataScale platform you are currently using.

Summary

The GetSystemPlatformInfo API retrieves the name of the DataScale platform you are currently using.

Input

An object of type ResourceQuery.

Returns

/**
 * This data structure represents the version numbers of the components of the SNFM
 * fault management framework. It is mostly not interesting for external callers
 */
message SystemVersionInfo {
    int32 error_log_version_major = 1;
    int32 error_log_version_minor = 2;
    int32 fault_log_version_major = 3;
    int32 fault_log_version_minor = 4;
    int32 inv_log_version_major = 5;
    int32 inv_log_version_minor = 6;
    int32 policy_log_version_major = 7;
    int32 policy_log_version_minor = 8;
}


/**
 * This data structure is returned by the GetSystemPlatformInfo API
 * It contains information about what kind of DataScale system the
 * SNML server is running on
 */
message SystemPlatformInfo {
    repeated string error_details = 1;  // Any error details, if an error occurred
                                        // during the query
    uint32 request_id = 2;              // The request ID that the client provided
                                        // in the input
    string platform_name = 3;           // A string representing the name of the
                                        // DataScale platform
                                        // e.g. DataScale SN10-8 or DataScale SN30-8
    SystemVersionInfo ver_info = 4;     // A structure containing version numbers
                                        // internal to SambaNova Runtime
}

GetRduOperationalStatus

Retrieve the operational state of RDUs.

Summary

The GetRduOperationalStatus API retrieves RDUs' operational states.

Each RDU’s operational state is tracked internally by the SambaNova Runtime stack. The operational state reflects whether the RDU can be used to run applications at any given time. An RDU may be in a specific operational state due to physical presence, fault management policy, hardware issues, or other reasons. See Check the status of RDUs.

Input

/**
 * The request type for operational state API
 */
message RDUOperationalStatusRequest {
    uint32 request_id = 1;       // The request ID
    repeated uint32 rdu_ids = 2; // Each RDU has a component name like /NODE/XRDU_0/RDU_1
                                 // It also has a unique single-number identifier
                                 // created from its component name.
                                 // To calculate the single-number identifier, multiply the
                                 // (XRDU number) by the (number of RDUs per XRDU, which is 2)
                                 // and add the RDU number within the XRDU.
                                 // For example, the list [0, 1] refers to XRDU_0/RDU_0
                                 // and XRDU_0/RDU_1
}
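
The identifier calculation described in the comment above can be sketched in Python. This is a minimal illustration, not part of SNML: the function name and the regular expression are hypothetical, and the value of 2 RDUs per XRDU is taken from the comment and assumed to hold for your platform.

```python
import re

# Number of RDUs per XRDU, as stated in the comment above (assumed fixed here).
RDUS_PER_XRDU = 2

# Matches component names of the form /NODE/XRDU_<x>/RDU_<r>.
_RDU_NAME = re.compile(r"^/NODE/XRDU_(\d+)/RDU_(\d+)$")


def rdu_id_from_component_name(name: str, rdus_per_xrdu: int = RDUS_PER_XRDU) -> int:
    """Convert an RDU component name to its single-number identifier."""
    match = _RDU_NAME.match(name)
    if match is None:
        raise ValueError(f"not an RDU component name: {name!r}")
    xrdu, rdu = (int(group) for group in match.groups())
    # (XRDU number) * (RDUs per XRDU) + (RDU number within the XRDU)
    return xrdu * rdus_per_xrdu + rdu


# Build the rdu_ids list for the two RDUs of XRDU_0.
rdu_ids = [rdu_id_from_component_name(n)
           for n in ("/NODE/XRDU_0/RDU_0", "/NODE/XRDU_0/RDU_1")]
```

The resulting list can be passed as the rdu_ids field of an RDUOperationalStatusRequest.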

Returns

This API returns a stream of RDUOperationalStatus objects.

/**
 * The return type for a single RDU's operational state
 */
message RDUOperationalStatus {
    repeated string error_details = 1;  // Any error details from errors that occurred
    uint32 request_id = 2;              // The request ID provided by the caller
    uint32 rdu_id = 3;                  // The single number RDU ID described above
    RDUOperationalState state = 4;      // The RDU's current operational state
    string comp_name = 5;               // The RDU's component name
}

/**
 * This enumerator represents an RDU's operational status
 * Each value has a specific interpretation
 */
enum RDUOperationalState {
    RDU_STATE_FUNCTIONAL = 0;   // This RDU is fully healthy and can be used

    RDU_STATE_PENDING = 1;      // This RDU is currently unusable due to an operation
                                // being performed on it, such as a fault recovery event
                                // Continue polling for it to become FUNCTIONAL again.

    RDU_STATE_DEGRADED = 2;     // This RDU is usable but not in an optimal state.
                                // The SNFM framework will have more details on why

    RDU_STATE_UNAVAILABLE = 3;  // This RDU is unusable and all automatic recovery attempts
                                // have failed. Some human action is required to recover.

    RDU_STATE_ENUM_MAX = 4;     // Sentinel value
}
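
As an illustration of how a client might act on these states, the following Python sketch maps each value to a coarse next step. The RduState class is a local mirror of the enum above (the generated gRPC module is not assumed), and the action strings are hypothetical labels chosen for this example.

```python
from enum import IntEnum


class RduState(IntEnum):
    """Local mirror of RDUOperationalState; values copied from the enum above."""
    FUNCTIONAL = 0
    PENDING = 1
    DEGRADED = 2
    UNAVAILABLE = 3


def next_action(state: int) -> str:
    """Return a coarse next step for an RDU in the given operational state."""
    state = RduState(state)  # raises ValueError for the sentinel or unknown values
    if state is RduState.FUNCTIONAL:
        return "use"
    if state is RduState.PENDING:
        return "poll-again"            # wait for the RDU to become FUNCTIONAL
    if state is RduState.DEGRADED:
        return "use-and-check-snfm"    # usable; SNFM has details on why
    return "escalate"                  # UNAVAILABLE: human action is required
```

A polling loop would call next_action on the state field of each RDUOperationalStatus in the stream and retry the request only for RDUs that return "poll-again".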

GetOnlineResource

Retrieve information about RDU resources that are currently online and managed by SambaNova Runtime.

Summary

The GetOnlineResource API is used for retrieving information about RDU resources that are currently online and managed by SambaNova Runtime. It returns a hierarchical data structure that represents the DataScale node it is describing.

  • The GetOnlineResource API provides dynamic information about the node (for example, how much RDU device memory is currently free).

  • Its sibling API, GetStaticResource, provides corresponding static information (for example, how many RDUs are physically installed in this system and what their serial numbers are).

The GetOnlineResource and GetStaticResource APIs share output data types, but many fields are populated by only one of the APIs.

Input

An object of type ResourceQuery.

Returns

/**
 * SnmlNodeData is the top-level return type of the data structure for
 * GetOnlineResource and GetStaticResource
 * It represents a whole node
 */
message SnmlNodeData {
    repeated string error_details = 1;      // Error details (if the query failed)
    uint32 request_id = 2;                  // Request ID for the API call
    repeated SnmlXrduData xrdu = 3;         // List of SnmlXrduData objects;
                                            // one for each XRDU
    repeated SnmlVirtualRduData vrdu = 4;   // List of SnmlVirtualRduData objects;
                                            // one for each present vRDU
}

/**
 * This enumerator represents a component's inventory state:
 * - ABSENT if it is physically absent or unenumerated by the DataScale node
 * - PRESENT if it is physically enumerated by the DataScale node
 * - VM_PRESENT if it is enumerated, but its RDU resources are provisioned to a VF
 *   (virtual function) or if SambaNova Runtime detected that it's assigned to a VM
 * - UNKNOWN if the state cannot be checked. This is uncommon.
 */
enum InventoryState {
    ABSENT = 0;
    PRESENT = 1;
    VM_PRESENT = 2;
    UNKNOWN = 3;
}

/**
 * This data structure stores information that is applicable to any
 * type of component, and is a separate data structure to reduce duplication of fields.
 */
message ComponentInformation {
    InventoryState inv_state = 1;   // Every component has an inventory state,
                                    // described in the InventoryState enumerator.
    string ser_num = 2;             // Many components have a serial number,
                                    // which is returned as a string
                                    // For components that do not have serial numbers
                                    // (e.g. TILEs), the serial number is N/A
                                    // This field is only populated by the GetStaticResource API
    string name = 3;                // Each component has a component name, which
                                    // specifies where in the component hierarchy it is.
                                    // For example, /NODE/XRDU_0/RDU_0/PCIE_1.
                                    // This field is only populated by the GetStaticResource API
}

/**
 * This data structure represents a single XRDU on a DataScale system
 */
message SnmlXrduData {
    uint32 xrdu_id = 1;             // The XRDU's ID - from the component name
    repeated SnmlRduData rdu = 2;   // List of RDUs that are physically present
                                    // inside this XRDU
    repeated SnmlSwitchData switch = 3;
                                    // List of PCIe switches that are physically present
                                    // inside this XRDU
}

/**
 * This data structure represents an RDU within an RDU socket on an XRDU
 */
message SnmlRduData {
    uint32 rdu_id = 1;              // The RDU's socket ID within the XRDU
    uint32 n_avail_tiles = 2;       // Number of available tiles on this RDU
    uint64 ddrmem_sz = 3;           // DDR capacity of this RDU
    repeated <Internal> <i> = 4;    // Reserved
    bool is_perfect_rdu = 5;        // True if all tiles inside this RDU are healthy
    repeated SnmlTileData tile = 6; // List of SnmlTileData objects;
                                    // one for each tile on the RDU
    repeated SnmlPcieData pcie = 7; // List of SnmlPcieData objects;
                                    // one for each PCIe port on the RDU
    repeated SnmlMemoryData memory = 8; // List of SnmlMemoryData objects;
                                        // one for each DDR controller on the RDU
    repeated <Internal> <i> = 9;    // Reserved
    ComponentInformation info = 10; // Generic component information about the RDU
    repeated SnmlBDF bdf = 11;      // Multiple BDFs per RDU are possible
    uint32 en_ddr_chs = 12;         // Number of enabled DDR channels on this RDU
    uint64  <Internal> <i> = 13;    // Reserved
    repeated uint32 hep_id = 14;    // PCIe port ID of the host-facing PCIe endpoint
                                    // on this RDU
    repeated <Internal> <i> = 15;   // Reserved
}

/**
 * This data structure represents a PCIe BDF identifier
 */
message SnmlBDF {
    uint32 pci_bus = 1;         // The PCIe bus number
    uint32 pci_device = 2;      // The PCIe device identifier
    uint32 pci_function = 3;    // The PCIe function identifier
}

/**
 * This data structure represents an RDU tile inside an RDU
 */
message SnmlTileData {
    uint32 tile_id = 1;             // The tile's ID within the RDU it belongs to
    SnmlGraphData graph_info = 2;   // An optionally populated object representing
                                    // any applications that are using this tile
    ComponentInformation info = 3;  // The generic information relating to this tile
}

/**
 * This data structure represents a process that is
 * using an RDU component
 */
message SnmlGraphData {
    int32 graph_pid = 1; // The Linux PID of the application
}

/**
 * This data structure represents a PCIe port or link
 * either inside an RDU or a PCIe switch on a DataScale system
 */
message SnmlPcieData {
    uint32 pcie_id = 1;     // ID of this PCIe port relative to its parent component
    uint32 bandwidth = 2;   // Bandwidth in Gigabytes per second
                            // Bandwidth capacity for GetStaticResource
                            // and current bandwidth for GetOnlineResource
    uint32 speed = 3;       // The speed of the PCIe link in GT/sec
                            // Capacity speed for GetStaticResource and the current
                            // speed in GetOnlineResource
    uint32 width = 4;       // The PCIe link's width (number of lanes) capacity
                            // for GetStaticResource and current capacity
                            // for GetOnlineResource
    ComponentInformation info = 5; // Generic component info for this PCIe port
}




/**
 * This data structure reflects a DDR controller on the RDU
 */
message SnmlMemoryData {
    uint32 ddrch_num = 1;           // DDR controller's ID relative to
                                    // its parent RDU
    repeated SnmlDimmData dimm = 2; // A list of the DIMMs currently controlled
                                    // by this DDR controller
}

/**
 * This data structure represents a DIMM inside a DIMM slot on the XRDU
 */
message SnmlDimmData {
    string dimm_name = 1;           // The DIMM's name for identifying purposes
    uint64 size = 2;                // The capacity of the DIMM
    ComponentInformation info = 3;  // Component information about this DIMM
    bytes part_num = 4;             // The DIMM's part number
}

/**
 * A vRDU is the name for an SR-IOV VF of an RDU. A vRDU can be provisioned on an RDU PF
 * (physical function). Any VF RDUs that are currently provisioned are addressable
 * through SnmlVirtualRduData.
 *
 * When VF RDUs are provisioned, the corresponding PFs are still present, but the
 * corresponding physical tiles on the PF they were provisioned from are in the functional
 * state "virtualized" and are not usable through the PF. The PF remains present for
 * management purposes and for fault and error telemetry reporting on the memory/PCIe resources
 *
 * VF RDUs exist only in the GetOnlineResource context because they are inherently
 * transient virtual devices. For GetStaticResource, this list is empty.
 */
message SnmlVirtualRduData {
    uint32 vrdu_id = 1;         // The vRDU's unique identifier
    uint32 n_avail_tiles = 2;   // The number of available tiles inside the VF
                                // VFs may have the same number of tiles as the PF, or
                                // they may be sub-RDU VFs and have fewer.
    bool is_perfect_rdu = 3;    // True if all tiles expected to be present in the VF
                                // are present and online
    string topology = 4;        // The VF's shape, for example, 4t or 1t
    uint64 ddrmem_sz = 5;       // The amount of DDR memory assigned to the VF
    uint64 <Internal> <i> = 6;  // Reserved
    repeated SnmlVirtualTileData vtile = 7;
                                // List of SnmlVirtualTileData: one for each VTILE in the VF
    ComponentInformation info = 8;
                                // The ComponentInformation object that contains generic
                                // information about this VF
    uint32 pci_bus = 9;         // The VF's PCIe BDF identifier's bus number
    uint32 pci_device = 10;     // The VF's PCIe BDF identifier's device number
    uint32 pci_function = 11;   // The VF's PCIe BDF identifier's function number
}

/**
 * This data structure represents vTILEs - tiles that belong to vRDUs
 */
message SnmlVirtualTileData {
    uint32 tile_id = 1;             // The vTILE's ID within the RDU
    SnmlGraphData graph_info = 2;   // A data structure representing any graphs that are
                                    // currently using this vTILE
    ComponentInformation info = 3;  // The ComponentInformation object containing generic
                                    // information about this vTILE
}

/**
 * This data structure represents a PCIe switch inside an XRDU
 */
message SnmlSwitchData {
    uint32 switch_id = 1;           // The switch's ID, taken from the component name
                                    // /NODE/XRDU_0/SW_1 or SW_0
    uint32 num_ports = 2;           // Number of ports on the switch
    ComponentInformation info = 3;  // Generic component information for this switch
    repeated SnmlPcieData pcie = 4; // List of SnmlPcieData objects;
                                    // one for each port on the switch
}
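
The hierarchy above (node, then XRDUs, then RDUs) can be walked with two nested loops. The sketch below is illustrative only: it works on any object that exposes the same attribute names as SnmlNodeData, and it uses a hand-built stand-in rather than a real GetOnlineResource response because the generated client module is not assumed here. The function name is hypothetical.

```python
from types import SimpleNamespace as NS


def available_tiles_by_rdu(node) -> dict:
    """Map 'XRDU_<x>/RDU_<r>' to n_avail_tiles for each RDU under the node."""
    result = {}
    for xrdu in node.xrdu:
        for rdu in xrdu.rdu:
            key = f"XRDU_{xrdu.xrdu_id}/RDU_{rdu.rdu_id}"
            result[key] = rdu.n_avail_tiles
    return result


# Stand-in shaped like SnmlNodeData; the values are made up for the example.
node = NS(
    xrdu=[
        NS(xrdu_id=0, rdu=[NS(rdu_id=0, n_avail_tiles=4),
                           NS(rdu_id=1, n_avail_tiles=2)]),
    ],
    vrdu=[],
)
tiles = available_tiles_by_rdu(node)
```

The same traversal pattern extends to the tile, pcie, and memory lists inside each SnmlRduData.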

GetStaticResource

Retrieve information about RDU resources that are physically configured on this DataScale system.

Summary

The GetStaticResource API is used for retrieving information about RDU resources that are physically configured on this DataScale system. This API returns a hierarchical data structure that represents the DataScale node it is describing.

  • The GetStaticResource API provides static information (for example, how many RDUs are physically installed in this system and what their serial numbers are).

  • Its sibling API GetOnlineResource provides dynamic information about the node (for example, how much RDU device memory is currently free).

The GetOnlineResource and GetStaticResource APIs share output data types, but many fields are populated by only one of the APIs.

Input

An object of type ResourceQuery.

Returns

This function returns the following information. For details about the component hierarchy, see Naming of DataScale components in SNFM output.

/**
 * SnmlNodeData is the top-level return type of the data structure for
 * GetOnlineResource and GetStaticResource
 * It represents a whole node
 */
message SnmlNodeData {
    repeated string error_details = 1;      // Any error details (if the query failed)
    uint32 request_id = 2;                  // The request ID for the API call
    repeated SnmlXrduData xrdu = 3;         // List of SnmlXrduData objects; one for each XRDU
    repeated SnmlVirtualRduData vrdu = 4;   // List of SnmlVirtualRduData objects;
                                            // one for each present vRDU
}

/**
 * This enumerator represents a component's Inventory state.
 * A component's inventory state is:
 * - ABSENT if it is physically absent or unenumerated by the DataScale node
 * - PRESENT if it is physically enumerated by the DataScale node
 * - VM_PRESENT if it is enumerated, but its RDU resources are provisioned to a VF,
 *   or if SambaNova Runtime has detected that it is assigned to a virtual machine
 * - UNKNOWN if the state cannot be checked. This is uncommon.
 */
enum InventoryState {
    ABSENT = 0;
    PRESENT = 1;
    VM_PRESENT = 2;
    UNKNOWN = 3;
}

/**
 * This data structure stores information that is applicable to any
 * type of component, and is a separate data structure to reduce duplication of fields.
 */
message ComponentInformation {
    InventoryState inv_state = 1;
                        // Every component has an inventory state, as described in the
                        // InventoryState enumerator.
    string ser_num = 2; // Many components have a serial number, which is returned as a string
                        // For components that do not have serial numbers (like TILEs), the
                        // serial number is N/A
                        // This field is populated only by the GetStaticResource API
    string name = 3;    // Each component has a component name, which specifies where in the
                        // component hierarchy it is (e.g. /NODE/XRDU_0/RDU_0/PCIE_1).
                        // This field is populated only by the GetStaticResource API
}

/**
 * This data structure represents a single XRDU on a DataScale system
 */
message SnmlXrduData {
    uint32 xrdu_id = 1;             // The XRDU's ID - from the component name
    repeated SnmlRduData rdu = 2;   // List of RDUs that are physically present inside this XRDU
    repeated SnmlSwitchData switch = 3;
                                    // List of PCIe switches that are physically
                                    // present inside this XRDU
}



/**
 * This data structure represents an RDU within an RDU socket on an XRDU
 */
message SnmlRduData {
    uint32 rdu_id = 1;              // The RDU's socket ID within the XRDU
    uint32 n_avail_tiles = 2;       // Number of available tiles on this RDU
    uint64 ddrmem_sz = 3;           // DDR capacity of this RDU
    repeated <Internal> <i> = 4;    // Reserved
    bool is_perfect_rdu = 5;        // True if all tiles inside this RDU are healthy
    repeated SnmlTileData tile = 6; // List of SnmlTileData objects;
                                    // one for each tile on the RDU
    repeated SnmlPcieData pcie = 7; // List of SnmlPcieData objects;
                                    // one for each PCIe port on the RDU
    repeated SnmlMemoryData memory = 8; // List of SnmlMemoryData objects;
                                        // one for each DDR controller on the RDU
    repeated <Internal> <i> = 9;    // Reserved
    ComponentInformation info = 10; // Generic component information about the RDU as a whole
    repeated SnmlBDF bdf = 11;      // Multiple BDFs per RDU are possible
    uint32 en_ddr_chs = 12;         // Number of enabled DDR channels on this RDU
    uint64  <Internal> <i> = 13;    // Reserved
    repeated uint32 hep_id = 14;    // PCIe port ID of the host-facing PCIe endpoint
                                    // on this RDU
    repeated <Internal> <i> = 15;   // Reserved
}

/**
 * This data structure represents a PCIe BDF identifier
 */
message SnmlBDF {
    uint32 pci_bus = 1;         // The PCIe bus number
    uint32 pci_device = 2;      // The PCIe device identifier
    uint32 pci_function = 3;    // The PCIe function identifier
}

/**
 * This data structure represents an RDU tile inside an RDU
 */
message SnmlTileData {
    uint32 tile_id = 1;             // The tile's ID within the RDU it belongs to
    SnmlGraphData graph_info = 2;   // An optionally populated object representing
                                    // any applications that are using this tile
    ComponentInformation info = 3;  // The generic information relating to this tile
}

/**
 * This data structure represents a process that is
 * using an RDU component
 */
message SnmlGraphData {
    int32 graph_pid = 1;            // The Linux PID of the application
}

/**
 * This data structure represents a PCIe port or link
 * either inside an RDU or a PCIe switch on a DataScale system
 */
message SnmlPcieData {
    uint32 pcie_id = 1;     // ID of this PCIe port relative to its parent component
    uint32 bandwidth = 2;   // Bandwidth in Gigabytes per second
                            // Bandwidth capacity for GetStaticResource and current
                            // bandwidth for GetOnlineResource
    uint32 speed = 3;       // Speed of the PCIe link in GT/sec
                            // Capacity speed for GetStaticResource and the current
                            // speed in GetOnlineResource
    uint32 width = 4;       // PCIe link's width (number of lanes) capacity for
                            // GetStaticResource and current for GetOnlineResource
    ComponentInformation info = 5; // Generic component info for this PCIe port
}

/**
 * This data structure reflects a DDR controller on the RDU
 */
message SnmlMemoryData {
    uint32 ddrch_num = 1;           // The DDR controller's ID
                                    // relative to its parent RDU
    repeated SnmlDimmData dimm = 2; // List of the DIMMs currently controlled
                                    // by this DDR controller
}

/**
 * This data structure represents a DIMM inside a DIMM slot on the XRDU
 */
message SnmlDimmData {
    string dimm_name = 1;           // The DIMM's name for identifying purposes
    uint64 size = 2;                // The capacity of the DIMM
    ComponentInformation info = 3;  // Component information about this DIMM
    bytes part_num = 4;             // The DIMM's part number
}

/**
 * A vRDU is the name for an SR-IOV VF of an RDU. vRDUs can be provisioned on RDU PFs.
 * Any VF RDUs that are currently provisioned will be addressable through this list.
 *
 * When VF RDUs are provisioned, the corresponding PFs will still be present, but the
 * corresponding physical tiles on the PF they were provisioned from are in the functional
 * state "virtualized" and they are not usable through the PF. The PF remains present for
 * management purposes and for fault or error telemetry reporting on the memory/PCIe resources
 *
 * VF RDUs exist only in the GetOnlineResource context because they are inherently
 * transient virtual devices. For GetStaticResource, this list is empty.
 */
message SnmlVirtualRduData {
    uint32 vrdu_id = 1;         // The vRDU's unique identifier
    uint32 n_avail_tiles = 2;   // The number of available tiles inside the VF
                                // VFs may have the same number of tiles as the PF, or
                                // they may be sub-RDU VFs and have fewer.
    bool is_perfect_rdu = 3;    // True if all tiles expected to be present in the VF
                                // are present and online
    string topology = 4;        // The VF's shape: 4t or 1t, for example
    uint64 ddrmem_sz = 5;       // The amount of DDR memory assigned to the VF
    uint64 <Internal> <i> = 6;  // Reserved
    repeated SnmlVirtualTileData vtile = 7;
                                // List of SnmlVirtualTileData: one for each VTILE in the VF
    ComponentInformation info = 8;
                                // The ComponentInformation object containing generic
                                // information about this VF
    uint32 pci_bus = 9;         // The VF's PCIe BDF identifier's bus number
    uint32 pci_device = 10;     // The VF's PCIe BDF identifier's device number
    uint32 pci_function = 11;   // The VF's PCIe BDF identifier's function number
}

/**
 * This data structure represents vTILEs - tiles that belong to vRDUs
 */
message SnmlVirtualTileData {
    uint32 tile_id = 1;             // The vTILE's ID within the RDU
    SnmlGraphData graph_info = 2;   // A data structure representing any graphs that are
                                    // currently using this vTILE
    ComponentInformation info = 3;  // The ComponentInformation object containing generic
                                    // information about this vTILE
}

/**
 * This data structure represents a PCIe switch inside an XRDU
 */
message SnmlSwitchData {
    uint32 switch_id = 1;           // The switch's ID, taken from the component name
                                    // e.g. /NODE/XRDU_0/SW_1 or SW_0
    uint32 num_ports = 2;           // The number of ports on the switch
    ComponentInformation info = 3;  // Generic component information pertaining to this switch
    repeated SnmlPcieData pcie = 4; // A list of SnmlPcieData objects;
                                    // one for each port on the switch
}
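
Because SnmlPcieData reports capacity under GetStaticResource and current values under GetOnlineResource, comparing the two can reveal links that trained below their capacity. The Python sketch below shows one way to do that for the ports of a single parent component. It is an assumption-laden illustration: the function name is hypothetical, ports are paired by pcie_id, and the inputs are stand-in objects rather than real API responses.

```python
from types import SimpleNamespace as NS


def links_below_capacity(static_ports, online_ports) -> list:
    """Return the pcie_id of each port whose current speed or width is below capacity.

    Both arguments are lists of SnmlPcieData-like objects that belong to the
    same parent component (one RDU or one switch).
    """
    capacity = {port.pcie_id: port for port in static_ports}
    below = []
    for port in online_ports:
        cap = capacity.get(port.pcie_id)
        if cap is None:
            continue  # no static record to compare against
        if port.speed < cap.speed or port.width < cap.width:
            below.append(port.pcie_id)
    return below


# Made-up values: port 1 is running at a lower speed than its capacity.
static_ports = [NS(pcie_id=0, speed=16, width=16), NS(pcie_id=1, speed=16, width=16)]
online_ports = [NS(pcie_id=0, speed=16, width=16), NS(pcie_id=1, speed=8, width=16)]
slow = links_below_capacity(static_ports, online_ports)
```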

GetSystemFaultState

Summary

The GetSystemFaultState API retrieves information about the inventory components on the system (the physical components of the DataScale node) and the functional state of each. In contrast with the GetSystemFaultLog API, which returns an entry for every fault that has been diagnosed, this API reports per-component state. It returns an entry for a component only if that component's state is not ONLINE, so an empty result means that all components on the system are in a healthy state.

Input

An object of type ResourceQuery.

Returns

This API returns a stream of objects of type SystemFaultState. Clients can generally treat a streamed response like a list of indeterminate length. See the gRPC documentation on handling response streams on the client side for details.

/**
 * This data structure represents a single component
 * The GetSystemFaultState API returns a stream of these
 */
message SystemFaultState {
    repeated string error_details = 1;  // Error details for errors that occurred
                                        // during this query
    uint32 request_id = 2;              // The request ID provided by the caller
    string comp_name = 3;               // Name of the component this entry describes
    string serial_num = 4;              // Serial number of the component, if applicable
    string fault_state = 6;             // Functional state that this component is in
}
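
Since only components that are not ONLINE appear in the stream, a client can reduce the response to a small mapping and treat an empty mapping as healthy. The following Python sketch shows that pattern. It accepts any iterable of SystemFaultState-like objects, so it works with a streamed response or with a plain list. The function name is hypothetical, raising on error_details is a design choice made for this example, and the state string "FAULTED" is a made-up placeholder.

```python
from types import SimpleNamespace as NS


def unhealthy_components(entries) -> dict:
    """Map comp_name to fault_state for every entry in the response stream."""
    unhealthy = {}
    for entry in entries:
        if entry.error_details:
            # The query itself failed; surface the details instead of guessing.
            raise RuntimeError("; ".join(entry.error_details))
        unhealthy[entry.comp_name] = entry.fault_state
    return unhealthy


# Stand-in stream with one entry; "FAULTED" is a placeholder state name.
stream = iter([NS(error_details=[], request_id=1,
                  comp_name="/NODE/XRDU_0/RDU_1", serial_num="",
                  fault_state="FAULTED")])
states = unhealthy_components(stream)
all_healthy = not unhealthy_components(iter([]))
```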

GetSystemFaultLog

Retrieve the system fault information from the SambaNova Fault Management (SNFM) framework.

Summary

The GetSystemFaultLog API retrieves the system fault information from the SambaNova Fault Management (SNFM) framework. The SNFM framework captures telemetry about faults, errors, and the system’s physical and virtual inventory, and caches historical and current information for user consumption.

The API returns all diagnosed faults that currently exist on this system, including active faults and faults that have been cleared automatically by SambaNova Runtime or manually by a system administrator.

A hardware fault is different from a hardware event:

  • A fault is a diagnosis: the SNFM framework marks a hardware component as faulty because enough events occurred on it to suggest that something might be wrong with that hardware.

  • An event is simply something that occurred in hardware.

Input

An object of type SystemFaultLogQuery. This object is used for selecting which entries in the database are returned to the caller.

If you provide a default object (for example, if you create a Python SystemFaultLogQuery with the code SystemFaultLogQuery()), all entries are returned.

/**
 * A query object for the System Fault Log
 * Used for the caller to specify what kinds of entries should be returned
 */
message SystemFaultLogQuery {
    uint32 request_id = 1;      // The request ID, for caller's consumption
    uint64 start_timestamp = 2; // The starting timestamp of events to return.
                                // This is a UNIX timestamp, and all events from after this
                                // timestamp will be returned.
    uint64 end_timestamp = 3;   // The ending timestamp of events to return.
                                // This is a UNIX timestamp, and all events from before this
                                // timestamp will be returned.
    string fault_uuid = 4;      // A UUID identifying a specific fault. If this field
                                // is set, only that fault is returned.
    string comp_name = 5;       // Component name for requesting only faults on
                                // a single component
    string fault_type = 6;      // Fault type if only a single fault type is of interest
    string fault_state = 7;     // Fault state of the component affected by this fault
    string err_uuid = 8;        // UUID of the error that caused this fault to be diagnosed
    string severity = 9;        // Severity of this fault (CRITICAL, FATAL, etc.)
    bool vrdu_database = 10;    // Set this flag to True if you want information about
                                // faults that occurred on vRDUs instead of physical RDUs
}
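
To restrict a query to a time window, set start_timestamp and end_timestamp. The Python sketch below computes both values for "the last N hours". It assumes that the UNIX timestamps are expressed in whole seconds, which the message definition does not state explicitly, and the helper name is hypothetical. The helper returns a plain dict so that it does not depend on the import path of the generated SystemFaultLogQuery class.

```python
from datetime import datetime, timedelta, timezone


def fault_query_window(hours: float, now: datetime = None) -> dict:
    """Return start_timestamp/end_timestamp values covering the last `hours` hours."""
    if now is None:
        now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    return {
        "start_timestamp": int(start.timestamp()),  # assumed to be in seconds
        "end_timestamp": int(now.timestamp()),
    }


# Fields for a query covering the 24 hours before a fixed reference time.
reference = datetime(2024, 1, 2, tzinfo=timezone.utc)
window = fault_query_window(24, now=reference)
```

Protobuf message classes generated for Python accept field values as keyword arguments, so the dict can be expanded into the constructor, for example SystemFaultLogQuery(request_id=1, **window).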

Returns

This API returns a stream of objects of type SystemFaultLog. Clients can generally treat a streamed response like a list of indeterminate length. See the gRPC documentation on handling response streams on the client side for details.

For more information about the data returned by this API, see the documentation for SNFM and the System Policy Data information retrievable through SNFADM or the SNML GetSystemPolicyInfo API.

/**
 * This data structure represents a single SNFM Fault Log entry
 * The GetSystemFaultLog API returns a stream of events of this type
 */
message SystemFaultLog {

    uint32 request_id = 1;          // The request ID provided by the caller
    fixed64 timestamp = 2;          // The timestamp when this fault was diagnosed
                                    // This is a UNIX timestamp
    string fault_uuid = 3;          // The UUID identifying this fault
    string severity = 4;            // Severity of this fault (CRITICAL, FATAL, etc.)
    string comp_name = 5;           // Component name affected by this fault
    bytes comp_ser_num = 6;         // Serial number of the component affected by this fault
    string fault_desc = 7;          // Human-readable fault description with more
                                    // details on what this fault means
    string recovery_act = 8;        // A human-readable recovery action that administrators
                                    // can take to recover from this fault
    string fault_type = 9;          // Name of this fault type
    string err_uuid = 5;            // UUID identifying the error event that caused
                                    // this fault to be diagnosed
    fixed64 cleared_timestamp = 12; // The timestamp that this fault was marked cleared,
                                    // if it is cleared. Zero means the fault is currently
                                    // active. This is a unix timestamp.
    repeated string error_details = 13;  // Error details, if any error occurred
                                         // while constructing this response
}
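
As a minimal sketch of how the cleared_timestamp convention might be used, the following Python separates active faults from cleared ones. The FaultEntry class is a hypothetical stand-in for the generated SystemFaultLog message (it mirrors only three fields), so the example runs without the SNML client stubs. The sample data is made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FaultEntry:
    """Hypothetical stand-in mirroring three SystemFaultLog fields."""
    fault_uuid: str
    severity: str
    cleared_timestamp: int  # 0 means the fault is still active


def split_active_cleared(
    entries: Iterable[FaultEntry],
) -> Tuple[List[FaultEntry], List[FaultEntry]]:
    """Return (active, cleared) lists based on cleared_timestamp."""
    active, cleared = [], []
    for entry in entries:
        # Per the field comment, a zero cleared_timestamp marks an active fault.
        (active if entry.cleared_timestamp == 0 else cleared).append(entry)
    return active, cleared


# Made-up sample data for illustration only.
sample = [
    FaultEntry("uuid-a", "CRITICAL", 0),
    FaultEntry("uuid-b", "FATAL", 1700000000),
]
active, cleared = split_active_cleared(sample)
print([e.fault_uuid for e in active])   # ['uuid-a']
print([e.fault_uuid for e in cleared])  # ['uuid-b']
```

With real client stubs, the same function could be applied directly to the entries yielded by the API, since it only reads the cleared_timestamp attribute.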

GetSystemErrorLog

Retrieve system events from the SambaNova Fault Management (SNFM) framework.

Summary

The GetSystemErrorLog API returns the hardware events (mostly error events) that have occurred on the system since it was configured or since the log was manually refreshed. A hardware event is different from a hardware fault:

  • An event is simply something that occurred in hardware.

  • A fault, on the other hand, is a decision by the SNFM framework to diagnose a hardware component as faulty, because enough events occurred on that component to make SNFM suspect that something is wrong with it.

Input

An object of type SystemErrorLogQuery. This object is used for selecting what entries in the database should be returned to the caller.

If you provide a default object (for example, by creating a Python SystemErrorLogQuery with the code SystemErrorLogQuery()), all entries are returned.

/**
 * A query object for the System Error Log
 * Used for the caller to specify what kinds of entries should be returned
 */
message SystemErrorLogQuery {
    uint32 request_id = 1;      // The request ID, for caller's consumption
    uint64 start_timestamp = 2; // The starting timestamp of events to return.
                                // This is a UNIX timestamp, and all events from after this
                                // timestamp will be returned.
    uint64 end_timestamp = 3;   // The ending timestamp of events to return.
                                // This is a UNIX timestamp, and all events from before this
                                // timestamp will be returned.
    string err_uuid = 4;        // A UUID identifying a specific error event. If this field
                                // is set, only that error will be returned.
    uint32 err_count_min = 5;   // The bottom of a range of error counts to consider
    uint32 err_count_max = 6;   // The top of a range of error counts to consider
    string err_type = 7;        // The string of the name of a specific type of error
                                // for example, ETYPE_TILE_HANG.
                                // If passed, only those events will be returned
    string comp_name = 8;       // Component name for specifying only errors on
                                // a single component
    string fault_uuid = 9;      // A UUID identifying a specific diagnosed fault.
                                // If this field is set, only errors that led to that fault
                                // being diagnosed will be returned.
    bool vrdu_database = 10;    // Set this flag to True if you want information about
                                // errors that occurred on vRDUs instead of physical RDUs
}
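
To illustrate the timestamp fields, here is a small sketch that computes start_timestamp and end_timestamp values covering the last N hours. The helper name error_log_window is hypothetical, and the sketch assumes that timestamps are whole seconds since the UNIX epoch. It returns a plain dict so that it runs without the SNML client stubs.

```python
import time
from typing import Dict, Optional


def error_log_window(hours: float, now: Optional[float] = None) -> Dict[str, int]:
    """Build start/end timestamp values covering the last `hours` hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    end = int(time.time() if now is None else now)
    # Assumption: timestamps are whole seconds since the UNIX epoch.
    start = end - int(hours * 3600)
    return {"start_timestamp": start, "end_timestamp": end}


# A fixed "now" keeps the result reproducible.
window = error_log_window(24, now=1_700_000_000)
print(window)  # {'start_timestamp': 1699913600, 'end_timestamp': 1700000000}
```

With generated Python stubs, these values could be passed as keyword arguments, for example SystemErrorLogQuery(err_type="ETYPE_TILE_HANG", **window). Fields that are left unset do not restrict the result.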

Returns

This API returns a stream of objects of type SystemErrorLog. Generally, users can treat a server-streaming API like an API that returns a list of indeterminate length. See the gRPC documentation on handling streams on the client side for details.

For more information about the data returned by this API, see the documentation for SNFM and the System Policy Data information retrievable through SNFADM or the SNML GetSystemPolicyInfo API.

/**
 * This data structure represents a single SNFM Error Log entry
 * The GetSystemErrorLog API returns a stream of events of this type
 */
message SystemErrorLog {
    repeated string error_details = 1;  // Error details, if any error occurred
                                        // while constructing this response
    uint32 request_id = 2;              // Request ID provided by the caller
    fixed64 orig_timestamp = 3;         // First timestamp this error occurred
                                        // This is a UNIX timestamp
    fixed64 last_timestamp = 4;         // Most recent timestamp this error occurred
                                        // This is a UNIX timestamp
    string err_uuid = 5;                // UUID identifying this error event
    uint32 err_count = 6;               // Number of errors of this type that were
                                        // recorded between the times specified
                                        // in the two timestamps above
    string err_type = 7;                // Error type that this entry reflects
    string err_data = 8;                // Architecture-specific error data
    string err_type_desc = 9;           // Human-readable error type description with
                                        // details on what this error means
    string comp_name = 10;              // Component name this error occurred on
    string fault_uuid = 11;             // Fault UUID if this error led to a fault
                                        // being diagnosed. For errors that did not
                                        // lead to a fault, this will be all zeros
}
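
In Python, a gRPC call that returns a stream gives the caller an iterator, so the entries can be consumed with an ordinary loop or comprehension. The sketch below uses a generator as a stand-in for that iterator and keeps only the errors that led to a fault. ErrorEntry and fake_stream are hypothetical, the UUIDs are made up, and the exact "all zeros" format is an assumption (hyphens are stripped before the check).

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class ErrorEntry:
    """Hypothetical stand-in mirroring three SystemErrorLog fields."""
    err_uuid: str
    err_count: int
    fault_uuid: str


def led_to_fault(entry: ErrorEntry) -> bool:
    """True if fault_uuid contains anything other than zeros and hyphens."""
    # Assumption: "all zeros" may or may not include UUID hyphens, so strip them.
    return entry.fault_uuid.replace("-", "").strip("0") != ""


def fake_stream() -> Iterator[ErrorEntry]:
    """Stands in for the iterator that a streaming gRPC call returns."""
    yield ErrorEntry("err-1", 3, "00000000-0000-0000-0000-000000000000")
    yield ErrorEntry("err-2", 7, "3f2b8c1e-0000-4000-8000-0123456789ab")


# The stream is consumed like any iterable; its length is unknown up front.
faulting: List[ErrorEntry] = [e for e in fake_stream() if led_to_fault(e)]
print([e.err_uuid for e in faulting])  # ['err-2']
```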

GetHangRecoveryHistory

Summary

The GetHangRecoveryHistory API retrieves all recorded hang-recovery events that have happened on the RDU or RDUs specified in the query.

The SNFM framework tracks hang recoveries, both successful and unsuccessful, and this API returns information about those events.

Input

/**
 * This is the request type for the GetHangRecoveryHistory API
 * The rdu_ids field specifies which RDUs to return information about
 */
message HangRecoveryHistoryRequest{
    uint32 request_id = 1;          // Request ID
    repeated uint32 rdu_ids = 2;    // List of RDU IDs to query about
}

Returns

This API returns a stream of HangRecoveryHistory objects, one for each RDU in the request list.

/**
 * This data structure represents the entire hang recovery history for one RDU
 */
message HangRecoveryHistory {
    repeated string error_details = 1;  // Error details, if applicable
    uint32 request_id = 2;              // Request ID provided by caller
    uint32 rdu_id = 3;                  // The RDU ID that this object pertains to
    repeated HangRecoveryEvent hang_recovery_log = 4;
                                        // A list of HangRecoveryEvents
                                        // that occurred on this RDU
}

/**
 * This data structure represents a single hang-recovery event
 */
message HangRecoveryEvent{
    uint32 rdu_id = 3;                      // The RDU that was hang-recovered
    fixed64 timestamp = 4;                  // UNIX timestamp when the event occurred
    HangRecoveryEventType event_type = 6;   // Kind of hang recovery event
    HangRecoveryOutcome outcome = 7;        // Outcome of the hang recovery event
    string comp_name = 8;                   // Component name of the affected RDU or tile
}

/**
 * This enumerator specifies the different kinds of hang recovery events
 */
enum HangRecoveryEventType {
    TILE_RESET = 0;             // A tile-level reset, on a single RDU tile
    CHIP_RESET = 1;             // A chip-level reset, applying to a group of 4 tiles
    OTHER = 2;                  // Something else, like a DC power cycle, for example
    HANG_RECOVERY_ENUM_MAX = 3; // Sentinel value
}

/**
 * This enumerator represents the outcome of a
 * hang-recovery event
 */
enum HangRecoveryOutcome {
    IN_PROGRESS = 0;    // The event is currently in progress
    SUCCEEDED = 1;      // The hang recovery was successful
    FAILED = 2;         // The hang recovery failed and the tile was diagnosed as faulty
    PENDING = 3;
}
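
As an illustration of how the hang-recovery data might be summarized, the following sketch counts FAILED recoveries per RDU. The IntEnum mirrors the HangRecoveryOutcome values listed above. The Event class is a hypothetical stand-in for HangRecoveryEvent with only two fields, and the events are made up.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable


class HangRecoveryOutcome(IntEnum):
    """Mirrors the HangRecoveryOutcome enum values listed above."""
    IN_PROGRESS = 0
    SUCCEEDED = 1
    FAILED = 2
    PENDING = 3


@dataclass
class Event:
    """Hypothetical stand-in for HangRecoveryEvent (two fields only)."""
    rdu_id: int
    outcome: HangRecoveryOutcome


def failed_recoveries_per_rdu(events: Iterable[Event]) -> Dict[int, int]:
    """Count FAILED hang recoveries for each RDU ID."""
    counts = Counter(
        e.rdu_id for e in events if e.outcome == HangRecoveryOutcome.FAILED
    )
    return dict(counts)


# Made-up events for illustration only.
events = [
    Event(0, HangRecoveryOutcome.SUCCEEDED),
    Event(0, HangRecoveryOutcome.FAILED),
    Event(1, HangRecoveryOutcome.FAILED),
    Event(1, HangRecoveryOutcome.FAILED),
]
print(failed_recoveries_per_rdu(events))  # {0: 1, 1: 2}
```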

GetRuntimeVersion

Retrieve information about the system and the currently installed SambaNova Runtime software on the system.

Summary

The GetRuntimeVersion API can be used to retrieve information about the system and the currently installed SambaNova Runtime software on the system. It takes an object of type ResourceQuery with no required fields as input, and returns an object of type RuntimeVersionInfo. The RuntimeVersionInfo object contains the versions of several components on the system.

  • The SambaFlow version field represents the version of sambaflow, sambanova-runtime, and any other SambaFlow related packages that are installed on the system.

  • The other version numbers reflect semantic versions of interfaces that are supported by the currently installed SambaNova Runtime stack.

  • The PEF version reflects the version of PEF that the SambaNova Runtime package was built against. SambaNova Runtime supports PEFs generated against any PEF version compatible with that version (following semantic versioning rules). If a PEF is not compatible, either upgrade Runtime or recompile the PEF, depending on your situation.

  • The SNML and SNML Admin versions show semantic versions for the two SNML services.

Input

An object of type ResourceQuery.

Returns

/**
 * The SemanticVersion data structure represents a 3-part semantic version
 * where the effects of version changes meet the SemVer standards
 */
message SemanticVersion {
    uint32 major = 1;
    uint32 minor = 2;
    uint32 patch = 3;
}

/**
 * This is the output type for the GetRuntimeVersion API
 */
message RuntimeVersionInfo{
    repeated string error_details = 1;
    uint32 request_id = 2;
    string sambaflow_version = 3;           // String representing the installed
                                            // SambaFlow version (e.g. 1.16.2)
    SemanticVersion runtime_if_version = 4; // Internal to SambaFlow
    SemanticVersion samba_runtime_version = 5;
                                            // Interface between SambaFlow and
                                            // SambaNova Runtime
    SemanticVersion pef_version = 6;        // PEF version supported by the installed Runtime
    SemanticVersion snml_version = 7;       // Version of SNML service currently running
    SemanticVersion snml_admin_version = 9; // Version of SNML Admin service currently
                                            // running
}
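
The PEF compatibility rule described in the Summary can be sketched as a semantic-version comparison. This is one reading of the SemVer rules, not a documented SNML check: a PEF is treated as supported when the major versions match and the Runtime's PEF minor version is at least the minor version the PEF was generated against. SemVer and pef_supported are hypothetical names, and the sketch does not special-case major version 0, for which SemVer gives no compatibility guarantees.

```python
from typing import NamedTuple


class SemVer(NamedTuple):
    """Stand-in for the SemanticVersion message (major, minor, patch)."""
    major: int
    minor: int
    patch: int


def pef_supported(runtime_pef: SemVer, pef_built_against: SemVer) -> bool:
    """Assumed rule: same major, and Runtime minor >= the PEF's minor.

    Patch versions are ignored because, under SemVer, they do not
    change the interface.
    """
    if runtime_pef.major != pef_built_against.major:
        return False
    return runtime_pef.minor >= pef_built_against.minor


# Made-up version numbers for illustration only.
runtime = SemVer(2, 4, 0)
print(pef_supported(runtime, SemVer(2, 3, 1)))  # True: older minor, same major
print(pef_supported(runtime, SemVer(2, 5, 0)))  # False: PEF needs a newer minor
print(pef_supported(runtime, SemVer(1, 9, 0)))  # False: different major
```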

GetSystemPolicyInfo

Retrieve the system’s fault-diagnosis policy from the SambaNova Fault Management (SNFM) framework.

Summary

The GetSystemPolicyInfo API retrieves the system’s fault-diagnosis policy from the SambaNova Fault Management (SNFM) framework. The SNFM framework captures telemetry about faults, errors, and the system’s physical and virtual inventory, and caches historical and current information for users' consumption. See SambaNova Fault Management (SNFM) for background information.

The system fault policy is used to decide how errors and events that occur on the system should lead to SNFM faults being diagnosed. This API allows users to inspect the rules that SNFM uses to diagnose faults.

Input

An object of type SystemPolicyInfoQuery. This object has only one field - the request ID.

message SystemPolicyInfoQuery {
    uint32 request_id = 1;  // Used to identify the caller from the matching
                            // request_id in the output field.
}

Returns

This API returns a stream of objects of type SystemPolicyInfo. Generally, users can treat a server-streaming API like an API that returns a list of indeterminate length. See the gRPC documentation on handling streams on the client side for details.

For more information about the data returned by this API, see SambaNova Fault Management (SNFM).

/**
 * This data structure represents a single system fault policy entry
 * The GetSystemPolicyInfo API returns a stream of events of this type
 */
message SystemPolicyInfo {
    uint32 request_id = 1;              // Request ID provided by the caller
    string fault_type = 2;              // Type of fault that this policy relates to
    string error_type = 3;              // Type of error that this policy will monitor
                                        // to diagnose faults of the type specified above
    string action = 4;                  // Action that SNFM takes on the faulted component
                                        // when this fault is diagnosed
    string severity = 5;                // The severity of this fault
    string fault_desc = 6;              // Human readable brief description of this fault
    string fault_detail_desc = 7;       // Human readable detailed description of this fault
    string recovery_act = 8;            // Recovery action for this kind of fault
    repeated string error_details = 9;  // Architecture-specific error details
}
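
Because each policy entry links one error type to one fault type, a common way to inspect the rules is to group the entries by fault type. The sketch below does that with a hypothetical PolicyEntry stand-in for SystemPolicyInfo. The fault type names and ETYPE_EXAMPLE are placeholders, not real SNFM identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class PolicyEntry:
    """Hypothetical stand-in mirroring two SystemPolicyInfo fields."""
    fault_type: str
    error_type: str


def errors_by_fault_type(entries: Iterable[PolicyEntry]) -> Dict[str, List[str]]:
    """Map each fault type to the error types that can lead to it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry.fault_type].append(entry.error_type)
    return dict(grouped)


# Placeholder names for illustration only.
policy = [
    PolicyEntry("FAULT_A", "ETYPE_TILE_HANG"),
    PolicyEntry("FAULT_A", "ETYPE_EXAMPLE"),
    PolicyEntry("FAULT_B", "ETYPE_EXAMPLE"),
]
rules = errors_by_fault_type(policy)
print(rules["FAULT_A"])  # ['ETYPE_TILE_HANG', 'ETYPE_EXAMPLE']
print(rules["FAULT_B"])  # ['ETYPE_EXAMPLE']
```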

GetSystemVirtualInventory

Retrieve information about virtual RDUs (vRDUs). A vRDU is the SR-IOV virtual function (VF) of an RDU.

Summary

The GetSystemVirtualInventory API retrieves information about vRDUs that have been provisioned against physical RDUs on the system.

A vRDU is the SR-IOV VF of an RDU. vRDUs can be provisioned on RDU PFs. Any VF RDUs that are currently provisioned will be addressable through this list.

When VF RDUs are provisioned, the corresponding PFs are still present, but the physical tiles on the PF that the VFs were provisioned from are in the functional state "virtualized" and are not usable through the PF. The PF remains present for management purposes and for fault and error telemetry reporting on the memory and PCIe resources.

Input

An object of type SystemVirtualInventoryQuery. This object has only one field - the request ID.

message SystemVirtualInventoryQuery{
    uint32 request_id = 1;
}

Returns

This API returns a stream of objects of type SystemVirtualInventoryData. Generally, users can treat a server-streaming API like an API that returns a list of indeterminate length. See the gRPC documentation on handling streams on the client side for details.

/**
 * This data structure represents a single vRDU
 * The GetSystemVirtualInventory API returns a stream of these
 */
message SystemVirtualInventoryData {
    repeated string error_details = 1;  // Error details for errors that occurred
                                        // during this query
    uint32 request_id = 2;              // The request ID provided by the caller
    string comp_name = 3;               // Component name of the vRDU (e.g. /NODE/VRDU_8)
    string serial_num = 4;              // Serial number of the PF that the VF belongs to
    string part_number = 5;             // Part number. This field is unused
    string fault_state = 6;             // Functional state that this vRDU is in
    string pf_name = 7;                 // Component name of the vRDU's parent
                                        // physical RDU - the RDU that this VF
                                        // was provisioned from
}
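
Since every vRDU entry names its parent PF in pf_name, the virtual inventory can be turned into a per-PF view. The following sketch shows one way to do this. VrduEntry is a hypothetical stand-in for SystemVirtualInventoryData, and the pf_name values are placeholders, because the real PF component name format is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class VrduEntry:
    """Hypothetical stand-in mirroring two SystemVirtualInventoryData fields."""
    comp_name: str
    pf_name: str


def vrdus_by_pf(entries: Iterable[VrduEntry]) -> Dict[str, List[str]]:
    """Map each parent PF name to the vRDU component names provisioned on it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry.pf_name].append(entry.comp_name)
    return dict(grouped)


# Placeholder PF names for illustration only.
inventory = [
    VrduEntry("/NODE/VRDU_8", "PF_A"),
    VrduEntry("/NODE/VRDU_9", "PF_A"),
    VrduEntry("/NODE/VRDU_10", "PF_B"),
]
by_pf = vrdus_by_pf(inventory)
print(by_pf["PF_A"])  # ['/NODE/VRDU_8', '/NODE/VRDU_9']
print(by_pf["PF_B"])  # ['/NODE/VRDU_10']
```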

GetSystemLocalPcieRoute

Retrieve a pairwise list of all the RDUs on this node and the number of routes between them.

Summary

The GetSystemLocalPcieRoute API returns a pairwise list of all the RDUs on this node and the number of routes between them via the PCIe local fabric. It is useful for checking communication bandwidth between any two RDUs.

This API always returns information about all RDUs.

Input

An object of type ResourceQuery.

Returns

/**
 * The return type for GetSystemLocalPcieRoute
 */
message SystemLocalPcieRouteData {
    repeated string error_details = 1;  // Error details, if applicable
    uint32 request_id = 2;              // Request ID, provided by caller
    repeated RduPairData rdu_pairs = 3; // List of directional pairs, described below

    SnmlNodeData port_status = 4;       // Contains the port statuses for all ports
                                        // on the system. For details on this data
                                        // structure, see the documentation for
                                        // GetOnlineResource and GetStaticResource
}

/**
 * This data structure represents an ordered pair of (src, dest) RDUs
 * that are connected by local PCIe within one system.
 */
message RduPairData {
    uint32 xrdu_src_id = 1;         // The source XRDU ID
    uint32 rdu_src_id = 2;          // The source RDU ID within the XRDU
    uint32 xrdu_dst_id = 3;         // The dest XRDU ID
    uint32 rdu_dst_id = 4;          // The dest RDU ID within the XRDU
    uint32 n_expected_routes = 5;   // Number of routes we expect on a fully healthy system
    uint32 n_actual_routes = 6;     // Number of actual routes available right now
    repeated LocalPcieRouteData routes = 7; // Details about each route
}

/**
 * This data structure provides detailed information
 * about one PCIe route in the RduPair
 */
message LocalPcieRouteData {
    uint32 src_pcie_id = 1; // The PCIe port ID on the source side
    uint32 dst_pcie_id = 2; // The PCIe port ID on the dest side
    bool route_up = 3;      // Is the route up?
}
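
One way to use the route counts is to flag RDU pairs that have fewer routes than a fully healthy system would have. The sketch below compares n_expected_routes with n_actual_routes. RduPair is a hypothetical stand-in for RduPairData (the routes list is omitted), the "XRDU0/RDU0" label format is illustrative only, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class RduPair:
    """Hypothetical stand-in for RduPairData (routes list omitted)."""
    xrdu_src_id: int
    rdu_src_id: int
    xrdu_dst_id: int
    rdu_dst_id: int
    n_expected_routes: int
    n_actual_routes: int


def degraded_pairs(pairs: Iterable[RduPair]) -> List[Tuple[str, str, int]]:
    """Return (src, dst, missing_route_count) for pairs below expectation."""
    result = []
    for p in pairs:
        missing = p.n_expected_routes - p.n_actual_routes
        if missing > 0:
            src = f"XRDU{p.xrdu_src_id}/RDU{p.rdu_src_id}"
            dst = f"XRDU{p.xrdu_dst_id}/RDU{p.rdu_dst_id}"
            result.append((src, dst, missing))
    return result


# Made-up pairs: the first is healthy, the second has lost one route.
pairs = [
    RduPair(0, 0, 0, 1, 2, 2),
    RduPair(0, 0, 1, 0, 2, 1),
]
print(degraded_pairs(pairs))  # [('XRDU0/RDU0', 'XRDU1/RDU0', 1)]
```

Because the pairs are directional, a degraded link may appear twice in the output, once for each direction.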