SNML API reference (all users)
SNML contains APIs that are of interest to users of DataScale systems. Some examples include:
- Checking the installed runtime version.
- Checking the operational state of RDUs.
- Requesting information about available and physically installed resources.
This reference gives details. See SambaNova Management Layer (SNML) for examples, discussion of the SNML gRPC server, and troubleshooting info.
GetSystemPlatformInfo
Retrieve the name of the DataScale platform you are currently using.
This API has changed in SambaFlow 1.17. Going forward, you get version info with the new GetSystemVersionInfo API.
Summary
The GetSystemPlatformInfo API retrieves the name of the DataScale platform you are currently using.
Returns
/**
* This data structure is returned by the GetSystemPlatformInfo API
* It contains information about what kind of DataScale system the
* SNML server is running on
*/
message SystemPlatformInfo {
repeated string error_details = 1; // Any error details, if an error occurred
// during the query
uint32 request_id = 2; // The request ID that the client provided
// in the input
string platform_name = 3; // A string representing the name of the
// DataScale platform
// e.g. DataScale SN10-8 or DataScale SN30-8
}
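Every response type in this reference carries an error_details list, so a caller would typically check it before trusting the rest of the payload. The following is a minimal Python sketch of that check. It uses a SimpleNamespace as a stand-in for the generated SystemPlatformInfo class, and the helper name platform_name_or_raise is hypothetical, not part of SNML.

```python
from types import SimpleNamespace


def platform_name_or_raise(info):
    """Return info.platform_name, or raise if the response reports errors.

    `info` is any object exposing the SystemPlatformInfo fields shown above.
    """
    if info.error_details:
        raise RuntimeError("; ".join(info.error_details))
    return info.platform_name


# Stand-in for a real response; field names follow the message definition.
example = SimpleNamespace(error_details=[], request_id=1,
                          platform_name="DataScale SN30-8")
print(platform_name_or_raise(example))
```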
GetRduOperationalStatus
Retrieve the operational state of RDUs.
Summary
The GetRduOperationalStatus API retrieves the operational states of RDUs.
Each RDU’s operational state is tracked internally by the SambaNova Runtime stack. The operational state reflects whether the RDU can currently be used to run applications. An RDU may be in a specific operational state due to physical presence, fault management policy, hardware issues, or other reasons. See Check the status of RDUs.
Input
/**
* The request type for operational state API
*/
message RDUOperationalStatusRequest {
uint32 request_id = 1; // The request ID
repeated uint32 rdu_ids = 2; // Each RDU has a component name like /NODE/XRDU_0/RDU_1
// It also has a unique 1-number identifier
// created from its component name.
// To calculate the 1-number identifier, multiply the
// (XRDU number) by the (number of RDUs per XRDU) [2]
// and add the RDU number within the XRDU
// This list reflects XRDU_0/RDU_0 and XRDU_0/RDU_1
}
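The ID calculation described in the comment above can be sketched in Python as follows. The function name rdu_id_from_component_name is hypothetical, and the default of 2 RDUs per XRDU is taken from the comment; adjust it if your platform differs.

```python
import re

RDUS_PER_XRDU = 2  # value quoted in the request message comment


def rdu_id_from_component_name(comp_name, rdus_per_xrdu=RDUS_PER_XRDU):
    """Convert a name like /NODE/XRDU_0/RDU_1 to its single-number RDU ID."""
    match = re.search(r"/XRDU_(\d+)/RDU_(\d+)", comp_name)
    if match is None:
        raise ValueError(f"not an RDU component name: {comp_name!r}")
    xrdu, rdu = (int(group) for group in match.groups())
    # (XRDU number) * (RDUs per XRDU) + (RDU number within the XRDU)
    return xrdu * rdus_per_xrdu + rdu


print(rdu_id_from_component_name("/NODE/XRDU_1/RDU_0"))
```

Because the pattern matches anywhere in the string, a child component name such as /NODE/XRDU_0/RDU_0/PCIE_1 resolves to the ID of its parent RDU.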
Returns
This API returns a stream of RDUOperationalStatus objects.
/**
* The return type for a single RDU's operational state
*/
message RDUOperationalStatus {
repeated string error_details = 1; // Any error details from errors that occurred
uint32 request_id = 2; // The request ID provided by the caller
uint32 rdu_id = 3; // The single number RDU ID described above
RDUOperationalState state = 4; // The RDU's current operational state
string comp_name = 5; // The RDU's component name
}
/**
* This enumerator represents an RDU's operational status
* Each value has a specific interpretation
*/
enum RDUOperationalState {
RDU_STATE_FUNCTIONAL = 0; // This RDU is fully healthy and can be used
RDU_STATE_PENDING = 1; // This RDU is currently unusable due to an operation
// being performed on it, such as a fault recovery event
// Continue polling for it to become FUNCTIONAL again.
RDU_STATE_DEGRADED = 2; // This RDU is usable but not in an optimal state.
// The SNFM framework will have more details on why
RDU_STATE_UNAVAILABLE = 3; // This RDU is unusable and all automatic recovery attempts
// have failed. Some human action is required to recover.
RDU_STATE_ENUM_MAX = 4; // Sentinel value
}
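As an illustration of how a client might act on these states, the sketch below mirrors the enumerator in a Python IntEnum and maps each state to a next step. The mirror class and the action strings are assumptions made for this example; a real client would use the enum generated from the SNML proto file.

```python
from enum import IntEnum


class RDUOperationalState(IntEnum):
    """Local mirror of the proto enumerator, for illustration only."""
    RDU_STATE_FUNCTIONAL = 0
    RDU_STATE_PENDING = 1
    RDU_STATE_DEGRADED = 2
    RDU_STATE_UNAVAILABLE = 3
    RDU_STATE_ENUM_MAX = 4


# Hypothetical action names; the meanings follow the enum comments above.
_ACTIONS = {
    RDUOperationalState.RDU_STATE_FUNCTIONAL: "use",
    RDUOperationalState.RDU_STATE_PENDING: "keep-polling",
    RDUOperationalState.RDU_STATE_DEGRADED: "use-and-check-snfm",
    RDUOperationalState.RDU_STATE_UNAVAILABLE: "needs-human-action",
}


def next_action(state):
    """Return the suggested next step for a state value or enum member."""
    state = RDUOperationalState(state)
    if state not in _ACTIONS:
        raise ValueError(f"{state.name} is a sentinel, not a real state")
    return _ACTIONS[state]


print(next_action(1))
```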
GetOnlineResource
Retrieve information about RDU resources that are currently online and managed by SambaNova Runtime.
Summary
The GetOnlineResource API is used for retrieving information about RDU resources that are currently online and managed by SambaNova Runtime. It returns a hierarchical data structure that represents the DataScale node it is describing.
- The GetOnlineResource API provides dynamic information about the node (for example, how much RDU device memory is currently free).
- Its sibling API, GetStaticResource, provides corresponding static information (for example, how many RDUs are physically installed in this system and what their serial numbers are).
The GetOnlineResource and GetStaticResource APIs share output data types, but many fields are populated by only one of the APIs.
The handling of RDU device memory has changed in SambaFlow 1.17. The data structures have a new hierarchical format, and some names have changed.
The SnmlVrduData type has been deprecated and the SnmlNodeData.vrdu field now consists of SnmlRduData objects.
Returns
/**
* SnmlNodeData is the top-level return type of the data structure for
* GetOnlineResource and GetStaticResource
* It represents a whole node
*/
message SnmlNodeData {
repeated string error_details = 1; // Error details (if the query failed)
uint32 request_id = 2; // Request ID for the API call
repeated SnmlXrduData xrdu = 3; // List of SnmlXrduData objects;
// one for each XRDU
repeated SnmlRduData vrdu = 4; // List of SnmlRduData objects;
// one for each present vRDU
}
/**
* This enumerator represents a component's inventory state:
* - ABSENT if it is physically absent or unenumerated by the DataScale node
* - PRESENT if it is physically enumerated by the DataScale node
* - VM_PRESENT if it is enumerated, but its RDU resources are provisioned to a VF
* (virtual function) or if SambaNova Runtime detected that it's assigned to a VM
* - UNKNOWN if the state cannot be checked. This is uncommon.
*/
enum InventoryState {
ABSENT = 0;
PRESENT = 1;
VM_PRESENT = 2;
UNKNOWN = 3;
}
/**
* This data structure stores information that is applicable to any
* type of component, and is a separate data structure to reduce duplication of fields.
*/
message ComponentInformation {
InventoryState inv_state = 1; // Every component has an inventory state,
// described in the InventoryState enumerator.
string ser_num = 2; // Many components have a serial number,
// which is returned as a string
// For components that do not have serial numbers
// (e.g. TILEs), the serial number is N/A
// This field is only populated by the
// GetStaticResource API
string name = 3; // Each component has a component name, which
// specifies where in the component hierarchy it is.
// For example, /NODE/XRDU_0/RDU_0/PCIE_1.
// This field is only populated by the
// GetStaticResource API
}
/**
* This data structure represents a single XRDU on a DataScale system
*/
message SnmlXrduData {
uint32 xrdu_id = 1; // The XRDU's ID - from the component name
repeated SnmlRduData rdu = 2; // List of RDUs that are physically present
// inside this XRDU
repeated SnmlSwitchData switch = 3;
// List of PCIe switches that are physically present
// inside this XRDU
}
/**
* This data structure represents an RDU within an RDU socket on an XRDU
*/
message SnmlRduData {
uint32 rdu_id = 1; // The RDU's socket ID within the XRDU
uint32 n_avail_tiles = 2; // Number of available tiles on this RDU
uint64 ddrmem_sz = 3; // DDR capacity of this RDU
repeated <Internal> <i> = 4; // Reserved
bool is_perfect_rdu = 5; // True if all tiles inside this RDU are healthy
repeated SnmlTileData tile = 6; // List of SnmlTileData objects;
// one for each tile on the RDU
repeated SnmlPcieData pcie = 7; // List of SnmlPcieData objects;
// one for each PCIe port on the RDU
SnmlMemoryData memory = 8; // Summary of the RDU's device memory
ComponentInformation info = 10; // Generic component information about the RDU
repeated SnmlBDF bdf = 11; // Multiple BDFs per RDU is possible
uint64 <Internal> <i> = 13; // Reserved
repeated uint32 hep_id = 14; // PCIe port ID of the host-facing PCIe endpoint
// on this RDU
repeated <Internal> <i> = 15; // Reserved
string topology = 16; // "PF" for PFs, or the VF's topology for VFs
}
/**
* This data structure represents a PCIe BDF identifier
*/
message SnmlBDF {
uint32 pci_bus = 1; // The PCIe bus number
uint32 pci_device = 2; // The PCIe device identifier
uint32 pci_function = 3; // The PCIe function identifier
}
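The three SnmlBDF fields map onto the conventional bus:device.function notation printed by tools such as lspci. A minimal sketch of that formatting follows; the helper name format_bdf is hypothetical, and the PCI domain (segment) is omitted because SnmlBDF does not carry one.

```python
def format_bdf(pci_bus, pci_device, pci_function):
    """Format BDF fields as lowercase hex 'bb:dd.f'."""
    # PCI allows 256 buses, 32 devices per bus, and 8 functions per device.
    if not (0 <= pci_bus < 256 and 0 <= pci_device < 32 and 0 <= pci_function < 8):
        raise ValueError("BDF value out of range")
    return f"{pci_bus:02x}:{pci_device:02x}.{pci_function:x}"


print(format_bdf(0x3B, 0, 0))
```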
/**
* This data structure represents an RDU tile inside an RDU
*/
message SnmlTileData {
uint32 tile_id = 1; // The tile's ID within the RDU it belongs to
SnmlGraphData graph_info = 2; // An optionally populated object representing
// any applications that are using this tile
ComponentInformation info = 3; // The generic information relating to this tile
}
/**
* This data structure represents a process that is
* using an RDU component
*/
message SnmlGraphData {
int32 graph_pid = 1; // The Linux PID of the application
}
/**
* This data structure represents a PCIe port or link
* either inside an RDU or a PCIe switch on a DataScale system
*/
message SnmlPcieData {
uint32 pcie_id = 1; // ID of this PCIe port relative to its parent component
uint32 bandwidth = 2; // Bandwidth in Gigabytes per second
// Bandwidth capacity for GetStaticResource
// and current bandwidth for GetOnlineResource
uint32 speed = 3; // The speed of the PCIe link in GT/sec
// Capacity speed for GetStaticResource and the current
// speed in GetOnlineResource
uint32 width = 4; // The PCIe link's width (number of lanes) capacity
// for GetStaticResource and current capacity
// for GetOnlineResource
ComponentInformation info = 5; // Generic component info for this PCIe port
}
/**
* This is the parent object that represents the different types of RDU device memory
*/
message SnmlRduMemoryData {
uint64 ddrmem_sz = 1; // The amount of DDR memory in this RDU
<Internal> <i> = 2; // Reserved
uint32 en_ddr_chs = 3; // A bitmap of enabled DDR channels
<Internal> <i> = 4; // Reserved
repeated <Internal> <i> = 5; // Reserved
repeated SnmlRduDdrData ddr_memory = 6; // A list of DDR objects for each DDR channel
}
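To turn the en_ddr_chs bitmap into a list of channel numbers, a client could use something like the sketch below. It assumes that bit N of the bitmap corresponds to DDR channel N (least significant bit is channel 0); this reference does not state the bit order, so verify that against your system. The function name is hypothetical.

```python
def enabled_ddr_channels(en_ddr_chs):
    """Return channel numbers whose bit is set (assumes bit N is channel N)."""
    return [bit for bit in range(en_ddr_chs.bit_length())
            if (en_ddr_chs >> bit) & 1]


print(enabled_ddr_channels(0b101))
```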
/**
* This object represents an RDU DDR channel
*/
message SnmlRduDdrData {
uint32 ddrch_num = 1; // The RDU's DDR channel ID, within the RDU
repeated SnmlDimmData dimm = 2; // A list of DIMMs that belong to that DDR channel
}
/**
* This object represents a DIMM that is inside a DDR channel
*/
message SnmlDimmData {
string dimm_name = 1; // The DIMM's name, like DIMM_M0
uint64 size = 2; // The size in bytes of the DIMM
ComponentInformation info = 3; // Generic ComponentInfo for the DIMM
bytes part_num = 4; // The DIMM's part number.
}
/**
* A vRDU is the name for a SR-IOV VF of an RDU. A vRDU can be provisioned on an RDU PF
* (physical function). Any VF RDUs that are currently provisioned are addressable
* through SnmlRduData.
*
* When VF RDUs are provisioned, the corresponding PFs are still present, but the
* corresponding physical tiles on the PF they were provisioned from are in the functional
* state "virtualized" and are not usable through the PF. The PF remains present for
* management purposes and for fault and error telemetry reporting on the memory/PCIe
* resources
*
* VF RDUs exist only in the GetOnlineResource context because they are inherently
* transient virtual devices. For GetStaticResource, this list is empty.
*
* Starting in SambaFlow 1.17, VFs are represented by SnmlRduData objects, just as
* PF RDUs are.
* The only difference is that the `topology` field contains a string representing
* the VF's topology, while for PFs, `topology` contains the string "PF".
*/
message SnmlRduData {
uint32 rdu_id = 1; // The RDU's socket ID within the XRDU
uint32 n_avail_tiles = 2; // Number of available tiles on this RDU
uint64 ddrmem_sz = 3; // DDR capacity of this RDU
repeated <Internal> <i> = 4; // Reserved
bool is_perfect_rdu = 5; // True if all tiles inside this RDU are healthy
repeated SnmlTileData tile = 6; // List of SnmlTileData objects;
// one for each tile on the RDU
repeated SnmlPcieData pcie = 7; // List of SnmlPcieData objects;
// one for each PCIe port on the RDU
SnmlMemoryData memory = 8; // Summary of the RDU's device memory
ComponentInformation info = 10; // Generic component information about the RDU
repeated SnmlBDF bdf = 11; // Multiple BDFs per RDU is possible
uint64 <Internal> <i> = 13; // Reserved
repeated uint32 hep_id = 14; // PCIe port ID of the host-facing PCIe endpoint
// on this RDU
repeated <Internal> <i> = 15; // Reserved
string topology = 16; // "PF" for PFs, or the VF's topology for VFs
}
/**
* This data structure represents a PCIe switch inside an XRDU
*/
message SnmlSwitchData {
uint32 switch_id = 1; // The switch's ID, taken from the component name
// /NODE/XRDU_0/SW_1 or SW_0
uint32 num_ports = 2; // Number of ports on the switch
ComponentInformation info = 3; // Generic component information for this switch
repeated SnmlPcieData pcie = 4; // List of SnmlPcieData objects;
// one for each port on the switch
}
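Because SnmlNodeData is hierarchical (node, then XRDUs, then RDUs), reading it usually means walking nested repeated fields. The Python sketch below shows one way to collect the available tile count of every RDU. It uses SimpleNamespace objects as stand-ins for the generated message classes, the sample values are invented for illustration, and the helper name is hypothetical.

```python
from types import SimpleNamespace as NS


def available_tiles_by_rdu(node):
    """Map (xrdu_id, rdu_id) to n_avail_tiles for every RDU in the node."""
    if node.error_details:
        raise RuntimeError("; ".join(node.error_details))
    return {(xrdu.xrdu_id, rdu.rdu_id): rdu.n_avail_tiles
            for xrdu in node.xrdu
            for rdu in xrdu.rdu}


# Invented sample shaped like SnmlNodeData: one XRDU holding two RDUs.
node = NS(error_details=[], request_id=1, vrdu=[], xrdu=[
    NS(xrdu_id=0, rdu=[NS(rdu_id=0, n_avail_tiles=4),
                       NS(rdu_id=1, n_avail_tiles=2)]),
])
tiles = available_tiles_by_rdu(node)
print(tiles, sum(tiles.values()))
```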
GetStaticResource
Retrieve information about RDU resources that are physically configured on this DataScale system.
Summary
The GetStaticResource API is used for retrieving information about RDU resources that are physically configured on this DataScale system. This API returns a hierarchical data structure that represents the DataScale node it is describing.
- The GetStaticResource API provides static information (for example, how many RDUs are physically installed in this system and what their serial numbers are).
- Its sibling API, GetOnlineResource, provides dynamic information about the node (for example, how much RDU device memory is currently free).
The GetOnlineResource and GetStaticResource APIs share output data types, but many fields are populated by only one of the APIs.
The handling of RDU device memory has changed in SambaFlow 1.17. The data structures have a new hierarchical format, and some names have changed.
The SnmlVrduData type has been deprecated and the SnmlNodeData.vrdu field now consists of SnmlRduData objects.
Returns
This function returns the following information. For details about the component hierarchy, see Naming of DataScale components in SNFM output.
/**
* SnmlNodeData is the top-level return type of the data structure for
* GetOnlineResource and GetStaticResource
* It represents a whole node
*/
message SnmlNodeData {
repeated string error_details = 1; // Any error details (if the query failed)
uint32 request_id = 2; // The request ID for the API call
repeated SnmlXrduData xrdu = 3; // List of SnmlXrduData objects; one for each XRDU
repeated SnmlRduData vrdu = 4; // List of SnmlRduData objects;
// one for each present vRDU
}
/**
* This enumerator represents a component's Inventory state.
* A component's inventory state is:
* - ABSENT if it is physically absent or unenumerated by the DataScale node
* - PRESENT if it is physically enumerated by the DataScale node
* - VM_PRESENT if it is enumerated, but its RDU resources are provisioned to a VF,
* or if SambaNova Runtime has detected that it is assigned to a virtual machine
* - UNKNOWN if the state cannot be checked. This is uncommon.
*/
enum InventoryState {
ABSENT = 0;
PRESENT = 1;
VM_PRESENT = 2;
UNKNOWN = 3;
}
/**
* This data structure stores information that is applicable to any
* type of component, and is a separate data structure to reduce duplication of fields.
*/
message ComponentInformation {
InventoryState inv_state = 1;
// Every component has an inventory state, as described in the
// InventoryState enumerator.
string ser_num = 2; // Many components have a serial number, which is returned as
// a string
// For components that do not have serial numbers (like TILEs),
// the serial number is N/A
// This field is populated only by the GetStaticResource API
string name = 3; // Each component has a component name, which specifies where
// in the component hierarchy it is
// (e.g. /NODE/XRDU_0/RDU_0/PCIE_1.)
// This field is populated only by the GetStaticResource API
}
/**
* This data structure represents a single XRDU on a DataScale system
*/
message SnmlXrduData {
uint32 xrdu_id = 1; // The XRDU's ID - from the component name
repeated SnmlRduData rdu = 2; // List of RDUs that are physically present inside this XRDU
repeated SnmlSwitchData switch = 3;
// List of PCIe switches that are physically
// present inside this XRDU
}
/**
* This data structure represents an RDU within an RDU socket on an XRDU
*/
message SnmlRduData {
uint32 rdu_id = 1; // The RDU's socket ID within the XRDU
uint32 n_avail_tiles = 2; // Number of available tiles on this RDU
uint64 ddrmem_sz = 3; // DDR capacity of this RDU
repeated <Internal> <i> = 4; // Reserved
bool is_perfect_rdu = 5; // True if all tiles inside this RDU are healthy
repeated SnmlTileData tile = 6; // List of SnmlTileData objects;
// one for each tile on the RDU
repeated SnmlPcieData pcie = 7; // List of SnmlPcieData objects;
// one for each PCIe port on the RDU
SnmlMemoryData memory = 8; // Summary of the RDU's device memory
ComponentInformation info = 10; // Generic component information about the RDU
repeated SnmlBDF bdf = 11; // Multiple BDFs per RDU is possible
uint64 <Internal> <i> = 13; // Reserved
repeated uint32 hep_id = 14; // PCIe port ID of the host-facing PCIe endpoint
// on this RDU
repeated <Internal> <i> = 15; // Reserved
string topology = 16; // "PF" for PFs, or the VF's topology for VFs
}
/**
* This data structure represents a PCIe BDF identifier
*/
message SnmlBDF {
uint32 pci_bus = 1; // The PCIe bus number
uint32 pci_device = 2; // The PCIe device identifier
uint32 pci_function = 3; // The PCIe function identifier
}
/**
* This data structure represents an RDU tile inside an RDU
*/
message SnmlTileData {
uint32 tile_id = 1; // The tile's ID within the RDU it belongs to
SnmlGraphData graph_info = 2; // An optionally populated object representing
// any applications that are using this tile
ComponentInformation info = 3; // The generic information relating to this tile
}
/**
* This data structure represents a process that is
* using an RDU component
*/
message SnmlGraphData {
int32 graph_pid = 1; // The Linux PID of the application
}
/**
* This data structure represents a PCIe port or link
* either inside an RDU or a PCIe switch on a DataScale system
*/
message SnmlPcieData {
uint32 pcie_id = 1; // ID of this PCIe port relative to its parent component
uint32 bandwidth = 2; // Bandwidth in Gigabytes per second
// Bandwidth capacity for GetStaticResource and current
// bandwidth for GetOnlineResource
uint32 speed = 3; // Speed of the PCIe link in GT/sec
// Capacity speed for GetStaticResource and the current
// speed in GetOnlineResource
uint32 width = 4; // PCIe link's width (number of lanes) capacity for
// GetStaticResource and current for GetOnlineResource
ComponentInformation info = 5; // Generic component info for this PCIe port
}
/**
* This is the parent object that represents the different types of RDU device memory
*/
message SnmlRduMemoryData {
uint64 ddrmem_sz = 1; // The amount of DDR memory in this RDU
<Internal> <i> = 2; // Reserved
uint32 en_ddr_chs = 3; // A bitmap of enabled DDR channels
<Internal> <i> = 4; // Reserved
repeated <Internal> <i> = 5; // Reserved
repeated SnmlRduDdrData ddr_memory = 6; // A list of DDR objects for each DDR channel
}
/**
* This object represents an RDU DDR channel
*/
message SnmlRduDdrData {
uint32 ddrch_num = 1; // The RDU's DDR channel ID within the RDU
repeated SnmlDimmData dimm = 2; // A list of DIMMs that belong to that DDR channel
}
/**
* This object represents a DIMM that is inside a DDR channel
*/
message SnmlDimmData {
string dimm_name = 1; // The DIMM's name, like DIMM_M0
uint64 size = 2; // The size in bytes of the DIMM
ComponentInformation info = 3; // Generic ComponentInfo pertaining to the DIMM
bytes part_num = 4; // The DIMM's part number.
}
/**
* A vRDU is the name for a SR-IOV VF of an RDU. A vRDU can be provisioned on an RDU PF
* (physical function). Any VF RDUs that are currently provisioned are addressable
* through SnmlRduData.
*
* When VF RDUs are provisioned, the corresponding PFs are still present, but the
* corresponding physical tiles on the PF they were provisioned from are in the functional
* state "virtualized" and are not usable through the PF. The PF remains present for
* management purposes and for fault and error telemetry reporting on the memory/PCIe
* resources
*
* VF RDUs exist only in the GetOnlineResource context because they are inherently
* transient virtual devices. For GetStaticResource, this list is empty.
*
* Starting in SambaFlow 1.17, VFs are represented by SnmlRduData objects, just as
* PF RDUs are.
* The only difference is that the `topology` field contains a string representing
* the VF's topology, while for PFs, it contains the string "PF".
*/
message SnmlRduData {
uint32 rdu_id = 1; // The RDU's socket ID within the XRDU
uint32 n_avail_tiles = 2; // Number of available tiles on this RDU
uint64 ddrmem_sz = 3; // DDR capacity of this RDU
repeated <Internal> <i> = 4; // Reserved
bool is_perfect_rdu = 5; // True if all tiles inside this RDU are healthy
repeated SnmlTileData tile = 6; // List of SnmlTileData objects;
// one for each tile on the RDU
repeated SnmlPcieData pcie = 7; // List of SnmlPcieData objects;
// one for each PCIe port on the RDU
SnmlMemoryData memory = 8; // Summary of the RDU's device memory
ComponentInformation info = 10; // Generic component information about the RDU
repeated SnmlBDF bdf = 11; // Multiple BDFs per RDU is possible
uint64 <Internal> <i> = 13; // Reserved
repeated uint32 hep_id = 14; // PCIe port ID of the host-facing PCIe endpoint
// on this RDU
repeated <Internal> <i> = 15; // Reserved
string topology = 16; // "PF" for PFs, or the VF's topology for VFs
}
/**
* This data structure represents a PCIe switch inside an XRDU
*/
message SnmlSwitchData {
uint32 switch_id = 1; // The switch's ID, taken from the component name
// e.g. /NODE/XRDU_0/SW_1 or SW_0
uint32 num_ports = 2; // The number of ports on the switch
ComponentInformation info = 3; // Generic component information pertaining to this switch
repeated SnmlPcieData pcie = 4; // A list of SnmlPcieData objects;
// one for each port on the switch
}
GetSystemFaultState
Summary
The GetSystemFaultState API retrieves information about inventory components on the system (physical components of the DataScale node) and the functional state of each. In contrast with the GetSystemFaults API (formerly GetSystemFaultLog), which returns information about every diagnosed fault, this API reports functional state per component. It returns an entry for a component only if that component's state is not ONLINE. If the GetSystemFaultState API returns an empty list, all components on the system are in a healthy state.
Returns
This API returns a stream of objects of type SystemFaultState. Generally, you can treat a server-streaming API like this one as an API that returns a list of indeterminate length. See the gRPC documentation on handling streams on the client side for details.
/**
* This data structure represents a single component
* The GetSystemFaultState API returns a stream of these
*/
message SystemFaultState {
repeated string error_details = 1; // Error details for errors that occurred
// during this query
uint32 request_id = 2; // The request ID provided by the caller
string comp_name = 3; // Name of the component this entry describes
string serial_num = 4; // Serial number of the component, if applicable
string fault_state = 6; // Functional state that this component is in
}
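Since only components that are not ONLINE appear in the result, an empty result means the system is healthy, and a non-empty one is often easiest to read grouped by state. The following Python sketch shows that grouping. It accepts any iterable of objects with the SystemFaultState fields; the helper name and the state string in the sample data are placeholders invented for this example, not real SNFM state names.

```python
from collections import defaultdict
from types import SimpleNamespace as NS


def components_by_fault_state(entries):
    """Group component names by fault_state; an empty dict means all healthy."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.fault_state].append(entry.comp_name)
    return dict(grouped)


# Placeholder entries shaped like SystemFaultState messages.
sample = [
    NS(comp_name="/NODE/XRDU_0/RDU_0", fault_state="EXAMPLE_STATE"),
    NS(comp_name="/NODE/XRDU_0/RDU_1", fault_state="EXAMPLE_STATE"),
]
print(components_by_fault_state(sample) or "all components healthy")
```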
GetSystemFaults
Retrieve the system fault information from the SambaNova Fault Management (SNFM) framework.
In 1.17, this API was renamed from GetSystemFaultLog to GetSystemFaults.
Summary
The GetSystemFaults API retrieves the system fault information from the SambaNova Fault Management (SNFM) framework. The SNFM framework captures telemetry about faults, errors, and the system’s physical and virtual inventory, and caches historical and current information for user consumption.
The API returns all diagnosed faults that currently exist on this system, including active faults and faults that have been cleared automatically by SambaNova Runtime or manually by a system administrator.
A hardware fault is different from a hardware event:
- A fault means that the SNFM framework diagnosed a hardware component as faulty because enough events occurred on it to make SNFM suspect that something is wrong with that hardware.
- An event is simply something that occurred in hardware.
Input
An object of type SystemFaultQuery. This object selects which entries in the database are returned to the caller.
If you provide a default object (for example, if you create a Python SystemFaultQuery with the code SystemFaultQuery()), all entries are returned.
/**
* A query object for the System Fault Log
* Used for the caller to specify what kinds of entries should be returned
*/
message SystemFaultQuery {
uint32 request_id = 1; // The request ID, for caller's consumption
uint64 start_timestamp = 2; // The starting timestamp of events to return.
// This is a UNIX timestamp, and all events from after this
// timestamp will be returned.
uint64 end_timestamp = 3; // The ending timestamp of events to return.
// This is a UNIX timestamp, and all events from before this
// timestamp will be returned.
string fault_uuid = 4; // A UUID identifying a specific error event. If this field
// is set, only that error is returned.
string comp_name = 5; // Component name for requesting only faults on
// a single component
string fault_type = 6; // Fault type if only a single fault type is of interest
string fault_state = 7; // Fault state of the component affected by this fault
string err_uuid = 8; // UUID of the error that caused this fault to be diagnosed
string severity = 9; // Severity of this fault CRITICAL, FATAL, etc.
bool vrdu_database = 10; // Set this flag to True if you want information about
// faults that occurred on vRDUs instead of physical RDUs
}
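A common use of start_timestamp and end_timestamp is to ask for entries from a recent time window. The sketch below computes the two values for "the last N hours". It assumes the timestamps are UNIX time in seconds, which this reference does not state explicitly, and the helper name is hypothetical. The resulting dictionary uses the field names from the message above, so it could be passed as keyword arguments when constructing the query object.

```python
import time


def last_hours_window(hours, now=None):
    """Return start/end UNIX timestamps (seconds, assumed) for a recent window."""
    end = int(time.time()) if now is None else int(now)
    start = max(0, end - int(hours * 3600))  # uint64 fields cannot be negative
    return {"start_timestamp": start, "end_timestamp": end}


print(last_hours_window(24))
```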
Returns
This API returns a stream of objects of type SystemFault. Generally, you can treat a server-streaming API like this one as an API that returns a list of indeterminate length. See the gRPC documentation on handling streams on the client side for details.
For more information about the data returned by this API, see the documentation for SNFM and the System Policy Data information retrievable through SNFADM or the SNML GetSystemPolicyInfo API.
/**
* This data structure represents a single SNFM Fault Log entry
* The GetSystemFaults API returns a stream of events of this type
*/
message SystemFault {
uint32 request_id = 1; // The request ID provided by the caller
fixed64 timestamp = 2; // The timestamp when this fault was diagnosed
// This is a UNIX timestamp
string fault_uuid = 3; // The UUID identifying this fault
string severity = 4; // Severity of this fault CRITICAL, FATAL, etc.
string comp_name = 5; // Component name affected by this fault
bytes comp_ser_num = 6; // Serial number of the component affected by
// this fault
string fault_desc = 7; // Human-readable fault description with more
// details on what this fault means
string recovery_act = 8; // A human-readable recovery action that administrators
// can take to recover from this fault
string fault_type = 9; // Name of this fault type
string err_uuid = 5; // UUID identifying the error event that caused
// this fault to be diagnosed
fixed64 cleared_timestamp = 12; // The timestamp that this fault was marked cleared,
// if it is cleared. Zero means the fault is currently
// active. This is a Unix timestamp.
repeated string error_details = 13; // Error details, if any error occurred
// while constructing this response
}
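The cleared_timestamp field distinguishes active faults (zero) from cleared ones (non-zero). As a small illustration, the sketch below splits a sequence of fault entries on that field. It works on any objects with a cleared_timestamp attribute; the helper name and the sample values are invented for this example.

```python
from types import SimpleNamespace as NS


def split_active_and_cleared(faults):
    """Return (active, cleared) lists based on cleared_timestamp == 0."""
    active, cleared = [], []
    for fault in faults:
        # Zero means the fault is still active, per the field comment above.
        (cleared if fault.cleared_timestamp else active).append(fault)
    return active, cleared


sample = [
    NS(fault_uuid="example-a", cleared_timestamp=0),
    NS(fault_uuid="example-b", cleared_timestamp=1_700_000_000),
]
active, cleared = split_active_and_cleared(sample)
print(len(active), "active,", len(cleared), "cleared")
```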
GetSystemErrors
Retrieve system events from the SambaNova Fault Management (SNFM) framework.
This API was renamed in 1.17, from GetSystemErrorLog to GetSystemErrors.
Summary
The GetSystemErrors API returns the hardware events (mostly error events) that have occurred on the system since it was configured or since the log was manually refreshed. A hardware event is different from a hardware fault:
- An event is simply something that occurred in hardware.
- A fault, on the other hand, is a decision made by the SNFM framework to diagnose a certain hardware component as faulty because enough events occurred on it to make SNFM suspect that something is wrong with that hardware.
Input
An object of type SystemErrorQuery. This object selects which entries in the database are returned to the caller.
If you provide a default object (for example, if you create a Python SystemErrorQuery with the code SystemErrorQuery()), all entries are returned.
/**
* A query object for the System Error Log
* Used for the caller to specify what kinds of entries should be returned
*/
message SystemErrorQuery {
uint32 request_id = 1; // The request ID, for caller's consumption
uint64 start_timestamp = 2; // The starting timestamp of events to return.
// This is a UNIX timestamp, and all events from after this
// timestamp will be returned.
uint64 end_timestamp = 3; // The ending timestamp of events to return.
// This is a UNIX timestamp, and all events from before this
// timestamp will be returned.
string err_uuid = 4; // A UUID identifying a specific error event. If this field
// is set, only that error will be returned.
uint32 err_count_min = 5; // The bottom of a range of error counts to consider
uint32 err_count_max = 6; // The top of a range of error counts to consider
string err_type = 7; // The string of the name of a specific type of error
// for example, ETYPE_TILE_HANG.
// If passed, only those events will be returned
string comp_name = 8; // Component name for specifying only errors on
// a single component
string fault_uuid = 9; // A UUID identifying a specific diagnosed fault.
// If this field is set, only errors that led to that fault
// being diagnosed will be returned.
bool vrdu_database = 10; // Set this flag to True if you want information about
// errors that occurred on vRDUs instead of physical RDUs
}
Returns
This API returns a stream of objects of type SystemError. Generally, you can treat a server-streaming API like this one as an API that returns a list of indeterminate length. See the gRPC documentation on handling streams on the client side for details.
For more information about the data returned by this API, see the documentation for SNFM and the System Policy Data information retrievable through SNFADM or the SNML GetSystemPolicyInfo API.
/**
* This data structure represents a single SNFM Error Log entry
* The GetSystemErrors API returns a stream of events of this type
*/
message SystemErrorLog {
repeated string error_details = 1; // Error details, if any error occurred
// while constructing this response
uint32 request_id = 2; // Request ID provided by the caller
fixed64 orig_timestamp = 3; // First timestamp this error occurred
// This is a UNIX timestamp
fixed64 last_timestamp = 4; // Most recent timestamp this error occurred
// This is a UNIX timestamp
string err_uuid = 5; // UUID identifying this error event
uint32 err_count = 6; // Number of errors of this type that were
// recorded between the times specified
// in the two timestamps above
string err_type = 7; // Error type that this entry reflects
string err_data = 8; // Architecture-specific error data
string err_type_desc = 9; // Human-readable error type description with
// details on what this error means
string comp_name = 10; // Component name this error occurred on
string fault_uuid = 11; // Fault UUID if this error led to a fault
// being diagnosed. For errors that did not
// lead to a fault, this will be all zeros
}
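The fault_uuid field is all zeros when an error did not lead to a diagnosed fault. A client that wants only the errors linked to a fault could test for that as sketched below. The exact text form of the all-zero UUID is an assumption here, so the check ignores hyphens and also treats an empty string as "no fault". The helper name and the non-zero sample UUID are invented for illustration.

```python
def led_to_fault(fault_uuid):
    """Return True if fault_uuid is set and is not an all-zero UUID."""
    digits = fault_uuid.replace("-", "")
    return bool(digits) and set(digits) != {"0"}


print(led_to_fault("00000000-0000-0000-0000-000000000000"))
```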
GetHangRecoveryHistory
Summary
The GetHangRecoveryHistory API retrieves all recorded hang-recovery events that have happened on the RDU or RDUs specified in the query.
The SNFM framework tracks both successful and unsuccessful hang recoveries and returns information about these events through this API.
Input
/**
* This is the request type for the GetHangRecoveryHistory API
* The rdu_ids field specifies which RDUs to return information about
*/
message HangRecoveryHistoryRequest {
uint32 request_id = 1; // Request ID
repeated uint32 rdu_ids = 2; // List of RDU IDs to query about
}
Returns
This API returns a stream of HangRecoveryHistory objects, one for each RDU in the request list.
/**
* This data structure represents the entire hang recovery history for one RDU
*/
message HangRecoveryHistory {
repeated string error_details = 1; // Error details, if applicable
uint32 request_id = 2; // Request ID provided by caller
uint32 rdu_id = 3; // The RDU ID that this object pertains to
repeated HangRecoveryEvent hang_recovery_log = 4;
// A list of HangRecoveryEvents
// that occurred on this RDU
}
/**
* This data structure represents a single hang-recovery event
*/
message HangRecoveryEvent{
uint32 rdu_id = 3; // The RDU that was hang-recovered
fixed64 timestamp = 4; // UNIX timestamp when the event occurred
HangRecoveryEventType event_type = 6; // Kind of hang recovery event
HangRecoveryOutcome outcome = 7; // Outcome of the hang recovery event
string comp_name = 8; // Component name of the affected RDU or tile
}
/**
* This enumerator specifies the different kinds of hang recovery events
*/
enum HangRecoveryEventType {
TILE_RESET = 0; // A tile-level reset, on a single RDU tile
CHIP_RESET = 1; // A chip-level reset, applying to a group of 4 tiles
OTHER = 2; // Something else, like a DC power cycle, for example
HANG_RECOVERY_ENUM_MAX = 3; // Sentinel value
}
/**
* This enumerator represents the outcome of a
* hang-recovery event
*/
enum HangRecoveryOutcome {
IN_PROGRESS = 0; // The event is currently in progress
SUCCEEDED = 1; // The hang recovery was successful
FAILED = 2; // The hang recovery failed and the tile was diagnosed as faulty
PENDING = 3;
}
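As an illustration of how the hang-recovery data might be consumed, the sketch below counts events per RDU and outcome. The IntEnum mirrors the HangRecoveryOutcome values listed above; the dataclass is a simplified, hypothetical stand-in for the generated HangRecoveryEvent message, and the sample events are invented.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class HangRecoveryOutcome(IntEnum):
    """Mirror of the HangRecoveryOutcome enum values from the reference."""
    IN_PROGRESS = 0
    SUCCEEDED = 1
    FAILED = 2
    PENDING = 3


@dataclass
class HangRecoveryEvent:
    """Hypothetical stand-in for the generated protobuf message."""
    rdu_id: int
    timestamp: int
    outcome: int
    comp_name: str


def count_outcomes(events):
    """Count events keyed by (rdu_id, outcome name)."""
    counts = Counter()
    for event in events:
        # Converting through the enum also accepts raw integer values.
        outcome_name = HangRecoveryOutcome(event.outcome).name
        counts[(event.rdu_id, outcome_name)] += 1
    return counts


# Invented sample data for illustration only.
events = [
    HangRecoveryEvent(0, 1700000000, HangRecoveryOutcome.SUCCEEDED, "/NODE/XRDU_0/RDU_0"),
    HangRecoveryEvent(0, 1700000500, 2, "/NODE/XRDU_0/RDU_0"),
    HangRecoveryEvent(1, 1700001000, HangRecoveryOutcome.SUCCEEDED, "/NODE/XRDU_0/RDU_1"),
]
for key, count in sorted(count_outcomes(events).items()):
    print(key, count)
```

A count of FAILED outcomes per RDU is one possible signal to correlate with the fault information that SNFM reports.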
GetSystemVersionInfo
Retrieve information about the system and the currently installed SambaNova Runtime software on the system.
This API has changed in SambaFlow 1.17. It has more fields, and the name has changed from GetRuntimeVersion to GetSystemVersionInfo.
Summary
The GetSystemVersionInfo API retrieves information about the system and the currently installed SambaNova Runtime software. It takes an object of type ResourceQuery with no required fields as input and returns an object of type SystemVersionInfo. The SystemVersionInfo object contains the versions of several components on the system.
-
The SambaFlow version field represents the version of sambaflow, sambanova-runtime, and any other SambaFlow-related packages that are installed on the system.
-
The other version numbers reflect semantic versions of interfaces that are supported by the currently installed SambaNova Runtime stack.
-
The PEF version reflects the version of PEF that the SambaNova Runtime package was built against. SambaNova Runtime supports PEFs generated against any PEF version compatible with that version (following semantic versioning rules). If a PEF is not compatible, either upgrade SambaNova Runtime or recompile the PEF, depending on your situation.
-
The SNML and SNML Admin versions show semantic versions for the two SNML services.
Returns
/**
* The SemanticVersion data structure represents a 3-part semantic version
* where the effects of version changes meet the SemVer standards
*/
message SemanticVersion {
uint32 major = 1;
uint32 minor = 2;
uint32 patch = 3;
}
message SnfmVersionInfo {
SemanticVersion error_log_version = 1;
SemanticVersion fault_log_version = 2;
SemanticVersion inv_log_version = 3;
SemanticVersion policy_log_version = 4;
}
/**
* This is the output type for the GetSystemVersionInfo API
*/
message SystemVersionInfo{
repeated string error_details = 1;
uint32 request_id = 2;
string sambaflow_version = 3;
uint64 runtime_if_version = 4; // User/kernel version
SemanticVersion samba_runtime_version = 5; // samba/runtime version
SemanticVersion pef_version = 6; // PEF version supported by runtime
SemanticVersion snml_version = 7;
SemanticVersion snml_priv_version = 8;
SemanticVersion snml_admin_version = 9;
SemanticVersion snml_virt_version = 10;
SemanticVersion rduc_version = 11;
SemanticVersion bmc_version = 12;
SnfmVersionInfo snfm_version = 13;
SemanticVersion runtime_version = 14; // Overall runtime version
}
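The PEF compatibility note in the summary can be illustrated with a small check. This is a sketch under a stated assumption, not the rule that SambaNova Runtime actually applies: it treats a PEF as supported when the major versions match and the PEF's minor version is not newer than the runtime's pef_version. The NamedTuple is a hypothetical stand-in for the SemanticVersion message, and the function name is invented.

```python
from typing import NamedTuple


class SemanticVersion(NamedTuple):
    """Stand-in for the SemanticVersion message (major, minor, patch)."""
    major: int
    minor: int
    patch: int


def pef_is_supported(runtime_pef: SemanticVersion, pef_built_against: SemanticVersion) -> bool:
    """Return True if the PEF looks compatible under the assumed SemVer rule."""
    # Assumed rule: a different major version signals a breaking change.
    if pef_built_against.major != runtime_pef.major:
        return False
    # Assumed rule: the runtime cannot know features from a newer minor version.
    # Patch versions are ignored because they should not change the interface.
    return pef_built_against.minor <= runtime_pef.minor


runtime_pef = SemanticVersion(2, 3, 0)  # invented value for illustration
print(pef_is_supported(runtime_pef, SemanticVersion(2, 1, 5)))
print(pef_is_supported(runtime_pef, SemanticVersion(3, 0, 0)))
```

When a check like this fails, the options named in the summary apply: upgrade SambaNova Runtime or recompile the PEF.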
GetSystemPolicyInfo
Retrieve the system’s fault-diagnosis policy from the SambaNova Fault Management (SNFM) framework.
Summary
The GetSystemPolicyInfo
API retrieves the system’s fault-diagnosis policy from the SambaNova Fault Management (SNFM) framework. The SNFM framework captures telemetry about faults, errors, and the system’s physical and virtual inventory, and caches historical and current information for users' consumption. See SambaNova Fault Management (SNFM) for background information.
The system fault policy is used to decide how errors and events that occur on the system should lead to SNFM faults being diagnosed. This API allows users to inspect the rules that SNFM uses to diagnose faults.
Returns
This API returns a stream of objects of type SystemPolicyInfo. Generally, users can treat a streaming API like this one as an API that returns a list of indeterminate length. See the gRPC documentation on client-side stream handling for details.
For more information about the data returned by this API, see SambaNova Fault Management (SNFM).
/**
* This data structure represents a single system fault policy entry
* The GetSystemPolicyInfo API returns a stream of events of this type
*/
message SystemPolicyInfo {
uint32 request_id = 1; // Request ID provided by the caller
string fault_type = 2; // Type of fault that this policy relates to
string error_type = 3; // Type of error that this policy will monitor
// to diagnose faults of the type specified above
string action = 4; // Action that SNFM takes on the faulted component
// when this fault is diagnosed
string severity = 5; // The severity of this fault
string fault_desc = 6; // Human readable brief description of this fault
string fault_detail_desc = 7; // Human readable detailed description of this fault
string recovery_act = 8; // Recovery action for this kind of fault
repeated string error_details = 9; // Architecture-specific error details
}
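Because the response arrives as a stream, a client typically iterates over it until it is exhausted. The sketch below groups policy entries by fault type; the generator is a hypothetical stand-in for the stream that a gRPC call would return, and the fault and error type strings are invented for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def index_policies(policy_stream):
    """Group the monitored error types by the fault type they can diagnose."""
    by_fault = defaultdict(list)
    # The stream has no known length up front, so consume it item by item.
    for policy in policy_stream:
        by_fault[policy.fault_type].append(policy.error_type)
    return dict(by_fault)


def fake_policy_stream():
    """Hypothetical stand-in for the streamed SystemPolicyInfo responses."""
    yield SimpleNamespace(fault_type="fault.example.a", error_type="error.example.x")
    yield SimpleNamespace(fault_type="fault.example.a", error_type="error.example.y")
    yield SimpleNamespace(fault_type="fault.example.b", error_type="error.example.z")


policies = index_policies(fake_policy_stream())
for fault_type, error_types in policies.items():
    print(fault_type, error_types)
```

The same loop should apply to any iterable of objects with fault_type and error_type attributes.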
GetSystemVirtualInventory
Retrieve information about virtual RDUs (vRDUs). A vRDU is the SR-IOV virtual function (VF) of an RDU.
Summary
The GetSystemVirtualInventory
API retrieves information about vRDUs that have been provisioned against physical RDUs on the system.
A vRDU is the SR-IOV VF of an RDU. vRDUs can be provisioned on RDU physical functions (PFs). Any VF RDUs that are currently provisioned are addressable through this list.
When VF RDUs are provisioned, the corresponding PFs remain present, but the physical tiles on the PF that the VFs were provisioned from are in the functional state "virtualized" and are not usable through the PF. The PF remains present for management purposes and for fault and error telemetry reporting on the memory and PCIe resources.
Input
An object of type SystemVirtualInventoryQuery. This object has only one field: the request ID.
message SystemVirtualInventoryQuery{
uint32 request_id = 1;
}
Returns
This API returns a stream of objects of type SystemVirtualInventoryData. Generally, users can treat a streaming API like this one as an API that returns a list of indeterminate length. See the gRPC documentation on client-side stream handling for details.
/**
* This data structure represents a single vRDU
* The GetSystemVirtualInventory API returns a stream of these
*/
message SystemVirtualInventoryData {
repeated string error_details = 1; // Error details for errors that occurred
// during this query
uint32 request_id = 2; // The request ID provided by the caller
string comp_name = 3; // Component name of the vRDU (e.g. /NODE/VRDU_8)
string serial_num = 4; // Serial number of the PF that the VF belongs to
string part_number = 5; // Part number. This field is unused
string fault_state = 6; // Functional state that this vRDU is in
string pf_name = 7; // Component name of the vRDU's parent
// physical RDU - the RDU that this VF
// was provisioned from
}
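One common use of this data is to see which vRDUs belong to which physical RDU. The sketch below builds that mapping from the comp_name and pf_name fields. The SimpleNamespace entries are hypothetical stand-ins for the streamed SystemVirtualInventoryData objects, and the component names and state strings are invented examples.

```python
from types import SimpleNamespace


def vrdus_by_parent(inventory_stream):
    """Map each parent PF component name to the vRDUs provisioned from it."""
    mapping = {}
    for vrdu in inventory_stream:
        # setdefault creates the list the first time a parent PF is seen.
        mapping.setdefault(vrdu.pf_name, []).append(vrdu.comp_name)
    return mapping


# Invented sample entries for illustration only.
inventory = [
    SimpleNamespace(comp_name="/NODE/VRDU_8", pf_name="/NODE/XRDU_0/RDU_0", fault_state="example"),
    SimpleNamespace(comp_name="/NODE/VRDU_9", pf_name="/NODE/XRDU_0/RDU_0", fault_state="example"),
    SimpleNamespace(comp_name="/NODE/VRDU_10", pf_name="/NODE/XRDU_0/RDU_1", fault_state="example"),
]
for pf_name, vrdu_names in vrdus_by_parent(inventory).items():
    print(pf_name, vrdu_names)
```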
GetSystemLocalPcieRoute
Returns a pairwise list of all the RDUs on this node and the number of routes between them.
Summary
The GetSystemLocalPcieRoute
API returns a pairwise list of all the RDUs on this node and the number of routes between them via the PCIe local fabric. It is useful for checking communication bandwidth between any two RDUs.
This API always returns information about all RDUs.
Returns
/**
* The return type for the GetSystemLocalPcieRoute API
*/
message SystemLocalPcieRouteData {
repeated string error_details = 1; // Error details, if applicable
uint32 request_id = 2; // Request ID, provided by caller
repeated RduPairData rdu_pairs = 3; // List of directional pairs, described below
SnmlNodeData port_status = 4; // Contains the port statuses for all ports
// on the system. For details on this data
// structure, see the documentation for
// GetOnlineResource and GetStaticResource
}
/**
* This data structure represents an ordered pair of (src, dest) RDUs
* that are connected by local PCIe within one system.
*/
message RduPairData {
uint32 xrdu_src_id = 1; // The source XRDU ID
uint32 rdu_src_id = 2; // The source RDU ID within the XRDU
uint32 xrdu_dst_id = 3; // The dest XRDU ID
uint32 rdu_dst_id = 4; // The dest RDU ID within the XRDU
uint32 n_expected_routes = 5; // Number of routes we expect on a fully healthy system
uint32 n_actual_routes = 6; // Number of actual routes available right now
repeated LocalPcieRouteData routes = 7; // Details about each route
}
/**
* This data structure provides detailed information
* about one PCIe route in the RduPair
*/
message LocalPcieRouteData {
uint32 src_pcie_id = 1; // The PCIe port ID on the source side
uint32 dst_pcie_id = 2; // The PCIe port ID on the dest side
bool route_up = 3; // True if the route is available
}
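Comparing n_expected_routes with n_actual_routes is one way to spot RDU pairs with reduced connectivity. The sketch below reports such pairs together with the routes that are down. The dataclasses are simplified, hypothetical stand-ins for the generated RduPairData and LocalPcieRouteData messages, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LocalPcieRouteData:
    src_pcie_id: int
    dst_pcie_id: int
    route_up: bool


@dataclass
class RduPairData:
    xrdu_src_id: int
    rdu_src_id: int
    xrdu_dst_id: int
    rdu_dst_id: int
    n_expected_routes: int
    n_actual_routes: int
    routes: List[LocalPcieRouteData] = field(default_factory=list)


def find_degraded_pairs(pairs):
    """Return (src, dst, missing_count, down_routes) for pairs missing routes."""
    degraded = []
    for pair in pairs:
        missing = pair.n_expected_routes - pair.n_actual_routes
        if missing > 0:
            down = [(r.src_pcie_id, r.dst_pcie_id) for r in pair.routes if not r.route_up]
            src = f"XRDU_{pair.xrdu_src_id}/RDU_{pair.rdu_src_id}"
            dst = f"XRDU_{pair.xrdu_dst_id}/RDU_{pair.rdu_dst_id}"
            degraded.append((src, dst, missing, down))
    return degraded


# Invented sample data: one healthy pair and one pair with a route down.
pairs = [
    RduPairData(0, 0, 0, 1, 2, 2, [LocalPcieRouteData(0, 0, True), LocalPcieRouteData(1, 1, True)]),
    RduPairData(0, 1, 0, 0, 2, 1, [LocalPcieRouteData(0, 0, True), LocalPcieRouteData(1, 1, False)]),
]
print(find_degraded_pairs(pairs))
```

Because the pairs are directional, a degraded (src, dest) entry does not by itself say anything about the reverse direction, so each direction is checked separately.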