SNML API reference (admin users)
The SNML admin APIs are of interest to system administrators and support these tasks:
- Mark system faults as cleared
- Examine RDU memory
- Change auto-recovery settings
- Reset RDUs
This reference gives details. See SambaNova Management Layer (SNML) for examples, discussion of the SNML gRPC server, and troubleshooting info.
All SNML admin APIs require root privileges.
ClearSystemFault
Mark a system fault as cleared.
Summary
The ClearSystemFault API can be used to programmatically mark faults that have been diagnosed by the SNFM framework as cleared. See SambaNova Fault Management (SNFM) for background information.
Do not mark faults as cleared unless you have taken some action to resolve the fault. If faults are cleared but the root cause still exists, SNFM will likely re-diagnose the same fault later.
Input
/**
 * The request type for clearing a system fault
 */
message ClearSystemFaultRequest {
  uint32 request_id = 1;  // Request ID, for the caller's use
  string fault_uuid = 2;  // SNFM fault UUID of the fault you want to clear
}
Returns
This API returns a stream of ClearSystemFaultResponse objects, one for each RDU in the request list.
/**
 * The return type from clearing a system fault
 */
message ClearSystemFaultResponse {
  repeated string error_details = 1;  // Error details, if applicable
  uint32 request_id = 2;              // The request ID, provided by the caller
  string msg_result = 3;              // Message with the result of clearing the fault
                                      // Should be human readable
}
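The request and response shapes above can be exercised without a live server. The following is a minimal sketch in Python that uses dataclasses as stand-ins for the generated protobuf classes. The helper names `make_clear_request` and `summarize_clear_responses` are hypothetical and not part of SNML, and the real client stub and module names are not covered by this reference.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class ClearSystemFaultRequest:
    """Stand-in for the protobuf message of the same name."""
    request_id: int
    fault_uuid: str


@dataclass
class ClearSystemFaultResponse:
    """Stand-in for the protobuf message of the same name."""
    error_details: List[str] = field(default_factory=list)
    request_id: int = 0
    msg_result: str = ""


def make_clear_request(request_id: int, fault_uuid: str) -> ClearSystemFaultRequest:
    # Hypothetical helper: reject an empty UUID before anything is sent.
    if not fault_uuid:
        raise ValueError("fault_uuid must not be empty")
    return ClearSystemFaultRequest(request_id=request_id, fault_uuid=fault_uuid)


def summarize_clear_responses(
    responses: Iterable[ClearSystemFaultResponse], request_id: int
) -> Tuple[List[str], List[str]]:
    """Split a response stream into result messages and error strings.

    Responses whose request_id does not match the caller's ID are skipped.
    """
    results: List[str] = []
    errors: List[str] = []
    for resp in responses:
        if resp.request_id != request_id:
            continue
        if resp.error_details:
            errors.extend(resp.error_details)
        else:
            results.append(resp.msg_result)
    return results, errors
```

With a real gRPC client, the iterable passed to `summarize_clear_responses` would be the response stream returned by the call; here any list of stand-in objects works.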
MemReconfigRDU
Examine memory on an RDU.
Summary
The MemReconfigRDU API tells SNML to examine and optionally adjust an RDU:
- Check all memory attached to the RDU.
- Re-interleave any unhealthy physical memory sectors out of the memory pool and adjust the resulting memory capacity.
Input
/**
 * The top level request type for memory reconfiguration
 */
message MemReconfigRequest {
  uint32 request_id = 1;                  // Caller's request ID
  repeated MemReconfigQuery queries = 2;  // A list of queries, each specifying
                                          // one RDU to reconfigure
}
/**
 * A query type that specifies an RDU to memory-reconfigure
 */
message MemReconfigQuery {
  uint32 xrdu_id = 1;  // The XRDU the RDU is on
  uint32 rdu_id = 2;   // The RDU ID within the XRDU
  uint32 hep_id = 3;   // The HEP of the RDU to use to make this API call
}
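Because each query names exactly one RDU, a request for several RDUs carries one MemReconfigQuery per target. The sketch below, in Python, illustrates that shape with dataclass stand-ins for the generated protobuf classes. `build_mem_reconfig_request` is a hypothetical helper, and the range check only reflects the fact that the fields are declared as uint32.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

UINT32_MAX = 0xFFFFFFFF  # largest value a protobuf uint32 field can hold


@dataclass
class MemReconfigQuery:
    """Stand-in for the protobuf message of the same name."""
    xrdu_id: int
    rdu_id: int
    hep_id: int


@dataclass
class MemReconfigRequest:
    """Stand-in for the protobuf message of the same name."""
    request_id: int
    queries: List[MemReconfigQuery]


def build_mem_reconfig_request(
    request_id: int, targets: Iterable[Tuple[int, int, int]]
) -> MemReconfigRequest:
    """Build one query per (xrdu_id, rdu_id, hep_id) tuple in targets."""
    queries: List[MemReconfigQuery] = []
    for xrdu_id, rdu_id, hep_id in targets:
        for name, value in (("xrdu_id", xrdu_id), ("rdu_id", rdu_id), ("hep_id", hep_id)):
            # uint32 fields cannot hold negative or oversized values.
            if not 0 <= value <= UINT32_MAX:
                raise ValueError(f"{name}={value} does not fit in uint32")
        queries.append(MemReconfigQuery(xrdu_id, rdu_id, hep_id))
    return MemReconfigRequest(request_id=request_id, queries=queries)
```

The ID values to pass depend on the system topology, which this reference does not describe.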
ChangeAutoRecoverSettings
Change how and when SambaNova Runtime performs automatic recovery.
Summary
Use the ChangeAutoRecoverSettings API to change the system-wide settings that determine how SambaNova Runtime performs automatic recovery from hardware errors or faults.
Input
/**
 * The top-level request type for changing auto recovery settings
 * on the DataScale system
 * These settings apply system-wide
 */
message ChangeAutoRecoverSettingsRequest {
  uint32 request_id = 1;         // Request ID for the caller's consumption
  AutoRecover auto_recover = 2;  // The new settings to apply
}
/**
 * This structure represents each subset of auto-recovery settings
 */
message AutoRecover {
  uint32 auto_reset = 1;      // 1 to enable automatically resetting RDUs, 0 to disable
  uint32 fq_ccr = 2;          // 1 to enable automatic tile-level resets, 0 to disable
  uint32 health_monitor = 3;  // 1 to enable the internal tile health monitor, 0 to disable
}
Returns
/**
 * Response type from changing autorecovery settings
 */
message ChangeAutoRecoverSettingsResponse {
  uint32 request_id = 1;              // Request ID
  repeated string error_details = 2;  // Error details, if applicable
  AutoRecover auto_recover = 3;       // The new auto-recovery settings
                                      // after this API is complete
}
/**
 * This structure represents each subset of auto-recovery settings
 */
message AutoRecover {
  uint32 auto_reset = 1;      // 1 to enable automatic RDU resets, 0 to disable
  uint32 fq_ccr = 2;          // 1 to enable automatic tile-level resets, 0 to disable
  uint32 health_monitor = 3;  // 1 to enable the internal tile health monitor, 0 to disable
}
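Since the response echoes the settings in effect after the call, a caller can compare them with what was requested. The following Python sketch shows one way to do that, using a dataclass stand-in for the AutoRecover message. `make_auto_recover` and `settings_applied` are hypothetical helpers, not SNML functions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AutoRecover:
    """Stand-in for the protobuf message; each field is 1 (enable) or 0 (disable)."""
    auto_reset: int
    fq_ccr: int
    health_monitor: int


def make_auto_recover(auto_reset: bool, fq_ccr: bool, health_monitor: bool) -> AutoRecover:
    # Convert Python booleans to the 0/1 values the uint32 fields expect.
    return AutoRecover(int(auto_reset), int(fq_ccr), int(health_monitor))


def settings_applied(
    requested: AutoRecover, reported: AutoRecover, error_details: Sequence[str]
) -> bool:
    """True when no errors were reported and the echoed settings match the request."""
    return not error_details and requested == reported
```

For example, `make_auto_recover(True, False, True)` yields settings that enable automatic RDU resets and the tile health monitor while disabling automatic tile-level resets.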
ManualRDUReset
Reset an RDU or group of RDUs.
Summary
The ManualRDUReset API can be used to tell the SambaNova daemon (SND) to reset an RDU or group of RDUs.
Input
/**
 * The request type for requesting that SND reset an RDU
 */
message ManualRDUResetRequest {
  uint32 request_id = 1;              // Caller's request ID
  repeated uint32 rdus_to_reset = 2;  // A list of single-number RDU identifiers to reset
  bool wait_for_reinit = 3;           // If set to False, SND launches the RDU's
                                      // post-reset initialization in the background.
                                      // If True, this API blocks until re-init is complete.
                                      // This can take on the order of a few minutes.
}
Returns
/**
 * The response type for requesting RDU resets from SND
 */
message ManualRDUResetResponse {
  uint32 request_id = 1;                        // Caller's request ID
  bool wait_for_reinit = 3;                     // Whether this request was run with wait_for_reinit True
                                                // or False. If True, the success and failure reflects
                                                // both the reset and the re-init. If False, the success
                                                // and failure only reflect the reset itself.
  repeated string error_details = 2;            // Error details, if applicable
  repeated uint32 rdus_reset_successfully = 4;  // List of RDUs that were successfully reset
  repeated uint32 rdus_failed_to_reset = 5;     // List of RDUs that encountered failures
}
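The meaning of the success and failure lists depends on wait_for_reinit, so it helps to carry that flag along when reporting results. The Python sketch below illustrates this with a dataclass stand-in for the response message. `summarize_reset` is a hypothetical helper, and the "unreported" bucket is an assumption added for illustration: the reference does not say whether every requested RDU always appears in one of the two lists.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class ManualRDUResetResponse:
    """Stand-in for the protobuf message of the same name."""
    request_id: int = 0
    error_details: List[str] = field(default_factory=list)
    wait_for_reinit: bool = False
    rdus_reset_successfully: List[int] = field(default_factory=list)
    rdus_failed_to_reset: List[int] = field(default_factory=list)


def summarize_reset(resp: ManualRDUResetResponse, requested: Iterable[int]) -> Dict[str, object]:
    """Describe what the success and failure lists cover for this response."""
    # With wait_for_reinit True the lists cover reset plus re-init;
    # with False they cover only the reset itself.
    scope = "reset and re-init" if resp.wait_for_reinit else "reset only"
    reported = set(resp.rdus_reset_successfully) | set(resp.rdus_failed_to_reset)
    return {
        "scope": scope,
        "succeeded": sorted(resp.rdus_reset_successfully),
        "failed": sorted(resp.rdus_failed_to_reset),
        # Assumption: requested RDUs missing from both lists are worth flagging.
        "unreported": sorted(set(requested) - reported),
        "errors": list(resp.error_details),
    }
```

When the scope is "reset only", a successful entry says nothing about whether the background re-initialization later completed.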