SNML API reference (admin users)

The SNML admin APIs are of interest to system administrators and support these tasks:

  • Mark system faults as cleared

  • Examine and reconfigure RDU memory

  • Change automatic recovery settings

  • Reset RDUs

This reference describes the request and response messages for each API. See SambaNova Management Layer (SNML) for examples, a discussion of the SNML gRPC server, and troubleshooting information.

All SNML admin APIs require root privileges.

ClearSystemFault

Mark a system fault as cleared.

Summary

Use the ClearSystemFault API to programmatically mark as cleared a fault that the SNFM framework has diagnosed. See SambaNova Fault Management (SNFM) for background information.

Do not mark faults as cleared unless you have taken some action to resolve the fault. If faults are cleared but the root cause still exists, SNFM will likely re-diagnose the same fault later.

Input

/**
 * The request type for clearing a system fault
 */
message ClearSystemFaultRequest {
    uint32 request_id = 1;      // Request ID, for the caller's use
    string fault_uuid = 2;      // SNFM fault UUID of the fault you want to clear
}

Returns

This API returns a stream of ClearSystemFaultResponse objects that report the result of clearing the fault identified by fault_uuid.

/**
 * The return type from clearing a system fault
 */
message ClearSystemFaultResponse {
    repeated string error_details = 1;  // Error details, if applicable
    uint32 request_id = 2;              // The request ID, provided by the caller
    string msg_result = 3;              // Message with the result of clearing the fault
                                        // Should be human readable
}
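
The following is a minimal sketch of how a caller might validate a fault UUID before building a request and then interpret a response. The dataclasses are hypothetical stand-ins that mirror the message fields above; they are not the generated gRPC stubs. The helper names are illustrative, and two things are assumptions: that SNFM fault UUIDs use the standard textual UUID form, and that an empty error_details list indicates success.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClearSystemFaultRequest:
    """Hypothetical stand-in mirroring the ClearSystemFaultRequest message."""
    request_id: int
    fault_uuid: str


@dataclass
class ClearSystemFaultResponse:
    """Hypothetical stand-in mirroring the ClearSystemFaultResponse message."""
    error_details: List[str] = field(default_factory=list)
    request_id: int = 0
    msg_result: str = ""


def build_clear_request(request_id: int, fault_uuid: str) -> ClearSystemFaultRequest:
    # Assumption: SNFM fault UUIDs use the standard textual UUID form.
    # uuid.UUID raises ValueError for malformed input.
    uuid.UUID(fault_uuid)
    return ClearSystemFaultRequest(request_id=request_id, fault_uuid=fault_uuid)


def fault_cleared(request: ClearSystemFaultRequest,
                  response: ClearSystemFaultResponse) -> bool:
    # Assumption: the call succeeded if the response echoes the caller's
    # request ID and carries no error details.
    return (response.request_id == request.request_id
            and not response.error_details)
```

Validating the UUID on the client side catches copy-and-paste mistakes before the request is sent; msg_result can then be logged as is, since it is intended to be human readable.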

MemReconfigRDU

Examine memory on an RDU.

Summary

The MemReconfigRDU API tells SNML to examine the memory attached to an RDU and adjust it if needed. SNML performs these steps:

  • Check all memory attached to the RDU.

  • Re-interleave any unhealthy physical memory sectors out of the memory pool and adjust the resulting memory capacity accordingly.

Input

/**
 * The top-level request type for memory reconfiguration
 */
message MemReconfigRequest {
    uint32 request_id = 1;                  // Caller's request ID
    repeated MemReconfigQuery queries = 2;  // A list of queries, each specifying
                                            // one RDU to reconfigure
}

/**
 * A query type that specifies an RDU to memory-reconfigure
 */
message MemReconfigQuery {
    uint32 xrdu_id = 1;     // The XRDU the RDU is on
    uint32 rdu_id = 2;      // The RDU ID within the XRDU
    uint32 hep_id = 3;      // The HEP of the RDU to use to make this API call
}

Returns

/**
 * The return type for memory reconfiguration
 */
message MemReconfigResponse {
    uint32 request_id = 1;              // Caller's request ID
    repeated string error_details = 2;  // Error details, if applicable
}
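
As a rough illustration of how the request nests one query per RDU, the sketch below builds a request from a list of (xrdu_id, rdu_id, hep_id) targets. The dataclasses are hypothetical stand-ins that mirror the message fields; they are not the generated gRPC stubs. The helper name and the ID values in the usage note are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

UINT32_MAX = 2**32 - 1  # all three ID fields are uint32 in the message


@dataclass
class MemReconfigQuery:
    """Hypothetical stand-in mirroring the MemReconfigQuery message."""
    xrdu_id: int
    rdu_id: int
    hep_id: int


@dataclass
class MemReconfigRequest:
    """Hypothetical stand-in mirroring the MemReconfigRequest message."""
    request_id: int
    queries: List[MemReconfigQuery] = field(default_factory=list)


def build_mem_reconfig_request(
    request_id: int, targets: Iterable[Tuple[int, int, int]]
) -> MemReconfigRequest:
    """Build one query per (xrdu_id, rdu_id, hep_id) target."""
    queries = []
    for xrdu_id, rdu_id, hep_id in targets:
        ids = {"xrdu_id": xrdu_id, "rdu_id": rdu_id, "hep_id": hep_id}
        for name, value in ids.items():
            # Reject values that cannot be encoded as uint32.
            if not 0 <= value <= UINT32_MAX:
                raise ValueError(f"{name}={value} does not fit in uint32")
        queries.append(MemReconfigQuery(**ids))
    return MemReconfigRequest(request_id=request_id, queries=queries)
```

For example, build_mem_reconfig_request(1, [(0, 0, 0), (0, 1, 0)]) produces a request with two queries, one for each RDU on XRDU 0.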

ChangeAutoRecoverSettings

Change how and when SambaNova Runtime performs automatic recovery.

Summary

Use the ChangeAutoRecoverSettings API to change the system-wide settings that determine how SambaNova Runtime performs automatic recovery from hardware errors or faults.

Input

/**
 * The top-level request type for changing auto-recovery settings
 * on the DataScale system.
 * These settings apply system-wide.
 */
message ChangeAutoRecoverSettingsRequest {
    uint32 request_id = 1;          // Request ID for the caller's consumption
    AutoRecover auto_recover = 2;   // The new settings to apply
}

/**
 * This structure represents each subset of auto-recovery settings
 */
message AutoRecover {
    uint32 auto_reset = 1;      // 1 to enable automatically resetting RDUs, 0 to disable
    uint32 fq_ccr = 2;          // 1 to enable automatic tile-level resets, 0 to disable
    uint32 health_monitor = 3;  // 1 to enable the internal tile health monitor, 0 to disable
}

Returns

/**
 * Response type from changing auto-recovery settings
 */
message ChangeAutoRecoverSettingsResponse {
    uint32 request_id = 1;              // Request ID
    repeated string error_details = 2;  // Error details, if applicable
    AutoRecover auto_recover = 3;       // The new auto-recovery settings
                                        // after this API is complete
}
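
Because the response reports the settings in effect after the call, a caller can compare them with what was requested. The sketch below shows one way to do that. The AutoRecover dataclass is a hypothetical stand-in that mirrors the message fields, not the generated gRPC stub, and the helper name is illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

FIELDS = ("auto_reset", "fq_ccr", "health_monitor")


@dataclass(frozen=True)
class AutoRecover:
    """Hypothetical stand-in mirroring the AutoRecover message."""
    auto_reset: int
    fq_ccr: int
    health_monitor: int

    def __post_init__(self):
        # Each setting is documented as 1 to enable, 0 to disable.
        for name in FIELDS:
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")


def settings_mismatch(requested: AutoRecover,
                      applied: AutoRecover) -> Dict[str, Tuple[int, int]]:
    """Return {field: (requested, applied)} for every field that differs."""
    return {
        name: (getattr(requested, name), getattr(applied, name))
        for name in FIELDS
        if getattr(requested, name) != getattr(applied, name)
    }
```

An empty result means the settings returned in the response match the request; any entry identifies a setting that did not take effect as requested.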

ManualRDUReset

Reset an RDU or group of RDUs.

Summary

Use the ManualRDUReset API to tell the SambaNova daemon (SND) to reset an RDU or group of RDUs.

Input

/**
 * The request type for requesting that SND reset an RDU
 */
message ManualRDUResetRequest {
    uint32 request_id = 1;              // Caller's request ID
    repeated uint32 rdus_to_reset = 2;  // A list of single-number RDU identifiers to reset
    bool wait_for_reinit = 3;           // If False, SND launches the RDU's
                                        // post-reset initialization in the background.
                                        // If True, this API blocks until re-init is complete,
                                        // which can take on the order of a few minutes.
}

Returns

/**
 * The response type for requesting RDU resets from SND
 */
message ManualRDUResetResponse {
    uint32 request_id = 1;                          // Caller's request ID
    repeated string error_details = 2;              // Error details, if applicable
    bool wait_for_reinit = 3;                       // Whether this request was run with
                                                    // wait_for_reinit True or False. If True,
                                                    // success and failure reflect both the
                                                    // reset and the re-init. If False, they
                                                    // reflect only the reset itself.
    repeated uint32 rdus_reset_successfully = 4;    // List of RDUs that were successfully reset
    repeated uint32 rdus_failed_to_reset = 5;       // List of RDUs that encountered failures
}
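
The sketch below shows one way a caller might interpret a reset response: it reports which requested RDUs appear in neither result list and states what the success and failure lists cover, based on wait_for_reinit. The dataclass is a hypothetical stand-in that mirrors the message fields, not the generated gRPC stub. The helper names are illustrative, and whether SND always places every requested RDU in one of the two lists is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class ManualRDUResetResponse:
    """Hypothetical stand-in mirroring the ManualRDUResetResponse message."""
    request_id: int = 0
    error_details: List[str] = field(default_factory=list)
    wait_for_reinit: bool = False
    rdus_reset_successfully: List[int] = field(default_factory=list)
    rdus_failed_to_reset: List[int] = field(default_factory=list)


def unreported_rdus(requested: Iterable[int],
                    response: ManualRDUResetResponse) -> Set[int]:
    """Return requested RDUs that appear in neither result list."""
    reported = (set(response.rdus_reset_successfully)
                | set(response.rdus_failed_to_reset))
    return set(requested) - reported


def result_scope(response: ManualRDUResetResponse) -> str:
    """Describe what the success and failure lists cover."""
    # With wait_for_reinit True, the lists reflect the reset and the re-init;
    # otherwise they reflect only the reset itself.
    return "reset and re-init" if response.wait_for_reinit else "reset only"
```

When wait_for_reinit is False, an RDU listed in rdus_reset_successfully may still be initializing in the background, so a caller may want to confirm that the RDU is ready before using it.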