SambaNova Management Layer (SNML)

The SambaNova Management Layer (SNML) contains APIs that you can use for:

  • Requesting information about RDU status.

  • Managing RDUs.

  • Querying information about the host.

  • Performing other DataScale system tasks.

The SNML server is accessible via gRPC on a DataScale system. You can interact with SNML from any gRPC/protobuf client. gRPC supports many languages, including C++, Python, and Ruby. For examples of creating a gRPC/protobuf client, see https://grpc.io/.

The example code on this doc page uses Python.

SNML code examples

The following Python code examples call SNML APIs to perform common management tasks. The examples use the Python gRPC client, but any other gRPC client should support equivalent code.

Retrieve the system’s platform information

This code sample shows how to use the SNML GetSystemPlatformInfo API to retrieve a DataScale system’s platform identifier (for example, DataScale SN10-8).

GetSystemPlatformInfo example
# Import the Python gRPC data structures and functions
# These should be available on a DataScale system with the `sambanova-runtime` package installed
import grpc
from pysnml.snml_rpc_pb2 import *
from pysnml.snml_rpc_pb2_grpc import *

# assume SNML is configured as default
SERVER_HOST = "127.0.0.1"
SERVER_PORT = 50053

# Connect to the SNML server on this DataScale system
with grpc.insecure_channel(f'{SERVER_HOST}:{SERVER_PORT}') as channel:
        # A gRPC stub for the SNML RPC service - generated for us by gRPC's Python code
        stub = SNMLRPCStub(channel)

        # Call the API. The `request_id` field is provided for the caller's consumption
        # The `request_id` in the response type will be the same as the one provided
        # This is useful when the caller is sending multiple requests and wants to keep track
        # In this example, we'll just pass 0.
        query_response = stub.GetSystemPlatformInfo(
            ResourceQuery(request_id=0))

        # This object will have the following fields:
        # - platform_name (str): the platform name, for example "DataScale SN30-8"
        # - request_id (int): the request ID provided by the caller
        # - error_details (list of str): A list of strings specifying any errors that
        #   happened during the calling of this API. If no errors occurred, this list is empty
        # - ver_info (SystemVersionInfo): the version information of the SNFM databases
        #   on this system. Mostly not interesting to consumers.
        print(query_response)

Check an RDU’s operational state

Each RDU’s operational state is tracked internally by SambaNova Runtime. The operational state reflects whether SambaFlow applications can run on that RDU. An RDU may be in a specific operational state due to physical presence, fault management policy, hardware issues, or other reasons.

  • Fault state is always hardware related.

  • Operational state could be related to hardware or policy. For details about RDU operational states, see Check the status of RDUs.

You can retrieve RDU operational states using the GetRDUOperationalStatus API, as shown in the following sample code:

GetRDUOperationalStatus example
# Import the Python gRPC data structures and functions
# These should be available on DataScale systems with the sambanova-runtime package installed
import grpc
from pysnml.snml_rpc_pb2 import *
from pysnml.snml_rpc_pb2_grpc import *

# assume SNML is configured as default
SERVER_HOST = "127.0.0.1"
SERVER_PORT = 50053

# Connect to the SNML server on this DataScale system
with grpc.insecure_channel(f'{SERVER_HOST}:{SERVER_PORT}') as channel:
    # A gRPC stub for the SNML RPC service
    stub = SNMLRPCStub(channel)

    # Each RDU has a component name like "XRDU_0/RDU_1"
    # It also has a unique single-number identifier derived from its component name.
    # To calculate the identifier, multiply the XRDU number by the
    # number of RDUs per XRDU (2)
    # and add the RDU number within the XRDU
    # This list reflects "XRDU_0/RDU_0" and "XRDU_0/RDU_1"
    rdu_ids = [0, 1]

    # Request the operational status of the RDUs we listed above
    query_response = stub.GetRDUOperationalStatus(
        RDUOperationalStatusRequest(request_id=0,
                                    rdu_ids=rdu_ids))

    # This is an iterator over the streamed responses. You can loop over it,
    # but it is not subscriptable
    for rdu in query_response:
        # The RDU ID is as specified above
        # The `comp_name` is something like "/NODE/XRDU_0/RDU_1"
        # The state is an enum with one of the following values:
        #    - RDU_STATE_FUNCTIONAL
        #    - RDU_STATE_PENDING
        #    - RDU_STATE_DEGRADED
        #    - RDU_STATE_UNAVAILABLE
        print(rdu.rdu_id, rdu.comp_name,
              RDUOperationalState.Name(rdu.state))
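
The identifier arithmetic described in the comments of the example above can be written as a small helper. This is a minimal sketch and not part of pysnml: the function name `rdu_id_from_comp_name` and the constant `RDUS_PER_XRDU` are hypothetical, and the value 2 is taken from the comment in the example, so check that it holds for your platform.

```python
import re

# Number of RDUs per XRDU, taken from the comment in the example above.
# Assumption: this value is correct for your DataScale platform.
RDUS_PER_XRDU = 2

# Matches names such as "XRDU_0/RDU_1" or "/NODE/XRDU_0/RDU_1"
_COMP_NAME_RE = re.compile(r"XRDU_(\d+)/RDU_(\d+)$")


def rdu_id_from_comp_name(comp_name, rdus_per_xrdu=RDUS_PER_XRDU):
    """Convert an RDU component name to its single-number identifier."""
    match = _COMP_NAME_RE.search(comp_name)
    if match is None:
        raise ValueError(f"unrecognized RDU component name: {comp_name!r}")
    xrdu, rdu = int(match.group(1)), int(match.group(2))
    if rdu >= rdus_per_xrdu:
        raise ValueError(f"RDU index {rdu} is out of range in {comp_name!r}")
    # (XRDU number) * (RDUs per XRDU) + (RDU number within the XRDU)
    return xrdu * rdus_per_xrdu + rdu


rdu_ids = [rdu_id_from_comp_name(name)
           for name in ("XRDU_0/RDU_0", "XRDU_0/RDU_1", "XRDU_1/RDU_0")]
print(rdu_ids)  # [0, 1, 2]
```

The resulting list can be passed as the `rdu_ids` argument of `RDUOperationalStatusRequest`.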

Request RDU resets on unavailable RDUs

This example uses an SNML Admin API and requires root privileges.

An RDU may be usable or unusable based on its operational state. The following code example shows how to use SNML Admin to:

  • Check if any of the RDUs are not healthy.

  • Reset any RDUs that aren’t healthy.

  • Poll the operational state until the RDU returns to RDU_STATE_FUNCTIONAL (fully healthy).

SNML admin example for resetting and polling an RDU
# Retrieving RDU operational state is a "plain" SNML API
# No root privileges are required
from pysnml.snml_rpc_pb2 import *
from pysnml.snml_rpc_pb2_grpc import *

# Requesting RDU resets is an SNML Admin API
# Callers require root privileges, which are checked on the server side
from pysnml.snml_rpc_admin_pb2 import *
from pysnml.snml_rpc_admin_pb2_grpc import *

import grpc
import time
import sys

# assume SNML is configured as default
SERVER_HOST = "127.0.0.1"
SERVER_PORT = 50053


with grpc.insecure_channel(f'{SERVER_HOST}:{SERVER_PORT}') as channel:
    # Connect to the SNML admin and SNML servers
    # Each service requires a Stub
    stub = SNMLRPCStub(channel)
    admin_stub = SNMLRPCADMINStub(channel)

    # Launch an asynchronous RDU reset
    # Setting the field wait_for_reinit to False tells SND to launch the RDU's
    # post-reset initialization in the background
    # We will poll on the RDU's operational state later to
    # make sure that the reset finishes successfully.
    # Alternatively, `wait_for_reinit=True` tells SND to not return from this query
    # until the reset succeeds or fails
    rst_query_response = admin_stub.ManualRDUReset(
            ManualRDUResetRequest(request_id=0,
                                  rdus_to_reset=[1], # Reset RDU 1 (XRDU_0/RDU_1)
                                  wait_for_reinit=False))
    print("Requested SND to reset RDU")


    # The RDU will be in state RDU_STATE_PENDING while the reset is in progress
    # And will return to RDU_STATE_FUNCTIONAL after it's complete
    while True: # you could specify a maximum number of iterations here
        # Request the RDU's operational status on every pass.
        # The response can be iterated only once, so it must be
        # requested again for each poll.
        query_response = stub.GetRDUOperationalStatus(
            RDUOperationalStatusRequest(request_id=0,
                                        rdu_ids=[1]))
        print("Requested RDU's operational status from SND")

        for rdu in query_response:
            print(f"RDU {rdu.rdu_id} in state {RDUOperationalState.Name(rdu.state)}")
            if RDUOperationalState.Name(rdu.state) == "RDU_STATE_FUNCTIONAL":
                print("Successfully reset and polled status of RDU 1")
                sys.exit(0)

        time.sleep(5)
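
The polling loop above runs until the RDU becomes functional, and its comment suggests capping the number of iterations. The following is a minimal sketch of a bounded version. The helper `poll_until_functional` and its `fetch_states` callable are hypothetical. A stand-in replaces the SNML query here so that the sketch runs without a DataScale system.

```python
import time


def poll_until_functional(fetch_states, max_attempts=24, delay_s=5.0,
                          sleep=time.sleep):
    """Poll until every RDU reports RDU_STATE_FUNCTIONAL.

    fetch_states is a hypothetical callable that returns a dict mapping
    each RDU ID to its operational state name.
    Returns True when all RDUs are functional, False if attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        states = fetch_states()
        waiting = {rdu_id: state for rdu_id, state in states.items()
                   if state != "RDU_STATE_FUNCTIONAL"}
        if states and not waiting:
            return True
        print(f"attempt {attempt}/{max_attempts}: waiting on {waiting}")
        if attempt < max_attempts:
            sleep(delay_s)
    return False


# Stand-in for the SNML query: pending twice, then functional.
_fake_replies = iter([
    {1: "RDU_STATE_PENDING"},
    {1: "RDU_STATE_PENDING"},
    {1: "RDU_STATE_FUNCTIONAL"},
])
ok = poll_until_functional(lambda: next(_fake_replies), max_attempts=5,
                           sleep=lambda seconds: None)
print("reset complete" if ok else "timed out")
```

With a real connection, `fetch_states` could issue a new `stub.GetRDUOperationalStatus(...)` call on each invocation and build the dict from `rdu.rdu_id` and `RDUOperationalState.Name(rdu.state)`. Returning False instead of looping forever lets the caller decide whether to retry the reset or escalate.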

About the SNML APIs

Before you start using SNML APIs, it’s useful to learn about the two services and how they are versioned. It’s also useful to understand the request_id and error_details fields that are used in most APIs.

Supported gRPC services

The SNML server provides two gRPC services with corresponding APIs:

  • SNML APIs are for all DataScale users and do not require elevated privileges. Most of the APIs are read-only and safe for anyone to call without affecting the environment.

  • SNML Admin APIs require root privileges and are for system administrators. These APIs include read-write operations (such as marking a system fault as clear) and read-only operations that return sensitive information.

The two services are separate gRPC services that share one TCP port. gRPC uses the concept of a stub to abstract a group of APIs, and each service has its own stub that you use to access it. Both services are part of the SNML software layer.

gRPC services are accessible only from the same host and not over the network.

Versioning

The SNML services are both semantically versioned. Major version changes are potentially incompatible, minor changes are backward compatible, and patch changes are forward compatible.

Use the GetRuntimeVersion SNML API to retrieve the version number of the SNML service that you are querying.
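
The versioning rule can be expressed as a simple check on the major version. This is a minimal sketch under two assumptions: that you have both versions available as `MAJOR.MINOR.PATCH` strings (the exact shape of the GetRuntimeVersion response is not shown on this page), and that only a major version difference counts as incompatible. The function names are hypothetical.

```python
def parse_version(text):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of three ints."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
    return tuple(int(part) for part in parts)


def versions_compatible(client_version, server_version):
    """Return True when client and server share the same major version."""
    return parse_version(client_version)[0] == parse_version(server_version)[0]


# Illustrative version strings, not real SNML release numbers
print(versions_compatible("1.4.2", "1.6.0"))  # True
print(versions_compatible("1.4.2", "2.0.0"))  # False
```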

The request_id and error_details fields in SNML APIs

Many of the APIs use request_id for input data and might use error_details if there are SNML errors.

request_id

  • The input data types of all SNML APIs have an optional request_id field. This field can be useful if the caller plans to send several API requests of the same type to the server, for example in an asynchronous or threaded environment. Because the request_id returned by SNML in an API’s output data structure always matches the request_id in the input data structure, you can check which request maps to which return.

  • Some API calls don’t require any input data. For these, the message ResourceQuery is provided. It has only one field, which is the request_id described above. The ResourceQuery object is shown below.

message ResourceQuery {
    uint32 request_id = 1;
}
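
One way to use request_id is to hand out a fresh ID for every request and remember what each one was for. The sketch below is an illustration only: the `RequestTracker` class is hypothetical, and `SimpleNamespace` objects stand in for SNML responses so that the code runs without a server.

```python
import itertools
from types import SimpleNamespace


class RequestTracker:
    """Hands out request IDs and remembers what each request was for."""

    def __init__(self):
        # request_id is a uint32 in the protobuf definition, so small
        # positive integers are safe.
        self._next_id = itertools.count(1)
        self._pending = {}

    def new_request(self, label):
        """Reserve a new request ID and associate it with a label."""
        request_id = next(self._next_id)
        self._pending[request_id] = label
        return request_id

    def match(self, response):
        """Return the label of the request that this response answers."""
        return self._pending.pop(response.request_id)


tracker = RequestTracker()
id_a = tracker.new_request("platform info")
id_b = tracker.new_request("runtime version")

# Stand-ins for SNML responses that arrive out of order
responses = [SimpleNamespace(request_id=id_b),
             SimpleNamespace(request_id=id_a)]
for response in responses:
    print(response.request_id, "->", tracker.match(response))
```

In real code, the ID returned by `new_request` would be passed as `request_id` when building a message such as `ResourceQuery`.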

gRPC return codes and error_details

  • If an API request reaches the SNML server successfully, the call returns the grpc::OK status code.

  • If the request never reaches SNML, for example because the client could not connect to the SNML server or the server crashed, the call fails with a standard gRPC status code. See Troubleshooting SNML for some of these codes.

  • If an error occurs inside SNML, the error_details field is populated. The error_details field is a list of strings.

    • If error_details is an empty list, no errors occurred during the operation of the API.

    • If error_details is non-empty, it contains details about the failing parts of the SNML API.
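
Because a call can return grpc::OK and still carry errors in error_details, it helps to check the field after every call. The following is a minimal sketch. The `SNMLError` exception and the `raise_on_error_details` helper are hypothetical names, and `SimpleNamespace` objects stand in for SNML responses.

```python
from types import SimpleNamespace


class SNMLError(RuntimeError):
    """Hypothetical exception for errors reported in error_details."""


def raise_on_error_details(response):
    """Return the response unchanged, or raise if SNML reported errors."""
    details = list(response.error_details)
    if details:
        raise SNMLError("; ".join(details))
    return response


# Stand-ins for SNML responses; the error text is made up for illustration
good = SimpleNamespace(request_id=0, error_details=[])
bad = SimpleNamespace(request_id=1, error_details=["example error text"])

raise_on_error_details(good)  # passes silently
try:
    raise_on_error_details(bad)
except SNMLError as err:
    print(f"SNML reported: {err}")
```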

For information about the status of your DataScale system and its components, use the APIs in SNML API reference (all users) and SNML API reference (admin users).

Manage the SNML server

The SNML RPC server is a gRPC server that runs on the DataScale system’s host and is part of the SND service. Because the gRPC server runs as part of a Linux systemd service, it is managed through standard .service files.

In most cases, you don’t have to explicitly change the SNML server. You need root privileges to make changes to the SNML services.

Change the SNML server port

By default, the server runs on port 50053 while the SND system service is running.

To change the port that SNML listens on:

  1. Run systemctl edit snd.

  2. Set the SND environment variable SNML_SERVER to HOST:PORT. PORT can be any available port on the system.
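
If you change the port, client scripts that hardcode 50053 stop working. The sketch below shows one way for a client to read the address in the same HOST:PORT form. It rests on an assumption: this page only says that SND reads SNML_SERVER, so the client must export the variable itself for this to have any effect. The helper name `snml_target` is hypothetical.

```python
import os

# Default address used in the examples on this page
DEFAULT_SNML_SERVER = "127.0.0.1:50053"


def snml_target(env=None):
    """Return (host, port) parsed from SNML_SERVER, or the default."""
    env = os.environ if env is None else env
    value = env.get("SNML_SERVER", DEFAULT_SNML_SERVER)
    host, sep, port_text = value.rpartition(":")
    if not sep or not host or not port_text.isdigit():
        raise ValueError(f"expected HOST:PORT, got {value!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


print(snml_target({}))                                 # ('127.0.0.1', 50053)
print(snml_target({"SNML_SERVER": "127.0.0.1:50060"}))  # ('127.0.0.1', 50060)
```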

Start or stop the SNML server

SNML is started as a component of the SND (SambaNova daemon) system service. If the SND service is running, the SNML server should also be running.

  • To check if the SNML server is running, run systemctl status snd and check whether the output shows Active: active (running). This command does not require root privileges.

  • To start or stop the SNML server explicitly, use systemctl commands like systemctl restart snd or systemctl stop snd. See Manage SND.
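
A script can make the same check before it opens a gRPC channel. This is a minimal sketch that uses the standard `systemctl is-active` subcommand rather than parsing the output of `systemctl status`. The helper name `snd_is_active` and the injectable `run` parameter are hypothetical. The demonstration uses a fake runner so that it also works on a machine without systemd.

```python
import subprocess
from types import SimpleNamespace


def snd_is_active(run=subprocess.run):
    """Return True if systemd reports the snd unit as active."""
    try:
        # `systemctl is-active --quiet UNIT` exits with 0 only when
        # the unit is active.
        result = run(["systemctl", "is-active", "--quiet", "snd"],
                     check=False)
    except FileNotFoundError:
        # systemctl is not installed on this machine
        return False
    return result.returncode == 0


# Fake runners that mimic an active and an inactive unit
def fake_active(cmd, check=False):
    return SimpleNamespace(returncode=0)


def fake_inactive(cmd, check=False):
    return SimpleNamespace(returncode=3)


print(snd_is_active(run=fake_active))    # True
print(snd_is_active(run=fake_inactive))  # False
```

On a DataScale host, calling `snd_is_active()` with no arguments runs the real command.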

Examine log messages

To examine SNML log messages, check /var/log/sambaflow/runtime/snd.log. This log includes SND log messages and SNML log messages.

Troubleshooting SNML

The following errors might occur when you’re using SNML:

SNML RPC call failure UNAVAILABLE

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1682552398.981568670","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3158,"referenced_errors":[{"created":"@1682552398.981567400","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":147,"grpc_status":14}]}"

RuntimeError: SNML RPC Call Failure UNAVAILABLE : failed to connect to SND

If you see a message like this, the SNML server could not be reached. Either SND is not running, or SambaNova Runtime is not installed, or the SNML server crashed while trying to service your API request.

  • To find out if SND is not running, run systemctl status snd. If SND is not running, run systemctl start snd. The start command requires root privileges.

  • To find out if Runtime is not installed, run apt list --installed (or the yum equivalent). If Runtime is not installed, run apt install sambanova-runtime.

  • To find out if the SND server crashed, check the systemctl restart history for SND. If there’s a problem with repeated crashes, contact SambaNova Customer Support.

SNML RPC UNIMPLEMENTED error code

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNIMPLEMENTED
    details = "API not implemented"
    debug_error_string = "{"created":"@1682552398.981568670","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3158,"referenced_errors":[{"created":"@1682552398.981567400","description":"unimplemented","file":"src/core/lib/transport/error_utils.cc","file_line":147,"grpc_status":14}]}"

The UNIMPLEMENTED error code occurs when SNML cannot find the API you have requested. Here are some possible reasons:

  • You are requesting an API that is not part of the service you are using (or there’s a typo).

  • The version number of the SNML server and of your SNML client do not match.

Make sure you build your client against the protobuf file from the correct version of SambaNova Runtime. SNML is backward compatible across minor versions and forward compatible across patch versions, so version incompatibility is only a possible problem if the major version has changed.

  • To see the client-side version, check the protobuf file that your client was built against.

  • To see the server-side version, use the GetRuntimeVersion SNML API.
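
The troubleshooting steps in this section can be attached to the status code name that a failed call reports. The following is a minimal sketch. The `HINTS` table and the `troubleshooting_hint` helper are hypothetical, and the hint text only restates the guidance on this page.

```python
# Maps a gRPC status code name to the guidance given on this page
HINTS = {
    "UNAVAILABLE": (
        "The SNML server could not be reached. Run `systemctl status snd`, "
        "confirm that sambanova-runtime is installed, and check whether "
        "SND has been restarting."
    ),
    "UNIMPLEMENTED": (
        "SNML could not find the requested API. Check the API name, the "
        "stub you are calling it on, and whether the client and server "
        "major versions match."
    ),
}


def troubleshooting_hint(status_name):
    """Return a short hint for a gRPC status code name."""
    return HINTS.get(
        status_name,
        f"No SNML-specific hint for {status_name}. "
        "See the standard gRPC status code descriptions.",
    )


print(troubleshooting_hint("UNAVAILABLE"))
```

In a client, an error raised by a stub call is typically a `grpc.RpcError` whose `code()` method returns a `grpc.StatusCode`, so `troubleshooting_hint(err.code().name)` would select the matching hint.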