SambaNova Management Layer (SNML)
The SambaNova Management Layer (SNML) contains APIs that you can use for:
- Requesting information about RDU status.
- Managing RDUs.
- Querying information about the host.
- Performing other DataScale system tasks.
The SNML server is accessible via gRPC on a DataScale system. You can interact with SNML from a gRPC/protobuf client in any language that gRPC supports, including C++, Python, and Ruby. For examples of creating a gRPC/protobuf client, see https://grpc.io/.
The example code on this doc page uses Python.
Some of the SNML APIs have changed between SambaFlow 1.16 and 1.17. Please review the SNML API reference page and the release notes for details.
SNML code examples
The following Python code examples call SNML APIs to perform common management tasks. The examples are written using the Python gRPC client, but any other gRPC client should support equivalent code.
Retrieve the system’s platform information
This code sample shows how to use the SNML GetSystemPlatformInfo API to retrieve a DataScale system’s platform identifier (for example, DataScale SN10-8).
GetSystemPlatformInfo example
# Import the Python gRPC package, data structures, and functions
# These should be available on a DataScale system with the `sambanova-runtime` package installed
import grpc
from pysnml.snml_rpc_pb2 import *
from pysnml.snml_rpc_pb2_grpc import *
# Assume SNML uses its default configuration
SERVER_HOST = "127.0.0.1"
SERVER_PORT = 50053
# Connect to the SNML server on this DataScale system
with grpc.insecure_channel(f'{SERVER_HOST}:{SERVER_PORT}') as channel:
    # A gRPC stub for the SNML RPC service - generated for us by gRPC's Python code
    stub = SNMLRPCStub(channel)
    # Call the API. The `request_id` field is provided for the caller's consumption
    # The `request_id` in the response type will be the same as the one provided
    # This is useful when the caller is sending multiple requests and wants to keep track
    # In this example, we'll just pass 0.
    query_response = stub.GetSystemPlatformInfo(
        ResourceQuery(request_id=0))
    # This object will have the following fields:
    # - platform_name (str): the platform name, for example "DataScale SN30-8"
    # - request_id (int): the request ID provided by the caller
    # - error_details (list of str): A list of strings specifying any errors that
    #   happened during the calling of this API. If no errors occurred, this list is empty
    # - ver_info (SystemVersionInfo): the version information of the SNFM databases
    #   on this system. Mostly not interesting to consumers.
    print(query_response)
Check an RDU’s operational state
Each RDU’s operational state is tracked internally by SambaNova Runtime. The operational state reflects whether the RDU can be used to run SambaFlow applications. An RDU may be in a specific operational state due to physical presence, fault management policy, hardware issues, or other reasons.
- Fault state is always hardware related.
- Operational state could be related to hardware or policy. For details about RDU operational states, see Check the status of RDUs.
You can retrieve RDU operational states using the GetRDUOperationalStatus API, as shown in the following sample code:
GetRDUOperationalStatus example
# Import the Python gRPC package, data structures, and functions
# These should be available on DataScale systems with the sambanova-runtime package installed
import grpc
from pysnml.snml_rpc_pb2 import *
from pysnml.snml_rpc_pb2_grpc import *
# Assume SNML uses its default configuration
SERVER_HOST = "127.0.0.1"
SERVER_PORT = 50053
# Connect to the SNML server on this DataScale system
with grpc.insecure_channel(f'{SERVER_HOST}:{SERVER_PORT}') as channel:
    # A gRPC stub for the SNML RPC service
    stub = SNMLRPCStub(channel)
    # Each RDU has a component name like "XRDU_0/RDU_1"
    # It also has a unique 1-number identifier created from its component name.
    # To calculate the 1-number identifier, multiply the (XRDU number) by the
    # (number of RDUs per XRDU) [2]
    # and add the RDU number within the XRDU
    # This list reflects "XRDU_0/RDU_0" and "XRDU_0/RDU_1"
    rdu_ids = [0, 1]
    # Request the operational status of the RDUs we listed above
    query_response = stub.GetRDUOperationalStatus(
        RDUOperationalStatusRequest(request_id=0,
                                    rdu_ids=rdu_ids))
    # This is a generator object. You can iterate through it,
    # but it is not subscriptable
    for rdu in query_response:
        # The RDU ID is as specified above
        # The `comp_name` is something like "/NODE/XRDU_0/RDU_1"
        # The state is an enum with one of the following values:
        # - RDU_STATE_FUNCTIONAL
        # - RDU_STATE_PENDING
        # - RDU_STATE_DEGRADED
        # - RDU_STATE_UNAVAILABLE
        print(rdu.rdu_id, rdu.comp_name,
              RDUOperationalState.Name(rdu.state))
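The 1-number identifier described in the comments of the example can be computed with a small helper. The following is a minimal sketch, not part of pysnml: the function names `rdu_id` and `rdu_id_from_component` are hypothetical, and the default of 2 RDUs per XRDU is an assumption taken from the comment in the example, so it may differ on other platforms.

```python
import re

# Assumption: 2 RDUs per XRDU, as stated in the example's comments.
RDUS_PER_XRDU = 2


def rdu_id(xrdu: int, rdu: int, rdus_per_xrdu: int = RDUS_PER_XRDU) -> int:
    """Compute the 1-number RDU identifier from XRDU and RDU numbers."""
    if xrdu < 0 or not 0 <= rdu < rdus_per_xrdu:
        raise ValueError(f"invalid XRDU/RDU numbers: {xrdu}/{rdu}")
    return xrdu * rdus_per_xrdu + rdu


def rdu_id_from_component(comp_name: str,
                          rdus_per_xrdu: int = RDUS_PER_XRDU) -> int:
    """Parse a name such as "/NODE/XRDU_0/RDU_1" and return its identifier."""
    match = re.search(r"XRDU_(\d+)/RDU_(\d+)$", comp_name)
    if match is None:
        raise ValueError(f"unrecognized component name: {comp_name!r}")
    return rdu_id(int(match.group(1)), int(match.group(2)), rdus_per_xrdu)


# Build the rdu_ids list for "XRDU_0/RDU_0" and "XRDU_0/RDU_1"
rdu_ids = [rdu_id_from_component(name)
           for name in ("XRDU_0/RDU_0", "XRDU_0/RDU_1")]
```

The resulting list could be passed as the `rdu_ids` argument of the request shown in the example.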
Request RDU resets on unavailable RDUs
This example uses an SNML Admin API and requires root privileges.
An RDU may be usable or unusable based on its operational state. The following code example shows how to use SNML Admin to:
- Check if any of the RDUs are not healthy.
- Reset any RDUs that aren’t healthy.
- Poll the operational state until the RDU returns to RDU_STATE_FUNCTIONAL (fully healthy).
SNML admin example for resetting and polling an RDU
# Retrieving RDU operational state is a "plain" SNML API
# No root privileges are required
import grpc
from pysnml.snml_rpc_pb2 import *
from pysnml.snml_rpc_pb2_grpc import *
# Requesting RDU resets is an SNML Admin API
# Callers require root privileges, which are checked on the server side
from pysnml.snml_rpc_admin_pb2 import *
from pysnml.snml_rpc_admin_pb2_grpc import *
import time
import sys
# Assume SNML uses its default configuration
SERVER_HOST = "127.0.0.1"
SERVER_PORT = 50053
with grpc.insecure_channel(f'{SERVER_HOST}:{SERVER_PORT}') as channel:
    # Connect to the SNML admin and SNML servers
    # Each service requires a Stub
    stub = SNMLRPCStub(channel)
    admin_stub = SNMLRPCADMINStub(channel)
    # Launch an asynchronous RDU reset
    # Setting the field wait_for_reinit to False tells SND to launch the RDU's
    # post-reset initialization in the background
    # We will poll on the RDU's operational state later to
    # make sure that the reset finishes successfully.
    # Alternatively, `wait_for_reinit=True` tells SND to not return from this query
    # until the reset succeeds or fails
    rst_query_response = admin_stub.ManualRDUReset(
        ManualRDUResetRequest(request_id=0,
                              rdus_to_reset=[1],  # Reset RDU 1 (XRDU_0/RDU_1)
                              wait_for_reinit=False))
    print("Requested SND to reset RDU")
    # The RDU will be in state RDU_STATE_PENDING while the reset is in progress
    # and will return to RDU_STATE_FUNCTIONAL after it's complete
    while True:  # you could specify a maximum number of iterations here
        # Request the RDU's operational status on every pass. The response is a
        # generator, so it is exhausted after one iteration and must be requested again
        query_response = stub.GetRDUOperationalStatus(
            RDUOperationalStatusRequest(request_id=0,
                                        rdu_ids=[1]))
        print("Requested RDU's operational status from SND")
        for rdu in query_response:
            print(f"RDU {rdu.rdu_id} in state {RDUOperationalState.Name(rdu.state)}")
            if RDUOperationalState.Name(rdu.state) == "RDU_STATE_FUNCTIONAL":
                print("Successfully reset and polled status of RDU 1")
                sys.exit(0)
        time.sleep(5)
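The comment on the polling loop suggests specifying a maximum number of iterations. One way to bound the loop is sketched below. The helper `poll_until` and its parameters are hypothetical and not part of pysnml; the `fetch_state` callable stands in for whatever code requests the RDU's operational state and returns its name as a string.

```python
import time
from typing import Callable


def poll_until(fetch_state: Callable[[], str],
               target: str,
               max_attempts: int = 60,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call fetch_state() until it returns `target` or attempts run out.

    Returns True if the target state was observed, False otherwise.
    """
    for attempt in range(max_attempts):
        if fetch_state() == target:
            return True
        # Do not sleep after the final attempt
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False


# Stand-in state source for illustration; a real caller would query SNML here.
_states = iter(["RDU_STATE_PENDING", "RDU_STATE_PENDING", "RDU_STATE_FUNCTIONAL"])
reset_ok = poll_until(lambda: next(_states), "RDU_STATE_FUNCTIONAL",
                      max_attempts=5, interval_s=0.0)
```

Returning a boolean instead of calling `sys.exit` lets the caller decide what to do when the RDU does not recover in time, for example report the last observed state.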
About the SNML APIs
Before you start using SNML APIs, it’s useful to learn about the two services and how they are versioned. It’s also useful to understand the request_id and error_details fields that are used in most APIs.
Supported gRPC services
The SNML server provides two gRPC services with corresponding APIs:
- SNML APIs are for all DataScale users and do not require elevated privileges. Most of the APIs are read-only and safe for anyone to call without affecting the environment.
- SNML Admin APIs require root privileges and are for system administrators. These APIs include read-write operations (such as marking a system fault as clear) and read-only operations that return sensitive information.
The two services are separate gRPC services under one TCP port. gRPC uses the concept of a stub to abstract a group of APIs, and each service is accessed through its own stub. Both services are part of the SNML software layer.
gRPC services are accessible only from the same host and not over the network.
Versioning
The SNML services are both semantically versioned. Major version changes are potentially incompatible, minor changes are backward compatible, and patch changes are forward compatible.
Use the GetRuntimeVersion SNML API to retrieve the version number of the SNML service that you are querying.
The SNML server version has changed between SambaFlow 1.16 and 1.17. Please recompile any gRPC and Protobuf files you have created.
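The semantic versioning rule can be turned into a simple client-side check: only a difference in the major version signals a potential incompatibility. The sketch below assumes the versions are available as `MAJOR.MINOR.PATCH` strings; that string format and the helper names `parse_version` and `is_compatible` are assumptions for illustration, since the layout of the GetRuntimeVersion response is not shown on this page.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split an assumed "MAJOR.MINOR.PATCH" string into three integers."""
    parts = version.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


def is_compatible(client_version: str, server_version: str) -> bool:
    """Treat versions as compatible when their major numbers match."""
    return parse_version(client_version)[0] == parse_version(server_version)[0]


# Hypothetical version strings, for illustration only
same_major = is_compatible("1.2.3", "1.4.0")
different_major = is_compatible("1.2.3", "2.0.0")
```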
SNML error_details and request_id fields in SNML APIs
Many of the APIs accept request_id as input data and populate error_details if SNML errors occur.
request_id
- The input data types of all SNML APIs have an optional request_id field. This field can be useful if the caller plans to send several API requests of the same type to the server, for example in an asynchronous or threaded environment. Because the request_id returned by SNML in an API’s output data structure always matches the request_id in the input data structure, you can check which request maps to which return.
- Some API calls don’t require any input data. For these, the message ResourceQuery is provided. It has only one field, which is the request_id described above. The ResourceQuery object is shown below.
message ResourceQuery {
    uint32 request_id = 1;
}
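To illustrate how request_id helps match responses to requests, the following sketch keeps a table of outstanding requests keyed by request_id. The `RequestTracker` class is hypothetical and not part of pysnml, and the responses here are stand-in objects that only carry a `request_id` attribute, as SNML responses are described to do.

```python
import itertools
from types import SimpleNamespace


class RequestTracker:
    """Hand out unique request IDs and map responses back to their requests."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending = {}

    def new_request(self, label: str) -> int:
        """Reserve a request ID and remember what it was used for."""
        request_id = next(self._ids)
        self._pending[request_id] = label
        return request_id

    def resolve(self, response) -> str:
        """Return the label of the request that produced this response."""
        return self._pending.pop(response.request_id)


tracker = RequestTracker()
first_id = tracker.new_request("platform info")
second_id = tracker.new_request("RDU status")

# Stand-in responses arriving out of order (real ones would come from SNML)
responses = [SimpleNamespace(request_id=second_id),
             SimpleNamespace(request_id=first_id)]
labels = [tracker.resolve(response) for response in responses]
```

Because request_id is a uint32, a long-running caller would need to keep the generated IDs within that range.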
gRPC return codes and error_details
- If an API request makes it all the way to the SNML server successfully, it returns the grpc::OK status code.
- If the client could not connect to the SNML server, the server crashed, or other issues occurred before the message reached SNML, the request results in a standard gRPC return code. See Troubleshooting SNML for some error codes.
- If an error occurs inside SNML, the error_details field is populated. The error_details field is a list of strings.
  - If error_details is an empty list, no errors occurred during the operation of the API.
  - If error_details is non-empty, it contains details about the failing parts of the SNML API.
- For information about the status of your DataScale system and its components, use the APIs in SNML API reference (all users) and SNML API reference (admin users).
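The SNML-level part of this error handling can be sketched as a small check on each response. The `SNMLError` exception and `check_response` helper below are hypothetical names, not part of pysnml, and the responses are stand-in objects with an `error_details` attribute. Transport-level failures are a separate case: in the Python gRPC client they surface as `grpc.RpcError` exceptions raised by the call itself, before any response exists.

```python
from types import SimpleNamespace


class SNMLError(RuntimeError):
    """Raised when a response carries a non-empty error_details list."""


def check_response(response):
    """Return the response unchanged, or raise if SNML reported errors."""
    errors = list(getattr(response, "error_details", []))
    if errors:
        raise SNMLError("; ".join(errors))
    return response


# Stand-in responses for illustration; real ones would come from a stub call
ok_response = check_response(SimpleNamespace(request_id=0, error_details=[]))

try:
    check_response(SimpleNamespace(request_id=1,
                                   error_details=["example failure"]))
    failure_message = None
except SNMLError as exc:
    failure_message = str(exc)
```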
Manage the SNML server
The SNML RPC server is a gRPC server that runs on the DataScale system’s host and is part of the SND service. Because the gRPC server runs as part of a Linux systemd service, it is managed by standard .service files.
In most cases, you don’t have to explicitly change the SNML server. You need root privileges to make changes to the SNML services.
Change the SNML server port
By default, the server runs on port 50053 while the SND system service is running.
To change the port that SNML listens on:
- Run systemctl edit snd.
- Set the SND environment variable SNML_SERVER as HOST:PORT. PORT can be any available port on the system.
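As an illustration of the steps above, the sketch below renders the text that could be entered in the override file opened by `systemctl edit snd`. It assumes that SNML_SERVER is set through a standard systemd `Environment=` directive in a `[Service]` section; this page does not confirm the exact override contents, and the helper name `render_snd_override` and the port 50054 are hypothetical.

```python
def render_snd_override(host: str, port: int) -> str:
    """Return drop-in text that sets SNML_SERVER to HOST:PORT.

    Assumption: SND reads SNML_SERVER from its systemd environment.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f'[Service]\nEnvironment="SNML_SERVER={host}:{port}"\n'


# Print an override for a hypothetical alternate port
print(render_snd_override("127.0.0.1", 50054), end="")
```

With systemd, a changed environment typically takes effect only after the service is restarted.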
Start or stop the SNML server
SNML is started as a component of the SND (SambaNova daemon) system service. If the SND service is running, the SNML server should also be running.
- To check if the SNML server is running, run systemctl status snd and check if you see Active: active (running). This command does not require root privileges.
- To start or stop the SNML server explicitly, use systemctl commands like systemctl restart snd or systemctl stop snd. See Manage SND.
Troubleshooting SNML
The following errors might occur when you’re using SNML:
SNML RPC call failure UNAVAILABLE
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1682552398.981568670","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3158,"referenced_errors":[{"created":"@1682552398.981567400","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":147,"grpc_status":14}]}"
RuntimeError: SNML RPC Call Failure UNAVAILABLE : failed to connect to SND
If you see a message like this, the SNML server could not be reached. Either SND is not running, or SambaNova Runtime is not installed, or the SNML server crashed while trying to service your API request.
- To find out whether SND is running, call systemctl status snd. If SND is not running, call systemctl start snd. The start command requires root privileges.
- To find out whether Runtime is installed, call apt list --installed (or the yum equivalent). If Runtime is not installed, call apt install sambanova-runtime.
- To find out if the SND server crashed, check the systemctl restart history for SND. If there’s a problem with repeated crashes, contact SambaNova Customer Support.
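Before working through these checks, it can help to confirm whether anything is listening on the SNML port at all. The following sketch uses only the Python standard library; the helper name `snml_port_open` is hypothetical, and the default host and port are the defaults described on this page. A successful TCP connection shows only that some process is listening, not that it is SNML.

```python
import socket


def snml_port_open(host: str = "127.0.0.1",
                   port: int = 50053,
                   timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    if snml_port_open():
        print("A server is listening on the default SNML port")
    else:
        print("Nothing is listening on the default SNML port")
```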
SNML RPC UNIMPLEMENTED error code
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = "API not implemented"
debug_error_string = "{"created":"@1682552398.981568670","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3158,"referenced_errors":[{"created":"@1682552398.981567400","description":"unimplemented","file":"src/core/lib/transport/error_utils.cc","file_line":147,"grpc_status":14}]}"
The UNIMPLEMENTED error code occurs when SNML cannot find the API you have requested. Here are some possible reasons:
- You are requesting an API that is not part of the service you are using (or there’s a typo).
- The version number of the SNML server and of your SNML client do not match. Make sure you build your client against the protobuf file from the correct version of SambaNova Runtime. SNML is backward compatible across minor versions and forward compatible across patch versions, so version incompatibility is only a possible problem if the major version has changed.
  - To see the client-side version, check the protobuf file.
  - To see the server-side version, use the GetRuntimeVersion SNML API.