DataScale SN30 Rack System Administration

Copyright © 2020-2023 by SambaNova Systems, Inc. All contents are subject to a licensing agreement with SambaNova Systems, Inc. Any disclosure, reproduction, distribution, reverse engineering, or any other use made without the advance written permission of SambaNova Systems, Inc. is unauthorized and strictly prohibited. All rights of ownership and enforcement are reserved.


1. Get started with DataScale SN30 rack administration

This SambaNova DataScale® hardware administration document targets the SN30 version of the SambaNova DataScale rack.

This page gets you started:

  • Learn about SambaNova support, SambaNova documentation, and other resources.

  • Get an overview of the DataScale hardware and software stacks.

See the DataScale hardware installation documentation for details on hardware installation requirements and tasks.

1.1. SambaNova support

SambaNova customers that have valid support contracts can contact support and obtain product support documentation through the SambaNova support portal at https://support.sambanova.ai.

1.2. SambaNova documentation

As part of hardware installation, you might need SambaNova documentation, SambaNova KBs, and third-party documentation.

1.3. Third-party documentation

For operational issues with the third-party components in the DataScale SN30 rack, see the following vendor-specific product documentation. If you need additional support or have troubleshooting questions, open a support case through SambaNova Support. See KB article #1017, "SambaNova Systems Support Best Practices," at https://support.sambanova.ai.

Do not open a support case with the product vendor.

1.4. Overview of DataScale SN30 rack hardware

The DataScale SN30 is self-contained in a standard 42 rack unit (RU) datacenter rack. Different configurations are available for purchase, depending on customer requirements (including data center requirements). System population begins at the bottom of the rack with node 1 and increments up the rack. Network switches and other equipment are installed at the top of the rack.

A DataScale SN30 rack system consists of:

  • SN30-2 modules. Four DataScale SN30-2 RDU modules. Each DataScale SN30-2 module contains two Reconfigurable Dataflow Units™ (RDUs), for a total of eight RDUs per DataScale SN30 rack system. The RDUs are managed by the SambaFlow software stack running on the host.

  • Host module. An x86-based DataScale SN30-H host module running either Red Hat® Enterprise Linux® or Ubuntu® Linux.

Both the DataScale SN30-2 RDU module and the DataScale SN30-H module are 2RU chassis.

Switch equipment at the top of the rack provides a data network and an access network by default. The following image and table identify the main components in the DataScale SN30 rack.

Figure 1. DataScale SN30 rack components (front view)
Table 1. DataScale SN30 rack components

No.  Component
1    System 1 SN30-8 (SN30-H)
2    System 1 SN30-8 (four SN30-2)
3    System 2 SN30-8 (SN30-H)
4    System 2 SN30-8 (four SN30-2)
5    Juniper® QFX5130 Ethernet (fan-side)
6    Lantronix® serial console server (Juniper EX series switch behind)

1.5. SambaNova DataScale software stack

The software stack consists of the following components:

  • Host module OS. At the bottom of the stack is the host module OS, either RHEL or Ubuntu.

  • SambaFlow. SambaFlow™ is the software stack that runs on SambaNova systems. It includes:

    • SambaFlow Runtime. Responsible for communication with the DataScale hardware including hardware initialization, error handling, resource management, and interfacing with userspace processes requesting hardware resources.

    • Compilers. Proprietary compilers make your models available to the DataScale hardware.

    • SambaFlow Python SDK which developers use to create and run models.

The SambaFlow software is installed and executed on the SN30-H host modules.

The SambaFlow documentation (SambaFlow SDK and SambaFlow Runtime) describes the software stack, model development, and deployment. See https://docs.sambanova.ai.

1.5.1. DataScale SN30 host module OS

The DataScale SN30-H host module in each system supports the following OS (operating system) versions:

  • Red Hat Enterprise Linux 8.5

  • Ubuntu Server 20.04.2 Long-Term Support (LTS)

Both images are preinstalled on each SN30-H host module.

SambaNova provides updates for the OS images and updates for the software components through a repository that is described in Connecting to the SambaNova OS repository.

1.5.2. How to identify the SambaFlow software version

The command you run to identify the version of the SambaFlow software packages that are installed on the DataScale SN30-H host modules depends on the OS that is running on the module.

  • RHEL. Identify the software version on RHEL:

# dnf list installed | grep 'samba[nf]'

The command results in output that starts like the following (the exact output depends on the SambaFlow version you are using):

sambaflow.x86_64                              1.12.7-15.el8
sambaflow-apps-datascale-image-unet.x86_64    1.12.7-15.el8
sambaflow-apps-starters-logreg.x86_64         1.12.7-15.el8
sambaflow-cpp.x86_64                          1.12.7-15.el8
sambaflow-deps-capnproto.x86_64               0.8.0-1.el8
sambaflow-deps-isl.x86_64                     0.22-1.el8
sambaflow-deps-pillow-simd.x86_64             7.2.0.post1-1.el8
sambaflow-deps-venv.x86_64                    1.12.4-2.el8
sambaflow-exec.x86_64                         1.12.7-15.el8
sambaflow-tools-llvm11.x86_64                 11.0.0-3.rc1.el8
...

  • Ubuntu. Identify the software version on Ubuntu Linux:

# apt list --installed | grep 'samba[nf]'

The command results in output that starts like the following:

sambaflow-apps-datascale-language-transformers/focal,focal,now 1.13.0-2207251206 amd64
sambaflow-apps-starters-logreg/focal,focal,now 1.13.0-2207251206 amd64
sambaflow-cpp/focal,now 1.12.4-2203291247 amd64
sambaflow-deps-capnproto/focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,now 0.8.0-1 amd64
sambaflow-deps-pillow-simd/focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,now 7.2.0.post1-1 amd64
sambaflow-exec/focal,focal,now 1.13.0-2207251206 amd64
...
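Because the query command differs by OS, you could wrap both variants in one helper that runs on either host. The following is a minimal sketch, assuming that /etc/os-release reports ID="rhel" on RHEL 8.5 and ID=ubuntu on Ubuntu 20.04; the function names sambaflow_query_cmd and list_sambaflow_pkgs are hypothetical and are not part of SambaFlow.

```shell
#!/usr/bin/env bash
# Sketch: list installed SambaFlow packages with the tool that matches the host OS.
# Hypothetical helper names; assumes the standard /etc/os-release ID values.

# Print the package-listing command for the OS described by an os-release file.
sambaflow_query_cmd() {
  local os_release="${1:-/etc/os-release}" id
  # Source the file in a subshell so its variables do not leak into this shell.
  id=$(. "$os_release" && printf '%s' "$ID")
  case "$id" in
    rhel)   echo "dnf list installed" ;;
    ubuntu) echo "apt list --installed" ;;
    *)      echo "unsupported OS ID: $id" >&2; return 1 ;;
  esac
}

# Run that command and keep only the SambaNova/SambaFlow package lines.
list_sambaflow_pkgs() {
  local cmd
  cmd=$(sambaflow_query_cmd "$@") || return 1
  # $cmd is deliberately unquoted so it splits into command and arguments;
  # stderr is discarded to hide apt's "unstable CLI interface" warning.
  $cmd 2>/dev/null | grep 'samba[nf]'
}

# Example: list_sambaflow_pkgs    # reads /etc/os-release by default
```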

1.6. Default username and passwords for components

The following table shows several components in the DataScale SN30 rack that have default passwords for users with administrative/root credentials. See Network administration for information on changing passwords for switches.

SambaNova highly recommends that you change these default passwords as soon as possible.
Do not use a slash character in a password for an XRDU. Both forward slash (/) and backward slash (\) can cause problems.
Table 2. Default usernames and passwords

Component                                             Username   Default password
Lantronix serial console server                       sysadmin   Changeme
Juniper QFX5130 high-bandwidth Ethernet data switch   root       Changeme
Juniper EX series access switch                       root       Changeme
DataScale SN30-2/XRDU BMC                             root       1Changeme
DataScale SN30-H BMC                                  admin      Changeme
DataScale SN30-H OS                                   root       Changeme
DataScale SN30-H OS                                   snuser1    Changeme
Vertiv™ PDU                                           admin      Changeme

NOTE: The DataScale SN30-H BMC password must not exceed 14 characters.
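The notes around Table 2 set two password constraints: XRDU BMC passwords must not contain a forward slash (/) or backward slash (\), and the SN30-H BMC password must not exceed 14 characters. The following is a minimal sketch that pre-checks a candidate password before you set it; the function name check_new_password and the component labels xrdu-bmc and sn30h-bmc are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pre-check a new password against the constraints noted with Table 2.
# Hypothetical function name and component labels (xrdu-bmc, sn30h-bmc).

# Usage: check_new_password <xrdu-bmc|sn30h-bmc|other> <password>
check_new_password() {
  local component=$1 pw=$2
  case "$component" in
    xrdu-bmc)
      # Reject any forward slash or backward slash.
      case "$pw" in
        */*|*\\*)
          printf '%s\n' 'XRDU BMC passwords must not contain / or \' >&2
          return 1 ;;
      esac ;;
    sn30h-bmc)
      # Reject passwords longer than 14 characters.
      if [ "${#pw}" -gt 14 ]; then
        printf '%s\n' 'SN30-H BMC passwords must not exceed 14 characters' >&2
        return 1
      fi ;;
    *) ;;  # No extra rule is documented here for other components.
  esac
  return 0
}
```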

By default, the operating system on SN30-H is configured with a user snuser1 which has superuser privileges (i.e. can run sudo commands). The post-install test of the system uses this user to run example applications. For security reasons SambaNova recommends that you delete this user after the test is completed. You can then create your own users or configure the system to use a company-wide LDAP server.
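To replace snuser1 with your own administrative user, the new user needs membership in the group that the stock sudoers policy grants sudo rights to: wheel on RHEL and sudo on Ubuntu. The following sketch picks that group from /etc/os-release. The function name admin_group_for and the user name alice are placeholders, and the commented sequence assumes you run it from a root session rather than as snuser1.

```shell
#!/usr/bin/env bash
# Sketch: choose the group that grants sudo rights under the stock sudoers policy.
# Assumption: unmodified sudoers (%wheel on RHEL, %sudo on Ubuntu).
# The function name is hypothetical.
admin_group_for() {
  local os_release="${1:-/etc/os-release}" id
  id=$(. "$os_release" && printf '%s' "$ID")
  case "$id" in
    rhel)   echo wheel ;;
    ubuntu) echo sudo ;;
    *)      echo "unsupported OS ID: $id" >&2; return 1 ;;
  esac
}

# Example sequence from a root session ("alice" is a placeholder user name):
#   useradd -m -G "$(admin_group_for)" alice
#   passwd alice
#   # Log in as alice and confirm that sudo works, then remove snuser1
#   # and its home directory:
#   userdel -r snuser1
```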

2. Network administration

This page has information about network administration for the DataScale® SN30 rack.

  • Pointers to third-party documents for the network devices.

  • Instructions for changing passwords for the network devices.

  • Examples of the DataScale SN30 rack IP address assignments for the management, access, and data networks, as described in the DataScale hardware installation documentation. The actual IP addresses depend on the subnets and host IP addresses in the Pre-Delivery Site Survey document that your company provided before delivery and installation of the DataScale SN30 rack.

In a single-node DataScale deployment, an amber light appears on port 16 of the QFX5130. This is expected behavior for this switch.

2.1. Network device administration

Most users do not configure the serial console server and the Juniper access switch. This topic discusses only tasks that you’re likely to perform and includes sample IP addresses. For more information:

2.1.1. Change default passwords for switches

SambaNova highly recommends that you change the default passwords at first login.

This section gives instructions for changing passwords on switches. Change the default passwords for the other components as well. See Default username and passwords for components.

Change password for the Juniper EX series access switch and QFX5130 data switch:

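  1. Run the following commands. The set system root-authentication plain-text-password command prompts you to enter and confirm the new password.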

    $ ssh root@<Juniper_switch_IP_address>
    root@:RE:0% cli
    root> configure
    root# set system root-authentication plain-text-password
    root# commit
  2. Log out of the switch by running the exit command three times (exit configuration mode, exit operational mode, and exit the Linux shell).

  3. Log back in with the new password.

Lantronix SLC8000 serial console server:

Run the following command:

$ ssh sysadmin@<Lantronix_switch_IP_address>
> set localusers password sysadmin

2.1.2. Patch releases for network devices

SambaNova provides periodic patch releases for these network devices. You can download these patches from the SambaNova ext-infra-patch repository. See KB article #1062 "Listing and downloading available SN30 rack firmware" for details.

Patch release notes explain any steps that differ from the standard steps described in the specific product administration documentation.

2.2. IP address assignments for the access and management network

The management and access networks share the same 1GbE switch. Depending on customer requirements, they can be a single network or two networks separated by VLANs. The example IP addresses in the table below assume that the customer chose to merge the access and management networks into a single network.

Table 3 shows examples for the access and management network IP address assignments for components such as the BMC (baseboard management controller), the switch equipment, and the PDUs in the DataScale SN30 rack.

The addresses in the Example IP address (10.0.1.0/24) column assume a customer who provided a 10.0.1.0/24 subnet. Assignments start at .16 in the last octet because .1 through .4 are placeholders for customer networking infrastructure, such as the gateway, and .5 through .15 are reserved for SambaNova usage.

Table 3. Access network IP address assignments

Example IP address (10.0.1.0/24)   Component                          System #
10.0.1.1-4                         Reserved for customer infra        -
10.0.1.5-15                        Reserved for SambaNova             -
10.0.1.16                          Serial console server              -
10.0.1.17                          Access/Mgmt switch                 -
10.0.1.18                          Data switch                        -
10.0.1.19                          PDU 1                              -
10.0.1.20                          PDU 2                              -
10.0.1.21                          PDU 3                              -
10.0.1.22                          PDU 4                              -
10.0.1.23                          SN30-H-1 OS (eth0)                 System 1
10.0.1.24                          SN30-H-1 BMC                       System 1
10.0.1.25                          SN30-H-1-XRDU0 BMC                 System 1
10.0.1.26                          SN30-H-1-XRDU1 BMC                 System 1
10.0.1.27                          SN30-H-1-XRDU2 BMC                 System 1
10.0.1.28                          SN30-H-1-XRDU3 BMC                 System 1
10.0.1.29                          SN30-H-2 OS (eth0)                 System 2
10.0.1.30                          SN30-H-2 BMC                       System 2
10.0.1.31                          SN30-H-2-XRDU0 BMC                 System 2
10.0.1.32                          SN30-H-2-XRDU1 BMC                 System 2
10.0.1.33                          SN30-H-2-XRDU2 BMC                 System 2
10.0.1.34                          SN30-H-2-XRDU3 BMC                 System 2
x.x.x.255                          Broadcast IP address for network   -
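The addresses in Table 3 follow a regular pattern: shared devices occupy .16 through .22, and each system then gets six consecutive addresses (OS, BMC, and the four XRDU BMCs), starting at .23 for System 1 and .29 for System 2. The following sketch computes an address from that pattern. It assumes the example layout exactly, and the function name access_ip is hypothetical; always confirm addresses against your customer-specific IP assignment worksheet.

```shell
#!/usr/bin/env bash
# Sketch: derive an example access-network address from the Table 3 layout.
# Assumptions: the example 10.0.1.0/24 plan, two systems, and six consecutive
# addresses per system starting at .23. Your site survey may differ.

# Usage: access_ip <first three octets> <system 1|2> <os|bmc|xrdu0..xrdu3>
access_ip() {
  local prefix=$1 system=$2 role=$3 base offset
  case "$system" in
    1|2) ;;
    *) echo "unknown system: $system" >&2; return 1 ;;
  esac
  base=$(( 23 + 6 * (system - 1) ))   # System 1 starts at .23, System 2 at .29
  case "$role" in
    os)        offset=0 ;;                        # SN30-H OS (eth0)
    bmc)       offset=1 ;;                        # SN30-H BMC
    xrdu[0-3]) offset=$(( 2 + ${role#xrdu} )) ;;  # SN30-2 (XRDU) BMCs
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
  echo "$prefix.$(( base + offset ))"
}

# Example: access_ip 10.0.1 2 xrdu0    # prints 10.0.1.31
```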

2.3. IP address assignments for the data network

Table 4 shows examples for the high-bandwidth data network IP address assignments for the compute components in the DataScale SN30 rack.

The example IP addresses shown in the Example IP address (10.0.2.0/27) column assume a customer who provided a 10.0.2.0/27 subnet.

Table 4. Data network IP address assignments

Example IP address (10.0.2.0/27)   Component                          System #
x.x.x.1-4                          Reserved for customer infra        -
10.0.2.5                           SN30-H-1 snhni0                    System 1
10.0.2.6                           SN30-H-2 snhni0                    System 2
10.0.2.31                          Broadcast IP address for network   -
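The broadcast entries in Tables 3 and 4 are the last address of each subnet, with every host bit set to 1. For a /27 there are 32 - 27 = 5 host bits, so a subnet that starts at 10.0.2.0 spans 10.0.2.0 through 10.0.2.31. The following sketch performs that calculation with shell integer arithmetic, which you could use to check the addresses on your own worksheet; the function names are hypothetical and the input is not validated.

```shell
#!/usr/bin/env bash
# Sketch: compute the broadcast address of an IPv4 subnet given in CIDR form.
# Hypothetical function names; uses only shell integer arithmetic.

# Convert dotted-quad a.b.c.d to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1                          # split a.b.c.d into four fields
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Convert a 32-bit integer back to dotted-quad form.
int_to_ip() {
  local n=$1
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

# Usage: broadcast_addr <network>/<prefix length>
broadcast_addr() {
  local net=${1%/*} len=${1#*/} n mask
  n=$(ip_to_int "$net")
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))  # network bits set to 1
  int_to_ip $(( (n & mask) | (~mask & 0xFFFFFFFF) ))    # set all host bits to 1
}

# Example: broadcast_addr 10.0.2.0/27    # prints 10.0.2.31
```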

3. DataScale SN30 power management

For proper operation of the DataScale® SN30 rack and to prevent issues, be sure you power on and power off the system appropriately and in the correct sequence, as described on this page.

3.1. Warnings and general notes

The following notices apply to the DataScale SN30 rack.

Some components within the rack work at high voltage. To prevent personal injury and voiding of the warranty, do not attempt to service components except where noted.
To protect the DataScale SN30 rack from interference and to prevent damage to its components, keep the front and rear rack doors closed during standard operation.
To prevent DataScale SN30 rack components from overheating, keep the front and rear of the rack clear of obstructions to allow proper airflow.
Before powering on the DataScale SN30 rack, read the SambaNova DataScale SN30 Rack Release Notes, included in SN30 hardware installation (at https://docs.sambanova.ai) to ensure that you understand any known issues or limitations. If you do not read the release notes, you might incorrectly configure the system components or software, which might necessitate a factory reset.
Do not power off or reboot the DataScale SN30 rack components during any firmware update procedure. Doing so might damage the DataScale SN30 rack components, and damaged components might not be recoverable. Perform a shutdown or reboot only after a firmware update has been completed.
When the PDUs are physically connected to the datacenter’s power receptacles, power is not immediately applied to the rack components, because the breakers on the PDUs are turned off. You must manually turn on these breakers to begin feeding power to the DataScale SN30 rack components. When power reaches the components, they all begin to power on; their fans initially run at full speed and ramp down after the BMCs finish their boot sequence.

3.2. Process overview

To avoid damage to the system, perform the power-on procedure and the graceful shutdown in the correct order.

To turn on the DataScale SN30 rack, follow the detailed steps below. Here’s an overview:

  1. Power on the DataScale SN30 rack by turning on the circuit breakers for each PDU.

  2. Boot the DataScale SN30-2 RDU modules

  3. Boot the DataScale SN30-H host module

To gracefully shut down the DataScale SN30 rack, follow the detailed steps in Gracefully shutting down the DataScale SN30 rack. Here’s an overview:

  1. Shut down the SN30-H host modules

  2. Shut down the DataScale SN30-2 RDU modules

3.3. Power on the DataScale SN30 rack

Power on the DataScale SN30-2 RDU modules before you power on the DataScale SN30-H host modules, as described in the following steps.
  1. Turn on the six circuit breakers for each PDU.

    When the PDUs are plugged into the datacenter power and you close the circuit breakers, power is automatically applied to the DataScale SN30 rack components. Figure 2 shows a PDU circuit breaker group with breaker switch 6 circled. Each PDU has six circuit breakers, grouped in banks of three.

    Figure 2. Circuit breakers on PDU

    The DataScale SN30-H host modules and DataScale SN30-2 RDU modules boot into standby mode and wait to be manually powered on. The BMC/service processors are powered on through these devices. The networking equipment in the rack does not go into standby mode; instead, it completely boots when power is established.

    SambaNova uses networking equipment from other suppliers. See Third-party documentation.

3.4. Boot the DataScale SN30-2 modules

Boot the DataScale SN30-2 RDU modules by using SSH to connect to the SN30-2 BMC, by sending an API call to the SN30-2 BMC, or by pressing the power button on each module. This section includes steps for each option.

3.4.1. Option 1: Use SSH to connect to the SN30-2 BMC

  1. From a system that has access to the DataScale SN30 rack access network, open a terminal session and use ssh to securely connect to the first DataScale SN30-2 RDU module in each system.

    See the IP address assignment information in Network administration or use your customer-specific IP assignment worksheet to get the IP address to connect to. The first DataScale SN30-2 RDU module in each system is as follows:

    System 1: SN30-2-1 (SN30-H-1-XRDU0)

    System 2: SN30-2-5 (SN30-H-2-XRDU0)

    Here’s an example for system 1 that assumes IP address subnet 10.0.1.0/24 for the access network:

    $ ssh root@10.0.1.25
    root@10.0.1.25’s password: <Enter root password>
    root@xrdu:~#
  2. Run the following xrduutil command to power on the system:

    root@xrdu:~# xrduutil -U root -P <root_password> poweron
  3. To ensure that the DataScale SN30-2 RDU modules are up before you boot the DataScale SN30-H host module, check the status of each module by running this command:

    root@xrdu:~# xrduutil -U root -P <root_password> powerstate
    Power is on for XRDU_0
    Power is on for XRDU_1
    Power is on for XRDU_2
    Power is on for XRDU_3
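If you script the power-on sequence, you can check the powerstate output programmatically before booting the SN30-H host module. The following sketch assumes the output format shown above (one "Power is on for XRDU_n" line per module); the function name all_xrdus_on is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if xrduutil powerstate output (read from stdin)
# reports power on for all four modules, XRDU_0 through XRDU_3.
# Assumes the line format shown in the example output in this section.
all_xrdus_on() {
  local count
  # Record each distinct XRDU name ($NF) seen on a "Power is on" line.
  count=$(awk '/^Power is on for XRDU_[0-3]/ { seen[$NF] = 1 }
               END { n = 0; for (k in seen) n++; print n }')
  [ "$count" -eq 4 ]
}

# Example (the function runs locally on the ssh output):
#   ssh root@10.0.1.25 "xrduutil -U root -P '<root_password>' powerstate" \
#     | all_xrdus_on && echo "All four SN30-2 modules are on"
```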

3.4.2. Option 2: Send a REST API call to the SN30-2 BMC

  1. Generate a token (recommended). If you use the REST API, SambaNova recommends that you use token-based authentication so that plain-text passwords are not sent over the network for REST API commands. See Generate a secure API login token for details.

  2. Run the REST API power-on command against the BMC of each DataScale SN30-2 RDU module in each system. The order does not matter.

    Format:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.On"}' https://<SN30-2_BMC_IP>/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition

    Example:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.On"}' https://10.0.1.25/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition
  3. To ensure the DataScale SN30-2 RDU modules are up before you boot the SN30-H, run the following command against each of the DataScale SN30-2 RDU modules:

    Format:

    $ curl -b cjar -k -H "X-Auth-Token: $token" https://<SN30-2_BMC_IP>/xyz/openbmc_project/state/chassis0

    Example:

    $ curl -b cjar -k -H "X-Auth-Token: $token" https://10.0.1.25/xyz/openbmc_project/state/chassis0

    After an SN30-2 RDU module is powered on, the output looks similar to the following:

    {
    "data": {
    "CurrentPowerState": "xyz.openbmc_project.State.Chassis.PowerState.On",
    "LastStateChangeTime": 1591197275103,
    "POHCounter": 75,
    "RequestedPowerTransition": "xyz.openbmc_project.State.Chassis.Transition.On"
    },
    "message": "200 OK",
    "status": "ok"
    }
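The REST calls above differ only in the BMC address and the transition name, so you could wrap them in shell functions. This is a minimal sketch, assuming that $token already holds a token from Generate a secure API login token and that each SN30-2 BMC accepts the paths and payloads shown above. The names xrdu_transition_payload, xrdu_request_power, and xrdu_current_power_state are hypothetical, and the JSON is parsed with sed so that the sketch does not depend on jq being installed.

```shell
#!/usr/bin/env bash
# Sketch: helpers around the SN30-2 BMC REST calls shown in this section.
# Assumes $token is set and the BMC exposes the chassis0 paths used above.

# Print the JSON body for a chassis power transition (On or Off).
xrdu_transition_payload() {
  case "$1" in
    On|Off) printf '{"data":"xyz.openbmc_project.State.Chassis.Transition.%s"}' "$1" ;;
    *) echo "usage: xrdu_transition_payload On|Off" >&2; return 1 ;;
  esac
}

# Request a power transition on one SN30-2 BMC: $1 = BMC IP, $2 = On or Off.
xrdu_request_power() {
  local payload
  payload=$(xrdu_transition_payload "$2") || return 1
  curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d "$payload" \
    "https://$1/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition"
}

# Read a chassis0 state response on stdin; print the power state suffix (for example, On).
xrdu_current_power_state() {
  sed -n 's/.*"CurrentPowerState": *"xyz\.openbmc_project\.State\.Chassis\.PowerState\.\([A-Za-z]*\)".*/\1/p'
}

# Example for System 1, using the Table 3 example addresses:
#   for ip in 10.0.1.25 10.0.1.26 10.0.1.27 10.0.1.28; do
#     xrdu_request_power "$ip" On
#   done
#   curl -b cjar -k -H "X-Auth-Token: $token" \
#     https://10.0.1.25/xyz/openbmc_project/state/chassis0 | xrdu_current_power_state
```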

3.4.3. Option 3: Mechanical power-on

To power on the SN30-2 resources:

  1. Press the power button on the front panel of the SN30-2 for 5 seconds. The panel is on the front left side of the system. The power button is item 1 in Figure 3, SN30 front panel (annotated).

  2. Wait for the system LED (callout item 2) to go from blinking to solid green light.

    Figure 3. SN30 front panel (annotated)
  3. When the system LED is no longer blinking, the SN30-2 resources are being powered on. This power on process can take up to a minute.

  4. Repeat the process for each SN30-2 module in the SN30-8 system.

3.5. Power on the DataScale SN30-H host module

To ensure that the DataScale SN30-H host module populates the system device tree properly, power on the host module only after the DataScale SN30-2 RDU modules are powered on fully.

Boot the DataScale SN30-H host module by using the mechanical power button, IPMI, or the BMC WebUI. This section discusses each option.

3.5.1. Option 1: Mechanical power on

To power on the SN30-H host module, press the power button located on the front panel of the SN30-H. This panel is located on the front left side of the server.

Power button

3.5.2. Option 2: Power on via IPMI

Run the following command from a system that has ipmitool installed and that has access to the SN30-H host module’s BMC via the access network.

$ ipmitool -I lanplus -H <SN30-H_BMC_IP_Address> -U root -P <root password> power on
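Passing the password with -P exposes it in the process list and shell history. ipmitool can instead read the password from the IPMI_PASSWORD environment variable when you pass -E. The following sketch wraps the command above that way; the helper names host_power and host_reports_on are hypothetical, and the BMC user is an argument (the command above uses root).

```shell
#!/usr/bin/env bash
# Sketch: call the SN30-H BMC through ipmitool without putting the password on
# the command line. With -E, ipmitool reads the password from IPMI_PASSWORD.
# Hypothetical helper names; pass the BMC user your site uses.

# Usage: host_power <SN30-H BMC IP> <BMC user> <on|off|status>
host_power() {
  ipmitool -I lanplus -H "$1" -U "$2" -E power "$3"
}

# Succeeds if "power status" output read from stdin reports the chassis as on.
host_reports_on() {
  grep -q 'Chassis Power is on'
}

# Example:
#   read -rs IPMI_PASSWORD && export IPMI_PASSWORD
#   host_power 10.0.1.24 root on
#   host_power 10.0.1.24 root status | host_reports_on && echo "SN30-H is on"
```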

3.5.3. Option 3: Power on via WebUI

To power on via the WebUI, your system must meet the following requirements:

  • Access to the DataScale SN30-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps:

  1. Open a web browser.

  2. In the browser’s address bar, enter the IP address of the SN30-H host module’s BMC.

  3. Log in to the management console by entering the user credentials.

    Login screen

  4. Click Sign me in.

  5. Select Power Control from the BMC dashboard.

    Dashboard

  6. Select the Power On checkbox, and then click Perform Action.

    Power On

Perform this boot sequence for all nodes in the DataScale SN30 rack. The order in which you bring up the nodes does not matter.

3.6. Gracefully shutting down the DataScale SN30 rack

This procedure gracefully shuts down the DataScale SN30 rack, but it does not remove power from the rack. Follow these steps for each node in the DataScale SN30 rack.

3.6.1. Shut down the SN30-H host modules

Shut down the SN30-H host module in each system by using one of the following methods:

Option 1: Shut down from the OS

Log in to the node via ssh as snuser1 and initiate a shutdown command.

$ ssh snuser1@<SN30-H_OS_IP_Address>
snuser1@SN30-H1’s password: <password>
$ sudo shutdown

This command does not shut down the system immediately but waits about a minute for users to save their work.

Option 2: Power off via IPMI
  1. Ensure that your system has:

    • Access to the SN30-H host module’s BMC via the access network

    • ipmitool installed

  2. Run the following command:

$ ipmitool -I lanplus -H <SN30-H_BMC_IP_Address> -U root -P <root password> power off
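The shutdown order is the SN30-H host modules first, then the SN30-2 RDU modules. Because sudo shutdown waits about a minute, you might poll the host BMC until it reports the chassis as off before you continue. The following sketch does that with ipmitool power status, assuming ipmitool is installed and IPMI_PASSWORD is exported for -E; the function name wait_for_host_off and its default retry count and interval (30 tries, 10 seconds apart) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wait until the SN30-H BMC reports the chassis as off.
# Assumes ipmitool is installed and IPMI_PASSWORD is exported (used via -E).
# Usage: wait_for_host_off <SN30-H BMC IP> <BMC user> [tries] [interval_seconds]
wait_for_host_off() {
  local bmc_ip=$1 user=$2 tries=${3:-30} interval=${4:-10}
  while [ "$tries" -gt 0 ]; do
    if ipmitool -I lanplus -H "$bmc_ip" -U "$user" -E power status \
         | grep -q 'Chassis Power is off'; then
      return 0
    fi
    tries=$(( tries - 1 ))
    sleep "$interval"
  done
  echo "host at $bmc_ip did not report power off" >&2
  return 1
}

# Example: wait_for_host_off 10.0.1.24 root && echo "Safe to shut down the SN30-2 modules"
```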
Option 3: Power off via WebUI

To power off via the WebUI, your system must meet the following requirements:

  • Access to the DataScale SN30-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps:

  1. Open a web browser.

  2. Enter the IP address of the SN30-H host module’s BMC in the browser’s address bar.

  3. Log in to the management console with your user credentials.

    Login screen

  4. Click Sign me in.

  5. Select Power Control from the BMC dashboard.

    BMC dashboard

  6. In the Power Actions screen, select the Power Off checkbox and click Perform Action.

    Power Off

3.6.2. Shut down the DataScale SN30-2 RDU modules

Shut down the DataScale SN30-2 RDU modules in the node using one of the following methods:

Option 1: Use SSH to connect to the DataScale SN30-2 BMC
  1. Open a terminal session from a system that has access to the DataScale SN30 rack access network.

  2. Use ssh to connect to the first DataScale SN30-2 in each node.

    See the IP address assignment information in Network administration or use your customer-specific IP assignment worksheet to get the IP address to connect to. The first DataScale SN30-2 RDU module in each system is as follows:

    System 1: SN30-2-1 (SN30-H-1-XRDU0)

    System 2: SN30-2-5 (SN30-H-2-XRDU0)

    Example for system 1, assuming IP address subnet 10.0.1.0/24 for the access network:

    $ ssh root@10.0.1.25
    root@10.0.1.25’s password: <Enter root password>
    root@xrdu:~#
  3. Run the xrduutil poweroff command:

    root@xrdu:~# xrduutil -U root -P <root_password> poweroff

Option 2: Send a REST API call to the DataScale SN30-2 BMC

SambaNova recommends that you use token-based authentication so that you do not send plain-text passwords over the network for REST commands. See Generate a secure API login token. After you generate the token, start shutting down the components:

  1. Run the REST API power-off command for each of the DataScale SN30-2 RDU modules in each of the systems.

    Format:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.Off"}' https://<SN30-2_BMC_IP>/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition

    Example:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.Off"}' https://10.0.1.25/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition
  2. Shut down the Juniper QFX5130 high-bandwidth data switch, the Lantronix SLC8000 serial console server, and the Juniper EX series access switch.

    When you power down the entire DataScale SN30 rack, shut down the Juniper EX series access switch last, because that switch controls the final access to the system via the network.

    See the product-specific documentation listed under Third-party documentation for information on how to shut down each of these network devices:

After you shut down these switches, you can no longer reach the PDUs over the network to cycle outlets. To cycle power properly, you must manually switch the relevant breakers off and back on at the physical PDU.

4. Host module OS administration

Administrative tasks differ depending on which supported OS you are running on each of the SN30-H host modules.

4.1. Supported versions of the SN30-H operating systems

The SN30-H host module supports the following OS versions:

  • Red Hat Enterprise Linux 8.5

  • Ubuntu Server 20.04.2 Long-Term Support (LTS)

4.2. General notes and warnings

Some third-party software and OS packages may prevent the SambaFlow™ software stack from functioning properly. In this case, SambaNova Support may require you to remove all non-certified third-party software and non-certified packages (including non-certified versions of packages) to return the DataScale® SN30-H host module to a satisfactory state before continuing to work on any support issues.
DataScale SN30-H host modules are configured with a default login password for users root and snuser1. SambaNova strongly recommends that you change these passwords immediately after logging in to a DataScale SN30-H host module.
SambaNova strongly recommends that you do not perform a major upgrade or a kernel update to the DataScale SN30-H host module OS without checking the supported OS, kernel, and package versions noted in this document and in the software release notes, because the SambaNova software relies on strict package dependencies. Do not perform any major updates unless SambaNova directs you to do so.
Before you perform Linux package updates, check the SambaFlow software release notes to ensure there are no package dependencies that might break the SambaFlow software if the packages are not at the correct level.

4.3. Licensing

SambaNova provides the package repositories for Red Hat Enterprise Linux and for Ubuntu running on the DataScale SN30 rack.

  • SambaNova has a partnership with Red Hat that allows SambaNova to distribute a customized repository for the DataScale SN30 rack.

  • SambaNova has a partnership with Ubuntu that allows SambaNova to distribute a customized repository for the DataScale SN30 rack.

Adding other repositories can cause issues with the operation of the SambaFlow software because of some package and kernel version dependencies.

If the SambaNova software stack has problems running, SambaNova Support might request that you remove any packages that were not originally included in your Linux repository or that you downgrade certain packages to a certified version.

4.4. Login process

To access the DataScale SN30-H host module for the first time:

  1. Find a system that can access the DataScale SN30 rack access network. The access network might be combined with the management or data network.

  2. Use ssh as user snuser1 to log in to the DataScale SN30-H host module.

  3. Enter the default password for snuser1 when prompted. See Default username and passwords for components.

$ ssh snuser1@<SN30-H_OS_IP_Address>
snuser1@<SN30-H_OS_IP_Address>’s password: <Default Password>

SambaNova strongly recommends that you change the default password for root and snuser1. To change the snuser1 password, run the following command and enter the new password when prompted:

$ passwd
Changing password for snuser1.
(current) UNIX password: <Current_Default_Password>
Enter new UNIX password: <New_Secure_Password>
Retype new UNIX password: <New_Secure_Password>
passwd: password updated successfully

4.5. Connect to the SambaNova OS repository

DataScale SN30-H host module connectivity to the SambaNova repository is set up as part of the DataScale SN30 rack installation and relies on the site survey that your company completed. As part of the initial installation, SambaNova provides a sambanova.repo file that contains the appropriate credentials and paths to your specific repository.

If you need to check the setup for the SambaNova OS repository, see KB article #1057.

4.6. OS repository configuration file

Do not modify the sambanova.repo repository file. Doing so can break SambaFlow software package dependencies, which might cause unrecoverable package dependency issues. You might have to rebuild the SN30-H host module as a result. If you need any packages that are not provided by SambaNova, open a support case with SambaNova Support.

4.7. Updating the DataScale SN30-H host module OS

SambaNova patch releases handle major upgrades to the DataScale SN30-H host module OS, for example:

  • Going from RHEL 8.5 to RHEL 8.6 or later

  • Going from Ubuntu 20.04 LTS to 22.04 LTS

  • Kernel updates.

See the SambaFlow Release Notes documentation for information about commands that you need to run to perform the upgrade.

4.8. Updating the SambaFlow software

To update the SambaFlow software packages, log in to the DataScale SN30-H host module(s) where the software packages need to be updated. The commands you run depend on the OS you’re using.

4.8.1. Update SambaFlow on RHEL

To view what packages are installed on the DataScale SN30-H host module, run the following command:

$ dnf list installed | grep 'samba[nf]'

To view which SambaFlow packages have an update that you can apply, run the following command:

$ dnf check-update | grep 'samba[nf]'

To update the SambaFlow packages, examine the check-update command output, and then run the following command to update a package and any package dependencies:

$ sudo dnf update <package_name>

For example, if the output produced by the check-update command shows that an update is available for the sambaflow package, run the following command:

$ sudo dnf update sambaflow

Repeat this step for each package that needs to be updated. Due to package dependencies, updating one package might update several other packages.
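Rather than repeating the update for each package, you could collect the upgradable SambaFlow package names first and update them in one transaction. The following is a minimal sketch, assuming the usual dnf check-update column layout (name.arch, version, repository); the function name sambaflow_updates_rhel is hypothetical. Matching only the first column avoids picking up unrelated packages whose repository ID happens to match the samba[nf] pattern.

```shell
#!/usr/bin/env bash
# Sketch: print the name.arch of each upgradable package whose name matches
# samba[nf] (for example, sambaflow), reading "dnf check-update" output on stdin.
# The function name is hypothetical.
sambaflow_updates_rhel() {
  awk '$1 ~ /^samba[nf]/ { print $1 }'
}

# Example (dnf check-update exits with 100 when updates are available):
#   pkgs=$(dnf check-update | sambaflow_updates_rhel)
#   [ -n "$pkgs" ] && sudo dnf update $pkgs
```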

4.8.2. Update SambaFlow on Ubuntu

To update the SambaFlow software packages, log in to the DataScale SN30-H host module(s) where the software packages need to be updated.

To view what packages are installed on the DataScale SN30-H host module, run the following command:

$ dpkg -l | grep 'samba[nf]'

To view which SambaNova packages have an update you can apply, refresh the package lists and then list the upgradable packages:

$ sudo apt update
$ apt list --upgradable | grep 'samba[nf]'

To update all the packages that need to be updated, run the following command, which updates the packages and any package dependencies:

$ sudo apt install --only-upgrade 'samba[nf]'

To update a specific package, replace samba[nf] with the name of a specific package. For example, to update sambaflow, run the following command:

$ sudo apt install --only-upgrade sambaflow
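A similar sketch for Ubuntu carries the same caveats. list_samba_upgradable and upgrade_samba_pkgs are hypothetical names. The parser assumes apt list output of the form name/suite version arch [upgradable from: old], and the package index is assumed to be current (for example, after sudo apt update).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with SambaFlow) for the Ubuntu update steps.

# Read `apt list --upgradable` output on stdin and print the name of each
# SambaNova package (sambaflow*, sambanova*) that can be upgraded.
list_samba_upgradable() {
  # Lines look like "name/suite version arch [upgradable from: old]";
  # the package name is everything before the first "/".
  awk -F/ '/^samba[nf]/ { print $1 }'
}

# Upgrade only the SambaNova packages that apt reports as upgradable.
upgrade_samba_pkgs() {
  local pkgs
  # apt warns on stderr that its CLI is unstable when piped; discard that.
  pkgs=$(apt list --upgradable 2>/dev/null | list_samba_upgradable)
  if [ -z "$pkgs" ]; then
    echo "No SambaNova package upgrades available."
    return 0
  fi
  # shellcheck disable=SC2086
  sudo apt install --only-upgrade $pkgs
}
```

Because the pattern requires n or f directly after "samba", the Samba file server package (samba) is not selected.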

5. BMC administration

When security patches are available or when BMC firmware updates are required for other reasons, you can perform the tasks in this section. Updating the BIOS is included with this BMC administration topic because the two tasks are usually performed at the same time. The tasks include:

  • Updating the DataScale® SN30-H host module BMC firmware

  • Updating the DataScale SN30-H host module BIOS

  • Recovering the DataScale SN30-H BMC

See View SN30-H BMC diagnostic information and logs for information on diagnostics.

5.1. General notes and warnings

Do not remove the admin user account or change this account’s password. This account is needed for password recovery of the DataScale SN30-H host module’s BMC.
Do not power off or reboot the DataScale SN30 rack components during firmware updates. Interrupting a firmware update can damage the DataScale SN30 rack components. The damaged component might not be recoverable. Perform a shutdown or reboot only after a firmware update has been completed successfully.
Settings on the BMCs do not need modification and remain static unless you are updating the BMCs, collecting diagnostic material, or changing the login credentials. Do not make configuration changes to the BMC unless you are instructed to do so.

5.2. Updating the DataScale SN30-H host module BMC firmware

If you start the firmware update process and then decide to cancel it, you must reset the BMC. To do so, close the web browser that was logged in to the BMC WebUI, and then log in to the BMC WebUI again before you attempt any other administrative operations on the BMC.

5.2.1. Back up the existing configuration

Before you update the firmware, back up the existing configuration of the DataScale SN30-H host module. Having a backup might help with recovering the BMC.

To back up the existing configuration, your system must meet the following requirements:

  • Access to the DataScale SN30-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps to back up the existing configuration:

  1. Open a web browser.

  2. In the browser’s address bar, enter the IP address of the DataScale SN30-H host module’s BMC, enter your user credentials on the login screen, and click Sign me in.

    Login screen

  3. In the dashboard, select Maintenance.

    Dashboard

  4. On the Maintenance screen, select Backup Configuration.

    Maintenance screen

  5. On the Backup Configuration screen, select Check All to back up all the BMC configuration details.

    Backup Configuration screen

  6. Click Download to save this configuration to the local system (which is accessing the BMC WebUI).

  7. Click OK to download the bmc-config.bak backup configuration file. You can use that file later if a restore is required.

    Download screen

5.2.2. Update the host module BMC firmware

Now that you have backed up the BMC configuration, you can update the SN30-H host module’s BMC firmware while preserving the configuration. Follow these steps:

  1. Download the DataScale SN30-H host module’s BMC patch update from the SambaNova Support portal to the local system that is accessing the BMC WebUI.

  2. Unzip the SambaNova patch update to a directory on the local system.

  3. On the Backup Configuration screen, select Maintenance in the left pane.

    Backup Configuration screen

  4. On the Maintenance screen, select Preserve Configuration.

    Maintenance screen

  5. Select Check All at the top of the list to preserve the configuration of everything.

    Preserve Configuration screen

    The following message appears if the configuration preservation was successful.

    Success message

  6. In the left pane, click Maintenance.

    Left pane

  7. In the Maintenance screen, select Firmware Update.

    Maintenance screen

  8. Find the rom.ima_enc file:

    1. In the Firmware Update screen, click Browse.

      Firmware Update screen

    2. Navigate to the /SN30 rack/<version>/HostBMC_FW/ directory of the patch bundle that you downloaded and unzipped.

    3. Select the rom.ima_enc file and click Open.

      rom.ima_enc file

  9. Back in the Firmware Update screen, click Start firmware update.

    Firmware Update screen

  10. Below the button that you just clicked, select the Preserve all Configuration checkbox to use the configuration that you preserved.

    Preserve all Configuration

  11. Scroll to the bottom of the screen and click Proceed to Flash.

    Proceed to Flash

  12. Click OK in the BMC update confirmation screen.

    BMC update confirmation screen

    When the BMC update process has started, the BMC is not reachable for 5 to 10 minutes while the update is being applied. The DataScale SN30-H host module OS continues to run normally during the BMC update.

    After 10 minutes, log in to the BMC WebUI again (as in step 2 of Back up the existing configuration), and confirm that the update was successful by checking the information in the upper-left area of the dashboard. The BMC firmware version is identified as <XX.XX.X>.

    BMC firmware version
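Because the BMC is unreachable for 5 to 10 minutes during the update, you might prefer to poll it instead of waiting a fixed time. The sketch below is not part of the BMC tooling. wait_until and bmc_webui_up are hypothetical names. The sketch treats any HTTPS response from the BMC as "reachable", which does not by itself confirm that the update succeeded.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for waiting until the BMC WebUI responds again.

# Run COMMAND [ARGS...] every INTERVAL seconds until it succeeds or
# TIMEOUT seconds have passed. Returns 0 on success and 1 on timeout.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  local timeout=$1 interval=$2 deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Succeed once the BMC at address $1 answers HTTPS with any HTTP status.
# -k skips certificate verification (BMCs often use self-signed certificates).
bmc_webui_up() {
  curl -k -s -o /dev/null --max-time 5 "https://$1/"
}
```

For example, wait_until 900 30 bmc_webui_up <BMC_IP_Address> polls every 30 seconds for up to 15 minutes. After the BMC responds, still confirm the firmware version in the dashboard.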

5.3. Update the DataScale SN30-H host module BIOS

After you enter update mode, widgets and other web pages and services stop working, and all open widgets are closed automatically. If you cancel the update in the middle of the process, the SN30-H host module is reset only for the BMC BOOT and APP components of the firmware. Therefore, ensure that the update process is not interrupted.
The SN30-H host module BIOS update requires a reboot of the system to apply the updated BIOS. Plan accordingly.

To update the SN30-H host module BIOS, your system must meet the following requirements:

  • Access to the DataScale SN30-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps to perform the update:

  1. Open a web browser.

  2. In the browser’s address bar, enter the IP address of the DataScale SN30-H host module’s BMC, enter your user credentials, and click Sign me in.

    Login screen

  3. In the dashboard, select Maintenance.

    Dashboard

  4. In the Maintenance screen, select Firmware Update.

    Maintenance screen

  5. Find the image.RBU file:

    1. In the Firmware Update screen, click Browse.

      Firmware Update screen

    2. Navigate to the /Host_BIOS/RBU/ directory of the uncompressed infrastructure patch bundle.

    3. Select the image.RBU file and click Open.

      image.RBU file

  6. Back in the Firmware Update screen, click Start firmware update.

    Firmware Update screen

  7. Below the button you clicked, select BIOS from the Update Type drop-down.

    Update Type drop-down list

  8. Click Proceed to Flash.

    Proceed to Flash button

  9. Click OK.

    Firmware upgrade confirmation screen

    This initiates uploading the BIOS firmware update to the DataScale SN30-H host module, but it does not automatically apply the firmware update.

    Update initiated

  10. When the screen shows Uploading 100%, click Flash BIOS.

    Flash BIOS button

    This initiates the BIOS update process.

    Flash process screen

  11. When the flash process is complete, a “firmware image has been updated successfully” message appears. Click OK to continue.

    Success message

  12. A "Firmware reset has been called" message appears. Click OK to log out of the SN30-H BMC WebUI.

    Firmware reset message

5.3.1. Reset the host module OS

As a final step, you have to reset the host module OS.

  1. After you are logged out of the SN30-H BMC, log in to the SN30-H OS.

    $ ssh snuser1@<SN30-H_OS_IP_Address>
    snuser1@<SN30-H_OS_IP_Address>’s password: <snuser1 Password>
  2. From the command line, reset the SN30-H OS to complete the BIOS update.

    $ sudo shutdown -r now
    [sudo] password for snuser1: <snuser1 Password>
  3. When the SN30-H host module is back online, confirm that the BIOS update has been applied, as follows:

    1. Log in to the SN30-H BMC and select Maintenance from the left pane of the dashboard.

      Dashboard screen

    2. In the Maintenance screen, select Firmware Information.

      Maintenance screen

    3. On the Firmware Information screen, check the firmware version under BIOS Firmware Information.

      Firmware Information screen
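As an optional cross-check from the host OS, you can read the BIOS version that the running system reports. This sketch assumes that dmidecode is installed on the SN30-H host module. bios_version and check_bios_version are hypothetical names. The string that dmidecode reports might not be formatted exactly like the version shown in the BMC WebUI.

```shell
#!/usr/bin/env bash
# Hypothetical helpers to cross-check the BIOS version from the SN30-H OS.

# Print the BIOS version from the SMBIOS tables (dmidecode needs root).
bios_version() {
  sudo dmidecode -s bios-version
}

# Compare the running BIOS version with EXPECTED.
# Returns 0 on a match and 1 otherwise.
check_bios_version() {
  local expected=$1 actual
  actual=$(bios_version) || return 1
  if [ "$actual" = "$expected" ]; then
    echo "BIOS version $actual matches the expected version."
    return 0
  fi
  echo "BIOS version is $actual; expected $expected." >&2
  return 1
}
```

For example: check_bios_version '<expected BIOS version>'. Keeping the dmidecode call in its own function makes it easy to replace when you test the comparison logic.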

5.4. Recover the DataScale SN30-H BMC

If the DataScale SN30-H host module’s BMC is no longer responding or no longer accessible, or the DataScale SN30-H host module’s BMC password has been lost or forgotten, see Backing up and restoring components.

6. DataScale SN30 RDU module administration

Administrative tasks for the DataScale® SN30-2 RDU module’s BMC include the following:

  • Changing the root password

  • Generating a secure API login token for authentication

  • Updating the DataScale SN30-2 BMC and RDU controller (RDU-C) firmware

  • Configuring the DataScale SN30-2 BMC network

  • Configuring the DataScale SN30-2 BMC hostname

There is a built-in secure account called snservice on the DataScale SN30-2 BMC. It is used to recover the root password if the password is forgotten. For more details on this account, see KB article #1049.

6.1. Change the root password

SambaNova strongly recommends that you change the default root password to a more secure password.
Passwords cannot be based on dictionary words and cannot include the # character. If you use a dictionary word, a BAD PASSWORD message results, and the password is not changed.

To change the default password for root on the DataScale SN30-2 BMC, follow these steps:

  1. Log in to the DataScale SN30-2 BMC:

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>
  2. Run the passwd command and enter a new password, as follows:

    root@xrdu:~# passwd
    New password: <New Password>
    Retype new password: <New Password>
    passwd: password updated successfully

6.2. Generate a secure API login token

You can generate a secure token for the DataScale SN30-2 BMC root user so that you don’t need to pass plain-text passwords in REST API calls.

  1. Log in to the client system from which you want to run the REST API calls. The system must have network access to the DataScale SN30-2 BMC.

  2. Run the following command to generate the token. Replace <SN30-2_BMC_IP_Address> and <Password> with the appropriate values:

    $ export token=$(curl -k -s -H "Content-Type: application/json" -X POST https://<SN30-2_BMC_IP_Address>/login -d '{"username" : "root", "password" : "<Password>"}' | grep token | awk '{print $2;}' | tr -d '"')
  3. Confirm that a token has been generated for your session:

    $ echo $token
    1h0Dk9xjtjsOtBkMhgIN
  4. To validate that the token works from the client system, run the following cURL command. Replace <SN30-2_BMC_IP_Address> with the correct DataScale SN30-2 BMC IP address.

    $ curl -k -H "X-Auth-Token: $token" https://<SN30-2_BMC_IP_Address>/xyz/openbmc_project/
    {
      "data": [
        "/xyz/openbmc_project/Ipmi",
        "/xyz/openbmc_project/certs",
        ...
        "/xyz/openbmc_project/user"
      ],
      "message": "200 OK",
      "status": "ok"
    }

    If the command returns output similar to the example, the token works. You can now use the token in other API calls, for example, to power the DataScale SN30-2 RDU module on and off.
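You can also wrap the login and validation steps in shell functions. This is a minimal sketch. It assumes that the /login response contains a "token" field, which the grep token step above implies. extract_token, bmc_login, and bmc_get are hypothetical names. The password is inserted into the JSON body as-is, so a password that contains a double quote or a backslash needs escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for token-based REST calls to the SN30-2 BMC.

# Read a JSON login response on stdin and print the value of "token".
# Newlines are removed first, so one-line and multi-line JSON both work.
extract_token() {
  tr -d '\n' | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Log in as root and print a session token.
# Arguments: BMC address, root password.
bmc_login() {
  curl -k -s -H "Content-Type: application/json" -X POST "https://$1/login" \
    -d "{\"username\": \"root\", \"password\": \"$2\"}" | extract_token
}

# Send an authenticated GET request. Arguments: BMC address, token, path.
bmc_get() {
  curl -k -s -H "X-Auth-Token: $2" "https://$1$3"
}
```

For example, token=$(bmc_login <SN30-2_BMC_IP_Address> '<Password>') followed by bmc_get <SN30-2_BMC_IP_Address> "$token" /xyz/openbmc_project/. Passing the password as an argument records it in your shell history. If that is a concern, read it into a variable with read -rs first.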

6.3. Updating the DataScale SN30-2 BMC and RDU controller (RDU-C) firmware

Updating the DataScale SN30-2 BMC and RDU controller (RDU-C) firmware requires several tasks, which must be done in sequence.

6.3.1. Prepare the DataScale SN30-2 BMC primary partition for update

To prepare the primary partition and download the files, follow these steps:

  1. Shut down the DataScale SN30-H host module in the system to ensure that no graphs or other workloads are running. See the Gracefully shutting down the DataScale SN30 rack procedure.

  2. Shut down the DataScale SN30-2 RDU module. See the Gracefully shutting down the DataScale SN30 rack procedure.

  3. Log in to the DataScale SN30-2 BMC and reboot the BMC to clear the BMC registers, as follows:

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>
    
    root@xrdu:~# reboot

    The reboot takes about 3 to 5 minutes to complete. In the meantime, you can proceed to the next step and download the DataScale SN30-2 firmware update.

  4. Download the DataScale SN30-2 firmware update file sn<XRDU_version>-xrdu-sys-fw-<fw_version_number>.tar.gz from the /latest subdirectory of the SambaNova ext-xrdu-fw repository to a system that has access to the DataScale SN30-2 BMC’s network. For details on accessing these required firmware files, see KB article #1063.

    Ensure that you download the XRDU firmware specific to the DataScale SN30, not the firmware for other DataScale versions.

  5. Uncompress the sn<XRDU_version>-xrdu-sys-fw-<fw_version_number>.tar.gz file.

  6. Copy the .mtd and .mtd.md5 firmware files from the obmc/ directory to the /dev/shm/ directory on each DataScale SN30-2 BMC that is to be updated.

    $ scp /<uncompressed directory>/obmc/obmc-rdu-<version>* root@<SN30-2_BMC_IP_Address>:/dev/shm/
    Password: <Enter root password>

    Confirm that the .mtd and .mtd.md5 files have been completely transferred to the BMC’s /dev/shm/ directory.

    Ensure that the files you copy are from the rdu-128 directory and not the rdu-64 directory.
  7. Log in to the DataScale SN30-2 BMC to which the update files were transferred, and go to the /dev/shm/ directory.

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>

    root@xrdu:~# cd /dev/shm/
  8. Confirm that the following two files are located in this directory:

    • obmc-rdu-<version>.mtd

    • obmc-rdu-<version>.mtd.md5

    root@xrdu:/dev/shm# ls obmc*
    obmc-rdu-<version>.mtd  obmc-rdu-<version>.mtd.md5
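Before you run obmcupdate, you can check each firmware file against its .md5 companion file. This sketch assumes that md5sum and awk are available on the SN30-2 BMC and that the first field of the .md5 file is the hex digest. verify_md5 is a hypothetical name, and the documented procedure does not require this check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare FILE with the digest stored in FILE.md5.
# Returns 0 if the MD5 digests match and 1 otherwise.
verify_md5() {
  local file=$1 expected actual
  # First whitespace-separated field of the .md5 file, lowercased.
  expected=$(awk '{ print tolower($1); exit }' "$file.md5") || return 1
  actual=$(md5sum "$file" | awk '{ print $1 }') || return 1
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    echo "OK: $file"
    return 0
  fi
  echo "MD5 mismatch: $file" >&2
  return 1
}
```

For example: cd /dev/shm && verify_md5 obmc-rdu-<version>.mtd. The same check applies to the rduc-<version>-*.spi files used for the RDU-C update.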

6.3.2. Perform the update on the primary partition

After you confirm that the two files are available, perform the update as follows:

  1. Run the update on the obmc-rdu-<version>.mtd firmware file.

    root@xrdu:~# obmcupdate -p primary -t bmc -f /dev/shm/obmc-rdu-<version>.mtd

    Do not run any other commands or disconnect the power supply at this time.

  2. Confirm that the Erasing, Writing, and Verifying stages complete to 100%.

  3. When all stages are completed, reboot the BMC with the new firmware.

    root@xrdu:~# reboot -f
  4. After about 3 to 5 minutes, log in to the DataScale SN30-2 BMC.

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>

    Because the update reimages the DataScale SN30-2 BMC, its SSH host key has likely changed. You might be prompted to remove the old host entry from the ~/.ssh/known_hosts file on the client that you previously used to ssh into the BMC.
  5. Confirm that the update was applied by comparing the BMC Release Version in the output with the version of the DataScale SN30-2 BMC firmware patch that you applied:

    root@xrdu:~# obmcupdate -i
    ***** RDU-C *****
    RDU-C Release Version: <current version>
    RDU-C BuildDate: #.## ####   DesignVer: ##   BoardID: ##
    ***** BMC *****
    BMC Release Version: <updated version>
    BMC BUILD ID: <updated BMC buildid>
    BMC Flash: Primary
    BMC Flash Size: 128MB
  6. If there are any issues running the update, run the obmcupdate command again.

If the update process continues to fail, contact SambaNova Support.
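To compare versions without reading the output by eye, you can extract a single field from the obmcupdate -i output. This is a hypothetical sketch: obmc_field and check_obmc_field are not part of obmcupdate. It assumes that each field appears on its own line in the Name: value form shown in the example output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of obmcupdate) for checking versions.

# Read `obmcupdate -i` output on stdin and print the value of FIELD,
# for example "BMC Release Version" or "RDU-C Release Version".
obmc_field() {
  awk -v f="$1" 'index($0, f ":") == 1 { sub(/^[^:]*: */, ""); print; exit }'
}

# Return 0 if FIELD in the output on stdin equals EXPECTED.
# Arguments: FIELD, EXPECTED.
check_obmc_field() {
  local actual
  actual=$(obmc_field "$1")
  [ "$actual" = "$2" ]
}
```

For example: obmcupdate -i | check_obmc_field "BMC Release Version" "<updated version>" && echo "BMC updated". The same function works for RDU-C Release Version after the RDU-C update.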

6.3.3. Update the DataScale SN30-2 BMC secondary/recovery partition

Reimaging the BMC removes the obmc-rdu-<version>.mtd and obmc-rdu-<version>.mtd.md5 files from /dev/shm/, so you must copy them to the BMC again.

  1. Exit out of the SN30-2 BMC and log back in to the client system where the BMC firmware files were uncompressed.

  2. Copy the obmc-rdu-<version>.mtd and obmc-rdu-<version>.mtd.md5 firmware files back to the DataScale SN30-2 BMC’s /dev/shm/ directory.

    $ scp /<uncompressed directory>/obmc/obmc-rdu-<version>* root@<SN30-2_BMC_IP_Address>:/dev/shm/
    Password: <Enter SN30-2 BMC root password>
  3. Confirm that these two files have been completely transferred to the BMC’s /dev/shm/ directory.

  4. Log back in to the DataScale SN30-2 BMC that was just updated:

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>
  5. Go to the /dev/shm/ directory on the DataScale SN30-2 BMC.

    root@xrdu:~# cd /dev/shm/
  6. Confirm that the following two files are located in this directory:

    • obmc-rdu-<version>.mtd

    • obmc-rdu-<version>.mtd.md5

      root@xrdu:/dev/shm# ls obmc*
      obmc-rdu-<version>.mtd  obmc-rdu-<version>.mtd.md5
  7. Run the update on the BMC recovery partition using the obmc-rdu-<version>.mtd firmware file.

    root@xrdu:~# obmcupdate -p recovery -t bmc -f /dev/shm/obmc-rdu-<version>.mtd

    Do not run any other commands or disconnect the power supply at this time.

  8. Confirm that the Erasing, Writing, and Verifying stages complete to 100%.

  9. If there are any issues running the update, run the update command once more. If the update process continues to fail, contact SambaNova Support.

When the update is complete, you can update the DataScale SN30-2 RDU controller (RDU-C) primary partition.

6.3.4. Update the DataScale SN30-2 RDU-C primary partition

After you’ve updated both the primary and the secondary/recovery partitions of the SN30-2 BMC, you can update the SN30-2 RDU-C.

  1. Exit out of the SN30-2 BMC and log back in to the client system where the BMC and RDU-C firmware files were uncompressed.

  2. Copy the following firmware files to the DataScale SN30-2 BMC’s /dev/shm/ directory:

    • rduc-<version>-primary.spi

    • rduc-<version>-primary.spi.md5

    • rduc-<version>-recovery.spi

    • rduc-<version>-recovery.spi.md5

      $ scp /<uncompressed directory>/rduc/rduc-<version>-* root@<SN30-2_BMC_IP_Address>:/dev/shm/
      Password: <Enter SN30-2 BMC root password>
  3. Log in to the DataScale SN30-2 BMC to which the update files were transferred.

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>
  4. Go to the /dev/shm/ directory on the DataScale SN30-2 BMC.

    root@xrdu:~# cd /dev/shm/
  5. Confirm that the following files are located in this directory:

    • rduc-<version>-primary.spi

    • rduc-<version>-primary.spi.md5

    • rduc-<version>-recovery.spi

    • rduc-<version>-recovery.spi.md5

      root@xrdu:/dev/shm# ls rduc*
      rduc-<version>-primary.spi  rduc-<version>-primary.spi.md5  rduc-<version>-recovery.spi
      rduc-<version>-recovery.spi.md5
  6. Run the update using the primary.spi firmware file to update the DataScale SN30-2 RDU-C primary partition.

    root@xrdu:/dev/shm# obmcupdate -p primary -t rduc -f /dev/shm/rduc-<version>-primary.spi

    Do not run any other commands or disconnect the power supply at this time.

  7. Confirm that the RDU-C update has taken effect by running the obmcupdate -i command.

    root@xrdu:~# obmcupdate -i
    ***** RDU-C *****
    RDU-C Release Version: <updated version>
    RDU-C BuildDate: #.## ####   DesignVer: ##   BoardID: ##
    ***** BMC *****
    BMC Release Version: <updated version>
    BMC BUILD ID: <updated build id>
    BMC Flash: Primary
    BMC Flash Size: 128MB

    Verify that the RDU-C Release Version appears as the updated version.

6.3.5. Update the DataScale SN30-2 RDU-C secondary/recovery partition

  1. To update the DataScale SN30-2 RDU-C recovery partition, run the obmcupdate command with the rduc-<version>-recovery.spi firmware file.

    root@xrdu:/dev/shm# obmcupdate -p recovery -t rduc -f /dev/shm/rduc-<version>-recovery.spi
  2. If any issues occur during the update of the DataScale SN30-2 BMC or RDU-C, contact SambaNova Support.

After the DataScale SN30-2 BMC and RDU-C have successfully been updated, it is safe to power on the DataScale SN30-2 and then the SN30-H modules. See the Power on the DataScale SN30 rack procedure.

6.4. Configure the DataScale SN30-2 BMC network

When you change the IP address of a DataScale SN30-2 BMC, you must also update the IP_ADDRESS_SP# entries in the /platform/network.json file on that BMC and on the other DataScale SN30-2 BMCs in the node that are directly connected to it.
After you change the IP address and restart the network service, current SSH sessions are terminated or left in a hung state because the IP address has changed. Log in to the DataScale SN30-2 BMC again using the new IP address.

DataScale SN30-2 BMC networking is configured as part of the DataScale SN30 rack delivery. It’s not usually necessary to modify the network configuration upon delivery, although there might be situations where the network has to be reconfigured later.

You can change the network settings by running the network-settings command, as shown below. Table 5 describes the command options.

root@xrdu:~# network-settings [-h] -i [IPADDRESS] -n [NETMASK] -g [GATEWAY] -d [DNS] [{static,DHCP}]
Table 5. Command options for network-settings

Option                         Function

{static,DHCP}                  Specify the network mode.

-h, --help                     Show the help message and exit.

-i, --ipAddress [IPADDRESS]    IP address for static connection.
                               Example: "10.10.0.0". Use "" for DHCP.

-n, --netMask [NETMASK]        Netmask prefix length for static network mode (0 to 32).
                               Use any number for DHCP.

-g, --gateWay [GATEWAY]        Gateway for static connection.
                               Example: "10.10.0.0". Use "" for DHCP.

-d, --dns [DNS]                DNS server for static connection.
                               Example: "10.10.0.0". Use "" for DHCP.
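Before you run network-settings in static mode, you can validate the arguments on a workstation. The following bash sketch is not part of network-settings: is_ipv4 and prefix_to_netmask are hypothetical names, and the code requires bash rather than a plain POSIX sh. It checks that an address is a well-formed IPv4 dotted quad and shows which dotted netmask a given -n prefix length corresponds to (for example, 24 is 255.255.255.0).

```shell
#!/usr/bin/env bash
# Hypothetical pre-checks for network-settings arguments (bash only).

# Return 0 if $1 is a dotted-quad IPv4 address with every octet 0-255.
is_ipv4() {
  local ip=$1 octet
  local -a octets
  [[ $ip =~ ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ ]] || return 1
  IFS=. read -r -a octets <<< "$ip"
  for octet in "${octets[@]}"; do
    # 10# forces base 10 so that octets such as "08" are accepted.
    (( 10#$octet <= 255 )) || return 1
  done
}

# Print the dotted netmask for a prefix length (0-32), as passed to -n.
prefix_to_netmask() {
  local prefix=$1 mask
  [[ $prefix =~ ^[0-9]{1,2}$ ]] || return 1
  prefix=$(( 10#$prefix ))
  (( prefix <= 32 )) || return 1
  # Set the top $prefix bits of a 32-bit value; a prefix of 0 gives 0.
  mask=$(( prefix == 0 ? 0 : (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo "$(( (mask >> 24) & 255 )).$(( (mask >> 16) & 255 )).$(( (mask >> 8) & 255 )).$(( mask & 255 ))"
}
```

For example, is_ipv4 10.10.0.15 && prefix_to_netmask 24 confirms the values used in Example 1 below before you apply them.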

  1. Set the IP address configuration using the network-settings command.

    Example 1: Set a static IP address of 10.10.0.15 on a /24 subnet with gateway address 10.10.0.1 and a DNS server on 10.0.0.13:

    root@xrdu:~# network-settings -i "10.10.0.15" -n 24 -g "10.10.0.1" -d "10.0.0.13" static
    Modifiying network settings ...
    Toggling network settings ...

    Example 2: Set the network mode to DHCP:

    root@xrdu:~# network-settings -i "" -n 0 -g "" -d "" DHCP
    Modifiying network settings ...
    Toggling network settings ...
  2. After you successfully run the command, restart the network service to ensure that the configuration is set and running:

    root@xrdu:~# systemctl restart systemd-networkd.service

    At this point, the current ssh session should have been terminated or be in a hung state.

  3. Open a new terminal and log in to the DataScale SN30-2 BMC:

    $ ssh root@<SN30-2_New_BMC_IP_Address>
    Password: <Enter root password>
  4. To confirm the IP address configuration, run the ip address command. In the command output, the assigned IP address appears as the second inet value under eth0.

    root@xrdu:~# ip address
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.89/16 brd 169.254.255.255 scope link eth0
    valid_lft forever preferred_lft forever
    inet 10.10.0.15/24 brd 10.10.0.255 scope global dynamic eth0
    valid_lft 40746sec preferred_lft 40746sec
    inet6...
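To check the result without scanning the full ip address output, you can filter for global-scope IPv4 addresses on one interface. This is a hypothetical helper: global_ipv4 is not a standard command. It assumes the standard multi-line ip address layout shown above. The 169.254.x.x address has scope link and is therefore skipped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ip address` output on stdin and print the
# global-scope IPv4 address(es) of one interface (default eth0).
global_ipv4() {
  local dev=${1:-eth0}
  # Interface header lines look like "2: eth0: <...>"; remember the name,
  # then print the address field of matching "inet ... scope global" lines.
  awk -v dev="$dev" '
    /^[0-9]+: / { cur = $2; sub(/:$/, "", cur) }
    cur == dev && $1 == "inet" && /scope global/ { print $2 }
  '
}
```

For example: ip address | global_ipv4 eth0.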

6.5. Configure the DataScale SN30-2 hostname

To configure or modify the DataScale SN30-2 hostname, follow these steps:

  1. Log in to the DataScale SN30-2 BMC:

    $ ssh root@<SN30-2_BMC_IP_Address>
    Password: <Enter root password>
  2. Run the following command to configure or modify the DataScale SN30-2 hostname:

    root@xrdu:~# hostnamectl set-hostname <hostname>
  3. To see the new hostname, log out and log back in to the DataScale SN30-2 BMC.
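To catch typos before you run hostnamectl, you can check the name against the RFC 1123 hostname rules. This bash sketch is hypothetical (is_valid_hostname is not a system command), and for simplicity it rejects a trailing dot.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: return 0 if $1 follows RFC 1123 hostname rules:
# at most 253 characters, dot-separated labels of 1-63 letters, digits,
# or hyphens, where no label starts or ends with a hyphen.
is_valid_hostname() {
  local name=$1 label
  local -a labels
  [ -n "$name" ] && [ "${#name}" -le 253 ] || return 1
  # read would silently drop an empty label after a trailing dot.
  [[ $name != *. ]] || return 1
  IFS=. read -r -a labels <<< "$name"
  for label in "${labels[@]}"; do
    [[ $label =~ ^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$ ]] || return 1
  done
}
```

For example: is_valid_hostname <hostname> && hostnamectl set-hostname <hostname>.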

7. Monitor and debug the DataScale SN30 rack

The DataScale® SN30 rack supports standard methods to monitor and triage the system. This page includes some tasks you can perform, such as examining log files, and also explains how to collect diagnostic information for use with SambaNova Support.

7.1. Overview of tools and logs

Several tools and logs can help you resolve problems. Here’s an overview:

Table 6. Monitoring and debugging tools

Task                                  Tool                      See

Check the status of the DataScale     xrdutool                  View xrdutool diagnostics and logs
SN30-2 RDU module

Configure SNMP alerts for third-party SNMP alerts               Set up SNMP alerts
rack components

Diagnose problems with logs           OS logs, BMC logs,        Viewing system logs
                                      compiler logs,
                                      application logs

Check and manage SND, view SND logs   SND (SambaNova Daemon)    SambaNova daemon (SND) diagnostics

Debug model compilation, running      Misc. tools and logs      Debugging DataScale SN30 issues
models, and third-party components

If you cannot resolve the issues yourself, create a support case and include diagnostic materials. See View SN30-H BMC diagnostic information and logs.

7.2. View xrdutool diagnostics and logs

Use the xrdutool tool and logs to diagnose DataScale SN30-2 issues and to collect information that SambaNova Support can use to triage an issue. The tool reports the status of the DataScale SN30-2 RDU module on which it runs.

Use the tool to check the overall status of the DataScale SN30-2 RDU module and of the hosted RDUs and memory. Follow these steps to examine the output on the power and fault status of the DataScale SN30-2 board:

  1. Log in to the BMC of the DataScale SN30-2 RDU module that is having problems:

    $ ssh root@<BMC_IP_Address>
    Password: <Enter root password>
  2. Run the xrdutool command:

    root@xrdu:~# xrdutool status
  3. Examine the output, which gives a quick view of the state of the DataScale SN30-2 RDU module, its two RDUs, and the RDU controller. The output:

    • Shows whether any faults have been detected.

    • Shows the power state of the DataScale SN30-2 RDU module and of the RDUs.

Here’s an example:

Power is on
RDU-C Release Version: 4.4.0
RDU-C BuildDate: 10.17 1654   DesignVer: 69   BoardID: 60
XRDU_0: STATUS
--------------------------------------------------------
SYSTEM :  rdu3    rdu2    rdu1    rdu0    stby    ps      pex0    pex1    sys     p3v3        mss_op_state   mss_log_level
           1       1       1       1       1       1       1       1       1       1               4               1
--------------------------------------------------------
RDU_0/D_0  0935a00001f1d6a4 102007b367359895     RDU_0/D_1  09a6c000012eda24 605007b367359895     ON. Please verify rdu_pwr_status[0] value to determine faults
--------------------------------------------------------
ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
           1       1               1       1               1               1       1       1
PWRGOOD:  vddo    pvpp0   pvpp1   pvdd0    pvdd1  pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1   vddc2   vddc3
           1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
--------------------------------------------------------
RDU_1/D_0  09e9a00001a5dc64 502807b367359895     RDU_1/D_1  08e8200000bedd24 107007b367359895     ON. Please verify rdu_pwr_status[1] value to determine faults
--------------------------------------------------------
ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
           1       1               1       1               1               1       1       1
PWRGOOD:  vddo    pvpp0   pvpp1   pvdd0    pvdd1  pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1   vddc2   vddc3
           1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
--------------------------------------------------------
PEX_0:   fpga_p0v8_pex_pgd2   pg_p1v25_pex   pg_p1v8_pex_pll   fpga_pg_p1v8_pex
               1               1               1               1
--------------------------------------------------------
PEX_1:   fpga_p0v8_pex_pgd2   pg_p1v25_pex   pg_p1v8_pex_pll   fpga_pg_p1v8_pex
               1               1               1               1
--------------------------------------------------------
rduc_pwr_status[0] = 0x7fff
rduc_pwr_status[1] = 0x7fff
pex_pwr_status[0] = 0x7f
pex_pwr_status[1] = 0x7f
power_status_aggregate = 0x7fff
Board Type: 3
NUM_RDUS: 2
NUM_DIE_PER_RDU: 2
NUM_DIES: 4
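The summary lines at the end of the output (for example, rduc_pwr_status[0] = 0x7fff) are easy to compare between runs. The sketch below assumes that you saved xrdutool status output from a time when the module was known to be healthy. status_values and compare_status are hypothetical names. This document does not define the meaning of each bit, so a difference only tells you that something changed. To interpret it, see KB article #1024 or contact SambaNova Support.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing xrdutool status summaries.

# Read `xrdutool status` output on stdin and print only the
# "name = value" summary lines, e.g. "rduc_pwr_status[0] = 0x7fff".
status_values() {
  grep -E '^[A-Za-z_]+(\[[0-9]+\])? = ' || true
}

# Compare the current output (stdin) with a saved known-good output file.
# Prints a diff and returns non-zero if any summary value differs.
compare_status() {
  local baseline=$1 tmp rc
  tmp=$(mktemp) || return 2
  status_values > "$tmp"
  status_values < "$baseline" | diff - "$tmp"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example, save a baseline with xrdutool status > <baseline_file> while the module is healthy, and later run xrdutool status | compare_status <baseline_file>.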

In addition to collecting diagnostic information directly from the SN30-2 RDU module, you can get the health status of all the SN30-2 RDU modules in the SN30-8 node by using the SambaNova Fault Management (SNFM) utility, which comes preinstalled on the host. See the SambaNova Fault Management (SNFM) User Guide in the SambaNova Runtime documentation in the SambaNova documentation portal (https://docs.sambanova.ai).

For details on diagnosing a DataScale SN30-2 RDU module’s BMC and on collecting the required diagnostic and log material, see KB article #1024, "DataScale SN30-2 Diagnostic Collection", in the SambaNova Support portal.

7.3. Set up SNMP alerts

To configure SNMP alerts for non-SambaNova components in the DataScale SN30 rack, see the vendor-specific documentation.

7.4. Viewing system logs

You can use the following log files to identify and resolve issues with the system or an application:

  • OS logs

  • BMC logs

  • SambaNova compiler logs

  • Application logs

7.4.1. OS logs

SambaNova does not alter the logs or log directories for Red Hat Enterprise Linux or Ubuntu. Most logs are in the /var/log/ directory, and you can also use standard log tools such as journalctl.

7.4.3. SambaNova compiler logs

Additional compiler logs are written to the output directory that you specified when you compiled the model. These logs are fairly low level and are typically requested by SambaNova Support to troubleshoot issues. For details, see Collect diagnostic materials for SambaNova Support.

You can use different compiler log verbosity settings to debug issues. See the SambaFlow Runtime document for details.

7.4.4. Runtime logs

The following log files related to SambaNova are in the /var/log/sambaflow/runtime/ directory:

sn.log

Logs related to SambaNova graph operations. Events received by the graph process and graph-specific events (including errors) that are not logged to snd.log.

snd.log

SambaNova daemon (SND) system logs. Summary of RDU resources and hardware error events.

Additional log events, such as kernel messages from the RDU driver module, go to the kernel ring buffer, which you can view with dmesg(1).

You can use different log verbosity settings to get more logging details for the SambaFlow™ Runtime and other SambaFlow components. See "Changing Runtime Log Levels" in the SambaNova Runtime Guide.
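When you triage a hardware problem, the most recent error entries in snd.log are often the first place to look. This sketch assumes that relevant entries contain the word "error" in some letter case, which this document does not specify. recent_errors is a hypothetical name, and reading the log files might require sudo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last COUNT lines of LOG that contain
# "error" in any letter case. Defaults: snd.log and 20 lines.
recent_errors() {
  local log=${1:-/var/log/sambaflow/runtime/snd.log} count=${2:-20}
  grep -i 'error' "$log" | tail -n "$count"
}
```

For example, recent_errors shows the last 20 matching snd.log lines, and recent_errors /var/log/sambaflow/runtime/sn.log 50 checks the graph log instead.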

7.5. SambaNova daemon (SND) diagnostics

The SambaNova daemon (SND) runs on the DataScale SN30-H host module and manages several critical parts of SambaNova operation. The SND is responsible for:

  • Loading and unloading the RDU drivers

  • Initializing RDU system resources

  • Managing hardware faults for the RDU system

  • Enabling the debugging of the RDU system’s hardware resources

The SND is required to run graphs and models because:

  • The SND handles the RDU drivers and the initialization of RDU resources.

  • The SND is aware of issues with RDU resources and can avoid problematic resources.

The SND starts automatically:

  • When the DataScale SN30-H OS boots. The SND then discovers and initializes the RDUs, which is why you must power on the DataScale SN30-2 RDU modules before you power on the SN30-H host module.

  • When the SambaFlow package is installed. In this case, the SND waits a few minutes after the installation for the RDU system discovery and initialization processes to complete.

7.5.1. Check SND status

To check the status of the SND, run the systemctl status snd command. Below is sample output showing what the command might return:

$ sudo systemctl status snd
● snd.service - SN Devices Service
     Loaded: loaded (/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/snd.service.d
             └─override.conf
     Active: active (running) since Wed 2022-10-19 07:10:10 PDT; 3h 24min ago
   Main PID: 5263 (snd)
      Tasks: 10 (limit: 629145)
     Memory: 164.9M
     CGroup: /system.slice/snd.service
             └─5263 /opt/sambaflow/bin/snd
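If you check the SND from a script, you can read the Active: line of the status output. The following Python sketch is an illustration, not a SambaNova utility; the helper names are hypothetical, and it assumes the standard systemd "Active: <state> (<sub-state>)" line shown in the sample above. It passes the standard systemctl --no-pager option so that the output is not sent to a pager.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches systemd's "Active: <state> (<sub-state>)" line, for example:
#      Active: active (running) since Wed 2022-10-19 07:10:10 PDT; 3h 24min ago
ACTIVE_LINE = re.compile(r"^\s*Active:\s+(\S+)(?:\s+\(([^)]*)\))?", re.MULTILINE)


def parse_active_state(status_output: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (state, sub_state) from `systemctl status` output, or None if not found."""
    match = ACTIVE_LINE.search(status_output)
    if match is None:
        return None
    return match.group(1), match.group(2)


def snd_status_text() -> str:
    """Run `systemctl status snd --no-pager` and return its standard output.

    systemctl status exits non-zero when the unit is not running, so the exit
    code is deliberately not treated as an error here.
    """
    result = subprocess.run(
        ["systemctl", "status", "snd", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

For output like the sample above, parse_active_state(snd_status_text()) returns ("active", "running"). For a plain yes or no answer, systemctl is-active snd prints the state and exits with status 0 only when the unit is active.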

7.5.2. Start, stop, and restart SND

You can start, stop, and restart the SND with the following commands:

To start the SND:

$ sudo systemctl start snd

To stop the SND:

$ sudo systemctl stop snd

To restart the SND:

$ sudo systemctl restart snd
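After a restart, you might want a script to wait until systemd reports the SND as active before you run models. The following Python sketch restarts the service and then polls systemctl is-active --quiet snd, which exits with status 0 only while the unit is active. The helper name, the 300-second timeout, and the 5-second poll interval are hypothetical placeholders, not SambaNova recommendations. An active unit does not by itself confirm that RDU discovery and initialization have finished, so check snd.log as well. The run, sleep, and clock parameters exist only so that the logic can be exercised without a real system.

```python
import subprocess
import time
from typing import Callable

# `run` defaults to subprocess.run but can be replaced (for example, in a
# test) by any callable that accepts the same arguments used below.
Runner = Callable[..., subprocess.CompletedProcess]


def restart_snd_and_wait(timeout_s: float = 300.0,
                         poll_s: float = 5.0,
                         run: Runner = subprocess.run,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic) -> bool:
    """Restart snd, then return True once it is active or False on timeout."""
    # Same as: sudo systemctl restart snd
    # check=True raises CalledProcessError if the restart command itself fails.
    run(["sudo", "systemctl", "restart", "snd"], check=True)

    deadline = clock() + timeout_s
    while True:
        # `systemctl is-active --quiet` prints nothing and exits 0 only if active.
        status = run(["systemctl", "is-active", "--quiet", "snd"], check=False)
        if status.returncode == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A False result means the unit did not become active within the timeout; the systemctl status snd output and snd.log are the next places to look.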

7.5.3. Use SND for debugging

The SND CLI provides physical visibility into the entire DataScale SN30-8 system. This allows complete access to the RDU system for debugging, triage, and validation efforts.

The SND also responds to error events that occur on the RDU and across the entire DataScale SN30-2 RDU module.

All logs from the SND are written to /var/log/sambaflow/runtime/snd.log. This log provides a summary of the RDU resources available to the system and includes any hardware error events that occur. The information is useful for diagnosing and resolving hardware issues.
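To look at the most recent SND entries, such as the latest hardware error events, you can read only the last lines of snd.log, much like tail -n. The following Python sketch uses collections.deque with a maximum length, so only the requested number of lines is held in memory while the file is read. The helper name and the default of 50 lines are arbitrary choices, and reading the file might require root privileges.

```python
from collections import deque
from pathlib import Path
from typing import List

SND_LOG = Path("/var/log/sambaflow/runtime/snd.log")


def tail_lines(path: Path = SND_LOG, count: int = 50) -> List[str]:
    """Return the last `count` lines of a text file, similar to `tail -n count`."""
    if count <= 0:
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        # The file is read sequentially; a deque with maxlen keeps only
        # the newest `count` lines.
        last = deque(handle, maxlen=count)
    return [line.rstrip("\n") for line in last]
```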

7.6. Debugging DataScale SN30 issues

Troubleshooting might require that you debug issues in the following areas of the DataScale SN30 rack:

  • Compilation of models

  • Running of models

  • Third-party components

7.7. Debug model compilation

For problems that occur while compiling models, run the following command and examine the logs that are generated in the user-specified output directory:

$ python <model_script.py> compile --output-folder=<output_directory>

You can set different levels of logging verbosity when you compile a model. See Collect diagnostic materials for SambaNova Support for best practices when creating a support case.
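When you prepare a support case, it can help to keep a copy of the compiler's console output alongside the logs in the output directory. The following Python sketch runs the compile command shown above and writes its combined stdout and stderr to a file. Only the compile --output-folder=<output_directory> invocation comes from this guide; the helper names and the default file name compile_console.log are hypothetical, and any verbosity options should come from the SambaFlow documentation.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_compile_command(model_script: str, output_dir: str) -> List[str]:
    """Build `python <model_script.py> compile --output-folder=<output_directory>`."""
    # sys.executable reuses the current interpreter (for example, an activated venv).
    return [sys.executable, model_script, "compile", f"--output-folder={output_dir}"]


def compile_with_console_log(model_script: str, output_dir: str,
                             console_log: str = "compile_console.log") -> int:
    """Run the compile command, write stdout and stderr to console_log, return the exit code."""
    with Path(console_log).open("w", encoding="utf-8") as log:
        result = subprocess.run(
            build_compile_command(model_script, output_dir),
            stdout=log,
            stderr=subprocess.STDOUT,  # interleave errors with normal output
            check=False,  # a failed compile is reported through the return code
        )
    return result.returncode
```

A non-zero return code indicates that the compile failed; attach the console log together with the logs from the output directory.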

7.7.1. Debug running models

For problems that occur while running models, use these resources:

  • The /var/log/sambaflow/runtime/ log files

    These logs provide a first look at an issue that occurs while running a model. If the problem is reproducible, increase the logging verbosity for SambaFlow Runtime. See the "Changing Runtime Log Levels" section of the SambaNova Runtime Guide for details.

  • The SambaNova Fault Management (SNFM) tool

    The SNFM tool provides a framework to:

      • Monitor, log, and clear various faults associated with a DataScale SN30-2 RDU module.

      • Provide corrective actions to recover from these faults.

    This capability is built into the SambaNova daemon (SND) and installed as part of SambaFlow. See "SambaNova Fault Management (SNFM) User" in the SambaNova Runtime Guide for details.

7.7.2. Debug third-party components

For operational issues with the third-party components in the DataScale SN30 rack, see the vendor-specific documentation. For issues that require additional support or for questions related to troubleshooting, open a support case through SambaNova Support. See KB article #1017, "SambaNova Systems Support Best Practices," at https://support.sambanova.ai.

Do not open a case directly with the product vendor.

7.8. Collect diagnostic materials for SambaNova Support

When you open a support case, provide details about the issue and initial diagnostic materials. For instructions on collecting diagnostic materials, see the following KB articles in the SambaNova Support portal:

  • DataScale SN30-2 Diagnostic Collection: KB article #1024

  • DataScale SN30-H BMC Diagnostic Collection: KB article #1039

  • DataScale SN30-H (Red Hat Enterprise Linux) Diagnostic Collection: KB article #1039

  • DataScale SN30-H (Ubuntu) Diagnostic Collection: KB article #1039

  • Ethernet Data Switch Diagnostic Collection: KB article #1053

  • Access Switch Diagnostic Collection: KB article #1053

  • Serial Console Server Diagnostic Collection: KB article #1121

  • PDU Diagnostic Collection: KB article #1120
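Follow the KB articles listed above for the supported collection procedures. If you also want to attach a quick snapshot of the runtime logs in /var/log/sambaflow/runtime/, a compressed archive is a convenient format. The following Python sketch is not a replacement for those procedures; the helper name and the archive naming scheme are hypothetical, and reading the logs might require root privileges.

```python
import tarfile
import time
from pathlib import Path

RUNTIME_LOG_DIR = Path("/var/log/sambaflow/runtime")


def archive_runtime_logs(log_dir: Path = RUNTIME_LOG_DIR,
                         dest_dir: Path = Path(".")) -> Path:
    """Write log_dir into a timestamped .tar.gz file in dest_dir and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = Path(dest_dir) / f"sambaflow-runtime-logs-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname stores paths relative to the directory name, e.g. "runtime/snd.log".
        tar.add(str(log_dir), arcname=Path(log_dir).name)
    return archive
```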

7.9. View SN30-H BMC diagnostic information and logs

To quickly identify a system’s status and view diagnostic information and logs for the DataScale SN30-H BMC, follow these steps:

  1. Log in to the BMC’s Web UI and view the BMC dashboard.

  2. For details on logs and pending events/deassertions, click the More info link in each box.

  3. To find more logs and reports, click Logs & Reports in the left pane and select a log.

See KB article #1039, “Diagnostic Data Collection Tool (samba_diag),” in the SambaNova Support portal (https://support.sambanova.ai) for details on:

  • Diagnosing a DataScale SN30-H host module’s BMC

  • Diagnosing the DataScale SN30-H host module in general

  • Collecting the required diagnostic materials and logs

8. Back up and restore components

Use your site-specific guidelines and tools for backing up and restoring components of the DataScale® SN30 rack.

If you change the standard configuration of the networking equipment that is shipped to you, save the configuration changes you make to the devices. For details, see the SambaNova Day 1 Document and the KB articles listed below. You can find KB articles in the SambaNova Support portal at https://support.sambanova.ai.
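How you export a device's configuration depends on the vendor, and the KB articles below describe the procedure for each device. Once you have a configuration as text, keeping timestamped copies makes it easier to compare or roll back changes. The following Python sketch handles only that local bookkeeping; the helper name, directory layout, and file naming are hypothetical examples, not a SambaNova convention.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_config_snapshot(device: str, config_text: str, backup_dir: Path,
                         now: Optional[datetime] = None) -> Optional[Path]:
    """Save config_text as <backup_dir>/<device>/<UTC timestamp>.conf.

    Returns None without writing anything if config_text is identical to the
    newest existing snapshot for the device.
    """
    device_dir = Path(backup_dir) / device
    device_dir.mkdir(parents=True, exist_ok=True)

    # Names of the form YYYYMMDDTHHMMSSZ sort chronologically as strings.
    snapshots = sorted(device_dir.glob("*.conf"))
    if snapshots and snapshots[-1].read_text(encoding="utf-8") == config_text:
        return None

    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    target = device_dir / f"{stamp}.conf"
    target.write_text(config_text, encoding="utf-8")
    return target
```

Two snapshots taken within the same second share a file name, so the second one overwrites the first; that is acceptable for manual backups but worth changing if you automate this.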

8.1. Recover the Juniper access and data switches

For the process to recover the Juniper access switch and data switch, see the following KB articles:

  • Juniper Switch Password Recovery: KB article #1056

  • Juniper Switch Factory Reset Recovery: KB article #1056

  • Juniper Switch Saving Running Configuration: KB article #1056

8.2. Recover the Lantronix serial console server

For the process to recover the Lantronix serial console server, including recovering the sysadmin password, see the following KB articles:

  • Lantronix Serial Console Server Password Recovery: KB article #1059

  • Lantronix Serial Console Server Factory Reset Recovery: KB article #1059

  • Lantronix Serial Console Server Saving Running Configuration: KB article #1059

8.3. Recover the DataScale SN30-H host module

If the DataScale SN30-H OS needs to be recovered and the SN30-H host boot partitions are not damaged, contact SambaNova Support. It might be possible to restore the SN30-H OS to its factory baseline, which can be faster than recovering with the recovery ISOs.

For the processes to recover the DataScale SN30-H host module, see the following KB articles:

  • DataScale SN30-H OS Recovery Using the Recovery ISO – Ubuntu: KB article #1051

  • DataScale SN30-H OS Recovery Using the Recovery ISO – Red Hat: KB article #1099

  • DataScale SN30-H BMC Password Recovery: KB article #1021

  • DataScale SN30-H BMC Non-Corruption Recovery: KB article #1038

8.4. Recover the DataScale SN30-2 RDU module

For the process to recover the DataScale SN30-2 RDU module, refer to the following KB article:

  • SambaNova DataScale SN30-2 BMC Password Recovery: KB article #1049

8.5. Upload recovery configuration files

For the process to upload configuration files used as part of the recovery process for some of these components, see the following KB articles:

  • Uploading Configuration Files for Recovery: KB article #1055

  • Listing and Downloading Configuration Files for Recovery: KB article #1044

For questions concerning any of these recovery KB articles or for anything that is not covered here, open a support case through the SambaNova Support portal (https://support.sambanova.ai).