How SambaNova administrators set up Slurm
Slurm is an open-source cluster management and job scheduling system for Linux clusters. You can use Slurm to manage SambaNova hardware resources in the same way it manages CPU and GPU resources.
This page is for system administrators who install and configure Slurm. If you are a developer who wants to use Slurm with your SambaNova system, see How SambaNova developers can use Slurm.
- Depending on how the SambaNova system was initially set up, system administrators either install and configure the SambaNova Slurm build (preferred) or modify an existing Slurm installation.
- Administrators then configure SambaNova GRES (Generic Resources).
Installation and configuration require superuser privileges.
SambaNova supported Slurm versions
Our Slurm plugin currently supports only Slurm 19.05.5 and 20.02.5. If you're using a different version of Slurm in your environment, you can still add GresTypes definitions to slurm.conf and specify GRES requirements, but RDU health check and update are not enabled. In that case, Slurm always assumes that all required resources are available and allocates those resources as requested.
- If you're not using one of those versions, you cannot install our sambanova-tools-slurm package. If you change your existing version to an unsupported version, the sambanova-tools-slurm package no longer works.
- If you want to use an unsupported version of Slurm, remove both the sambanova-tools-slurm and sambanova-runtime-plugin-slurm packages. You can still define and specify Generic Resource (GRES) requirements.
Option 1: Install and configure the SambaNova Slurm build
If you don’t have Slurm installed in your environment, you install and configure the SambaNova Slurm package and plugin.
Process overview
Follow the steps in the next section. Here’s an overview:
- Create a munge user.
- Install sambanova-tools-slurm and sambanova-runtime-plugin-slurm.
- Create and customize slurm.conf.
Install the SambaNova Slurm build
To install the SambaNova Slurm build, follow these steps:
- Create a munge user according to your organization's IT policies (see the MUNGE installation guide for more information). One way to do this is:
  $ adduser munge --disabled-password
- (Optional) If MUNGE is not yet installed, install the MUNGE dependency for user authentication:
  - For Ubuntu:
    $ sudo apt install sambanova-deps-munge
  - For Red Hat:
    $ sudo dnf install sambanova-deps-munge
- Finalize MUNGE setup:
  $ sudo chown munge: /etc/munge && sudo chown -R munge: /var/lib/munge && sudo chown -R munge: /var/log/munge
  $ sudo -u munge /usr/sbin/mungekey --verbose
  $ sudo chown munge: /etc/munge/munge.key
  $ sudo systemctl enable munge && sudo systemctl start munge
  For Red Hat, we use the MUNGE package provided by the OS. We have seen the munged service fail to start on reboot because the system removes the /var/run/munge directory on reboot. One way to avoid this problem is to add the following to munge.service (a minimal way to do that is sketched below):
  RuntimeDirectory=munge
  RuntimeDirectoryMode=0755
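  One way to apply those two settings, assuming a systemd drop-in is acceptable in your environment (the drop-in file name below is illustrative):
  $ sudo mkdir -p /etc/systemd/system/munge.service.d
  $ sudo tee /etc/systemd/system/munge.service.d/override.conf <<'EOF'
  [Service]
  # Re-create /var/run/munge with the right ownership and mode each time munged starts
  RuntimeDirectory=munge
  RuntimeDirectoryMode=0755
  EOF
  $ sudo systemctl daemon-reload && sudo systemctl restart munge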
- Create a Slurm user according to your organization's IT policies (see the Slurm installation guide for more information). One way to do this is:
  $ adduser slurm --disabled-password
- Install the SambaNova Slurm package on the controller node and on each compute node.
  - For Ubuntu:
    $ sudo apt install sambanova-tools-slurm
  - For Red Hat:
    $ sudo dnf install sambanova-tools-slurm
- On each compute node in the cluster, install the SambaNova Slurm Generic Resource (GRES) plugin:
  - For Ubuntu:
    $ sudo apt install sambanova-runtime-plugin-slurm
  - For Red Hat:
    $ sudo dnf install sambanova-runtime-plugin-slurm
- If slurm.conf is not yet present, create the file:
  - If you already have a slurm.conf file on other nodes, copy that configuration.
  - Otherwise, create slurm.conf.
- Set up Slurm by running the following commands:
  $ sudo mkdir /var/spool/slurmd && sudo chown slurm: /var/spool/slurmd && sudo chmod 755 /var/spool/slurmd && \
    sudo touch /var/log/slurmd.log && sudo chown slurm: /var/log/slurmd.log && sudo mkdir /var/spool/slurm_state && \
    sudo chown slurm: /var/spool/slurm_state && sudo chmod 755 /var/spool/slurm_state && \
    sudo touch /var/log/slurmctld.log && sudo chown slurm: /var/log/slurmctld.log && \
    sudo touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log && \
    sudo chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
- Finalize the configuration. The exact commands differ between the controller and a compute node:
  - To start on the controller, run:
    $ systemctl enable slurmctld
    $ systemctl start slurmctld
  - To start on a compute node, run:
    $ systemctl enable slurmd
    $ systemctl start slurmd
Configure the new Slurm installation
The next step is to configure the compute nodes you wish Slurm to manage and make any other general Slurm configuration changes needed.
- See the Configuration section of the Slurm Quick Start Administrator Guide for general Slurm information.
- See Configure SambaNova Slurm components for information about specifying GRES definitions and scripts.
Option 2: Add the Slurm plugin to a Slurm installation
If you’re using Slurm in your environment and you want to add a new SambaNova node to your existing cluster, you can reuse and customize your existing slurm.conf
file.
Process overview
If you have a supported Slurm version (19.05.5 or 20.02.5):
- Install sambanova-tools-slurm and sambanova-runtime-plugin-slurm.
- Copy the existing slurm.conf to the SambaNova node and customize it.
If you do not have a supported Slurm version, then you can still add GresType definitions to slurm.conf
and specify GRES requirements, but RDU health check and update are not enabled. In that case, Slurm always assumes that all required resources are available and allocates those resources as requested.
- Install a supported version of Slurm on the SambaNova node.
- Copy the existing slurm.conf to the SambaNova node and customize it (see the example after this list).
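As a rough sketch, copying the file can be as simple as the following, assuming the controller keeps its configuration in /etc/slurm/slurm.conf and the new node is reachable as sn-node1 (both the path and the host name are illustrative and may differ in your installation):
$ scp /etc/slurm/slurm.conf sn-node1:/tmp/slurm.conf
$ ssh sn-node1 'sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf'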
Follow the steps in the next section.
Add the SambaNova Slurm plugin
Follow these steps to add the SambaNova Slurm plugin:
- On each compute node in the cluster, install the SambaNova Slurm Generic Resource (GRES) plugin:
  - For Ubuntu:
    $ sudo apt install sambanova-runtime-plugin-slurm
  - For Red Hat:
    $ sudo dnf install sambanova-runtime-plugin-slurm
- Restart the slurmd service on the compute nodes:
  $ sudo systemctl restart slurmd
Configure SambaNova Slurm components
If you specify the expected number of RDUs and also check RDU health, you can prevent RDU resource oversubscription if one of the RDUs needs to be taken offline.
- The Slurm RDU plugin checks the configuration file and validates the configuration.
- The TaskProlog and TaskEpilog scripts check RDU health.
Make these changes:
- Open the slurm.conf file.
- To support SambaNova Reconfigurable Data Units (RDUs), specify the GRES definitions:
  GresTypes=rdu,rdu_tile,rdu_mem
- To enable the health checks, add the following lines to slurm.conf:
  EpilogSlurmctld=/opt/sambaflow/slurm/python/sn_inventory_update
  TaskProlog=/opt/sambaflow/slurm/python/sn_inventory_check
  TaskEpilog=/opt/sambaflow/slurm/python/sn_inventory_check
  The three scripts TaskProlog, TaskEpilog, and EpilogSlurmctld monitor RDU health on compute nodes before and after each job is executed. The TaskEpilog script compares the inventory of healthy RDU resources recorded during TaskProlog with the inventory after the job has executed.
- Restart slurmctld for the change to take effect.
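On a systemd-managed controller, one way to restart the daemon is:
$ sudo systemctl restart slurmctld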
Here's an example slurm.conf file:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=labhost1.example.com
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm_state
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
#
#
# Scripts
EpilogSlurmctld=/opt/sambaflow/slurm/python/sn_inventory_update
TaskProlog=/opt/sambaflow/slurm/python/sn_inventory_check
TaskEpilog=/opt/sambaflow/slurm/python/sn_inventory_check
#
#
# Generic resources
GresTypes=rdu,rdu_tile,rdu_mem
#
#
# COMPUTE NODES
# Each node has 8 healthy RDUs and declares 100G of RDU device memory as no_consume
NodeName=labhost1.example.com CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=1031737 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:100G
NodeName=labhost2.example.com CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=1031737 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:100G
PartitionName=test Nodes=labhost1.example.com,labhost2.example.com Default=YES MaxTime=INFINITE State=UP
If the inventory of the Slurm cluster has changed:
- Change the Slurm GRES definition to avoid oversubscription on the node.
- TaskEpilog detects the change and drains the node with a specific reason.
- EpilogSlurmctld notices the change and updates the GRES state at job termination.
Without this update, it is possible for Slurm to oversubscribe the node, and jobs may fail.
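To see whether a node has been drained this way, and with what reason, you can query the cluster with standard Slurm commands (the node name is illustrative):
$ sinfo -R
$ scontrol show node labhost1.example.com | grep -E 'State|Reason|Gres'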
Because a privileged user must run the Slurm scontrol update
command, we provide the sn_gres_update
setuid executable to update the GRES resource definition for the compute node. This executable updates the GRES definition for only the node on which it is executed.
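For reference, the manual equivalent that a privileged administrator would otherwise run looks something like the following; the node name and count are illustrative, and whether you also resume the node afterward depends on your workflow. In normal operation, prefer the provided sn_gres_update executable.
$ sudo scontrol update NodeName=labhost1.example.com Gres=rdu:7
$ sudo scontrol update NodeName=labhost1.example.com State=RESUME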
Slurm configuration troubleshooting
The Slurm cluster already uses scripts
If the Slurm cluster already uses one or more of the scripts (TaskProlog, TaskEpilog, and/or EpilogSlurmctld), create a shell script that includes the existing script and the SambaNova-provided sn_inventory_check and sn_inventory_update (a sketch follows below).
The same sn_inventory_check
script serves as both prolog and epilog and uses the context environment variables set by Slurm to vary behavior appropriately.
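A minimal sketch of such a wrapper for TaskProlog, assuming the existing site prolog lives at /etc/slurm/task_prolog.local (an illustrative path); make the wrapper executable and point TaskProlog= in slurm.conf at it. The same pattern applies to TaskEpilog and EpilogSlurmctld, substituting sn_inventory_check or sn_inventory_update as appropriate.
#!/bin/bash
# Wrapper TaskProlog: run the existing site prolog first, then the
# SambaNova inventory check; fail the task if either reports an error.
/etc/slurm/task_prolog.local || exit $?
exec /opt/sambaflow/slurm/python/sn_inventory_check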
Configure SambaNova GRES (Generic Resources)
Generic Resources (GRES) are resources associated with a specific node that can be allocated to jobs and steps. SambaNova has defined some resources to enable Slurm to manage SambaNova RDUs.
Manage SambaNova RDUs with the rdu GRES
We have defined a new Slurm generic resource (GRES) type so that Slurm can manage SambaNova RDUs. On each compute node, Slurm GRES resources can be specified:
- either in the shared slurm.conf file
- or in a gres.conf file.
The specification depends on the hardware you are using. For example, here are some specifications for SN10 and SN30:
SN10: Gres=rdu:8,rdu_tile:32,rdu_mem
SN30: Gres=rdu:16,rdu_tile:64,rdu_mem
How you specify the rdu GRES
For each SambaNova host in the cluster, do one of the following:
- Add a line to slurm.conf specifying the CPU and RDU resources of the system.
- Or specify the CPU resources in slurm.conf and the RDU GRES resources in the gres.conf file.
For example, to add a SambaNova 10-8 system named sn101
to the cluster, add the following line to the slurm.conf
file:
NodeName=sn101 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu:8
The Gres=rdu:8
specification tells Slurm that this node expects 8 available RDUs.
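If you prefer the second approach, a gres.conf entry on the sn101 node could look like the sketch below (a count-only definition with no device files; check the Slurm gres.conf documentation for your version to confirm whether the slurm.conf node line must also carry the Gres= specification):
# /etc/slurm/gres.conf on sn101 (the path may differ in your installation)
Name=rdu Count=8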
How slurmd uses the rdu GRES
When the slurmd(8) daemon on a node starts up, it uses the RDU GRES plugin to validate the inventory of healthy RDUs on the system against the specification in slurm.conf.
- If the configured count is equal to or less than the system-detected count, the configured count is used.
- If the configured count is greater than the system-detected count, the system-detected count is used.
- If the system-detected count is 0, slurmd fails.
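To verify what the controller actually registered for a node after slurmd starts, standard Slurm queries work (the node name is illustrative):
$ scontrol show node sn101 | grep -i gres
$ sinfo -N -o "%N %G"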
Manage RDU tiles with the rdu_tile GRES
If you specify the rdu_tile GRES, then Slurm can manage SambaNova resources at the tile level.
On any node, you can configure either rdu or rdu_tile. If you want to manage RDU resources at the tile level, do not specify the rdu GRES.
How you specify the rdu_tile GRES
Adding rdu_tile to a compute node is similar to adding rdu, for example in slurm.conf:
NodeName=sn101 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu_tile:32
In the example above, the Gres=rdu_tile:32
specification tells Slurm that this node expects 32 available RDU tiles.
How slurmd uses the rdu_tile GRES
When the slurmd(8) daemon on a node starts up, it uses the RDU_TILE GRES plugin to validate the inventory of healthy RDU tiles on the system against the specification in slurm.conf or gres.conf.
- If the configured count is equal to or less than the system-detected count, the configured count is used.
- If the configured count is greater than the system-detected count, the system-detected count is used.
- If the system-detected count is 0, slurmd fails.
Best practices for using multiple GRES definitions
To use Slurm with SambaNova effectively, follow these best practices:
- Separate the compute nodes that manage at the rdu level and the nodes that manage at the rdu_tile level into different partitions.
- When submitting a job, ensure that only one GRES type (rdu or rdu_tile) is specified so that Slurm can find the right node with the available resources to dispatch the job to.
- For jobs that require 2 tiles, create separate partitions for each resource requirement (2V and 2H are not identical).
Configure jobs that require the same resource to run in the same partition. For example, 1T jobs should run on a partition managed at the rdu_tile level; 2V jobs should run on a different partition managed at the rdu_tile level; and 4T jobs should run on another partition managed at the rdu level. Otherwise you can encounter random resource allocation failures that cause some jobs to fail occasionally. When submitting jobs to these partitions, always specify the GRES requirements: --gres=rdu:n or --gres=rdu_tile:n. A sketch of such a partition layout follows.
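A minimal sketch of what such a layout could look like in slurm.conf, using illustrative node and partition names and the node attributes from the earlier examples:
# Nodes managed at the rdu level
NodeName=sn-rdu[1-2] CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:100G
# Nodes managed at the rdu_tile level
NodeName=sn-tile[1-2] CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu_tile:32,rdu_mem:no_consume:100G
PartitionName=rdu_part Nodes=sn-rdu[1-2] MaxTime=INFINITE State=UP
PartitionName=tile_part Nodes=sn-tile[1-2] MaxTime=INFINITE State=UP
Jobs would then be submitted to the matching partition with the matching GRES request, for example:
$ sbatch --partition=rdu_part --gres=rdu:4 job.sh
$ srun --partition=tile_part --gres=rdu_tile:2 <command>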
Manage RDU memory with the rdu_mem GRES
The rdu_mem GRES allows you to specify the required amount of device memory for each RDU on a compute node. For each job, you can specify how much RDU memory that job requires to run, and Slurm ensures that the job runs only on hosts with sufficient memory.
Currently we assume that every RDU on a compute node has the same amount of device memory.
How you specify the rdu_mem GRES
Declare the rdu_mem GRES as no_consume so that the rdu_mem resource remains unchanged when jobs that require rdu_mem are scheduled to run. RDU memory cannot be shared between RDUs.
Here’s an example for adding rdu_mem
to a compute node:
NodeName=sn101 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:64G
In the example above, rdu_mem:no_consume:64G
tells Slurm that this node expects 64G of device memory available.
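For completeness, a job that needs a minimum amount of RDU device memory could combine the two GRES types in its request. This is a sketch only; the counts and the 32G figure are illustrative, and developers should consult the developer guide for the supported request syntax on your Slurm version:
$ srun --gres=rdu:4,rdu_mem:32G <command>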
When the slurmd(8) daemon on that node starts, it uses the RDU_MEM GRES plugin to validate the minimum RDU device memory on the system against the specification in slurm.conf or gres.conf.
- If the configured amount is equal to or less than the system-detected amount, the configured amount is used.
- If the configured amount is greater than the system-detected amount, the system-detected amount is used.
- If the system-detected amount is 0, slurmd fails.