How SambaNova administrators set up Slurm
Slurm is an open-source cluster management and job scheduling system for Linux clusters. You can use Slurm to manage SambaNova hardware resources in the same way it manages CPU and GPU resources.
This page is for system administrators who install and configure Slurm. If you are a developer who wants to use Slurm with your SambaNova system, see How SambaNova developers can use Slurm.
- Depending on how the SambaNova system was initially set up, system administrators either install and configure the SambaNova Slurm build (preferred) or modify an existing Slurm installation.
- Administrators then configure SambaNova GRES (Generic Resources).
Installation and configuration require superuser privileges.
SambaNova supported Slurm versions
Our Slurm plugin currently supports only Slurm 19.05.5 and 20.02.5. If you're using a different version of Slurm in your environment, you can still add GresTypes definitions to slurm.conf and specify GRES requirements, but RDU health check and update are not enabled. In that case, Slurm always assumes that all required resources are available and allocates those resources as requested.
- If you're not using one of those versions, you cannot install our sambanova-tools-slurm package. If you change your existing version to an unsupported version, the sambanova-tools-slurm package no longer works.
- If you want to use an unsupported version of Slurm, remove both the sambanova-tools-slurm and sambanova-runtime-plugin-slurm packages. You can still define and specify Generic Resource (GRES) requirements.
Option 1: Install and configure the SambaNova Slurm build
If you don’t have Slurm installed in your environment, you install and configure the SambaNova Slurm package and plugin.
Process overview
Follow the steps in the next section. Here’s an overview:
- Create a munge user.
- Install sambanova-tools-slurm and sambanova-runtime-plugin-slurm.
- Create and customize slurm.conf.
Install the SambaNova Slurm build
To install the SambaNova Slurm build, follow these steps:
- Create a munge user according to your organization's IT policies (see the MUNGE installation guide for more information). One way to do this is:
  $ adduser munge --disabled-password
- (Optional) If MUNGE is not yet installed, install the MUNGE dependency for user authentication:
  - For Ubuntu:
    $ sudo apt install sambanova-deps-munge
  - For Red Hat:
    $ sudo dnf install sambanova-deps-munge
- Finalize MUNGE setup:
  $ sudo chown munge: /etc/munge && sudo chown -R munge: /var/lib/munge && sudo chown -R munge: /var/log/munge
  $ sudo -u munge /usr/sbin/mungekey --verbose
  $ sudo chown munge: /etc/munge/munge.key
  $ sudo systemctl enable munge && sudo systemctl start munge
  For Red Hat, we use the MUNGE package provided by the OS. We have seen the munged service fail to start on reboot because the system removes the /var/run/munge directory on reboot. One way to avoid this problem is to add the following to munge.service (a minimal way to do that is sketched below):
  RuntimeDirectory=munge
  RuntimeDirectoryMode=0755
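  One way to apply those two settings, assuming a systemd drop-in is acceptable in your environment (the drop-in file name below is illustrative):
  $ sudo mkdir -p /etc/systemd/system/munge.service.d
  $ sudo tee /etc/systemd/system/munge.service.d/override.conf <<'EOF'
  [Service]
  # Re-create /var/run/munge with the right ownership and mode each time munged starts
  RuntimeDirectory=munge
  RuntimeDirectoryMode=0755
  EOF
  $ sudo systemctl daemon-reload && sudo systemctl restart munge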
- Create a Slurm user according to your organization's IT policies (see the Slurm installation guide for more information). One way to do this is:
  $ adduser slurm --disabled-password
- Install the SambaNova Slurm package on the controller node and on each compute node.
  - For Ubuntu:
    $ sudo apt install sambanova-tools-slurm
  - For Red Hat:
    $ sudo dnf install sambanova-tools-slurm
- On each compute node in the cluster, install the SambaNova Slurm Generic Resource (GRES) plugin:
  - For Ubuntu:
    $ sudo apt install sambanova-runtime-plugin-slurm
  - For Red Hat:
    $ sudo dnf install sambanova-runtime-plugin-slurm
- If slurm.conf is not yet present, create the file:
  - If you already have a slurm.conf file on other nodes, copy that configuration.
  - Otherwise, create slurm.conf.
- Set up Slurm by running the following commands:
  $ sudo mkdir /var/spool/slurmd && sudo chown slurm: /var/spool/slurmd && sudo chmod 755 /var/spool/slurmd && \
    sudo touch /var/log/slurmd.log && sudo chown slurm: /var/log/slurmd.log && sudo mkdir /var/spool/slurm_state && \
    sudo chown slurm: /var/spool/slurm_state && sudo chmod 755 /var/spool/slurm_state && \
    sudo touch /var/log/slurmctld.log && sudo chown slurm: /var/log/slurmctld.log && \
    sudo touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log && \
    sudo chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
- Finalize the configuration. The exact commands differ between the controller and a compute node:
  - To start on the controller, run:
    $ systemctl enable slurmctld
    $ systemctl start slurmctld
  - To start on a compute node, run:
    $ systemctl enable slurmd
    $ systemctl start slurmd
Configure the new Slurm installation
The next step is to configure the compute nodes you wish Slurm to manage and make any other general Slurm configuration changes needed.
- See the Configuration section of the Slurm Quick Start Administrator Guide for general Slurm information.
- See Configure SambaNova Slurm components for information about specifying GRES definitions and scripts.
Option 2: Add the Slurm plugin to a Slurm installation
If you’re using Slurm in your environment and you want to add a new SambaNova node to your existing cluster, you can reuse and customize your existing slurm.conf
file.
Process overview
If you have a supported Slurm version (19.05.5 or 20.02.5):
- Install sambanova-tools-slurm and sambanova-runtime-plugin-slurm.
- Copy the existing slurm.conf to the SambaNova node and customize it.
If you do not have a supported Slurm version, then you can still add GresType definitions to slurm.conf
and specify GRES requirements, but RDU health check and update are not enabled. In that case, Slurm always assumes that all required resources are available and allocates those resources as requested.
- Install a supported version of Slurm on the SambaNova node.
- Copy the existing slurm.conf to the SambaNova node and customize it (see the example after this list).
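As a rough sketch, copying the file can be as simple as the following, assuming the controller keeps its configuration in /etc/slurm/slurm.conf and the new node is reachable as sn-node1 (both the path and the host name are illustrative and may differ in your installation):
$ scp /etc/slurm/slurm.conf sn-node1:/tmp/slurm.conf
$ ssh sn-node1 'sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf'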
Follow the steps in the next section.
Add the SambaNova Slurm plugin
Follow these steps to add the SambaNova Slurm plugin:
- On each compute node in the cluster, install the SambaNova Slurm Generic Resource (GRES) plugin:
  - For Ubuntu:
    $ sudo apt install sambanova-runtime-plugin-slurm
  - For Red Hat:
    $ sudo dnf install sambanova-runtime-plugin-slurm
- Restart the slurmd service on the compute nodes:
  $ sudo systemctl restart slurmd
Configure SambaNova Slurm components
If you specify the expected number of RDUs and also check RDU health, you can prevent RDU resource oversubscription if one of the RDUs needs to be taken offline.
- The Slurm RDU plugin checks the configuration file and validates the configuration.
- The TaskProlog and TaskEpilog scripts check RDU health.
Make these changes:
- Open the slurm.conf file.
- To support SambaNova Reconfigurable Data Units (RDUs), specify the GRES definitions:
  GresTypes=rdu,rdu_tile,rdu_mem
- To enable the health checks, add the following lines to slurm.conf:
  EpilogSlurmctld=/opt/sambaflow/slurm/python/sn_inventory_update
  TaskProlog=/opt/sambaflow/slurm/python/sn_inventory_check
  TaskEpilog=/opt/sambaflow/slurm/python/sn_inventory_check
  The three scripts TaskProlog, TaskEpilog, and EpilogSlurmctld monitor RDU health on compute nodes before and after each job is executed. The TaskEpilog script compares the inventory of healthy RDU resources recorded during TaskProlog with the inventory after the job has executed.
- Restart slurmctld for the change to take effect.
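On a systemd-managed controller, one way to restart the daemon is:
$ sudo systemctl restart slurmctld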
Here's an example slurm.conf file:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=labhost1.example.com
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm_state
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
#
#
# Scripts
EpilogSlurmctld=/opt/sambaflow/slurm/python/sn_inventory_update
TaskProlog=/opt/sambaflow/slurm/python/sn_inventory_check
TaskEpilog=/opt/sambaflow/slurm/python/sn_inventory_check
#
#
# Generic resources
GresTypes=rdu,rdu_tile,rdu_mem
#
#
# COMPUTE NODES
# Each node has 8 healthy RDUs and declares 100G of RDU device memory as no_consume
NodeName=labhost1.example.com CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=1031737 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:100G
NodeName=labhost2.example.com CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=1031737 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:100G
PartitionName=test Nodes=labhost1.example.com,labhost2.example.com Default=YES MaxTime=INFINITE State=UP
If the inventory of the Slurm cluster has changed:
- Change the Slurm GRES definition to avoid oversubscription on the node.
- TaskEpilog detects the change and drains the node with a specific reason.
- EpilogSlurmctld notices the change and updates the GRES state at job termination.
Without this update, it is possible for Slurm to oversubscribe the node, and jobs may fail.
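To see whether a node has been drained this way, and with what reason, you can query the cluster with standard Slurm commands (the node name is illustrative):
$ sinfo -R
$ scontrol show node labhost1.example.com | grep -E 'State|Reason|Gres'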
Because a privileged user must run the Slurm scontrol update
command, we provide the sn_gres_update
setuid executable to update the GRES resource definition for the compute node. This executable updates the GRES definition for only the node on which it is executed.
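For reference, the manual equivalent that a privileged administrator would otherwise run looks something like the following; the node name and count are illustrative, and whether you also resume the node afterward depends on your workflow. In normal operation, prefer the provided sn_gres_update executable.
$ sudo scontrol update NodeName=labhost1.example.com Gres=rdu:7
$ sudo scontrol update NodeName=labhost1.example.com State=RESUME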
Slurm configuration troubleshooting
The Slurm cluster already uses scripts
If the Slurm cluster already uses one or more of the scripts (TaskProlog, TaskEpilog, and/or EpilogSlurmctld), create a shell script that includes the existing script and the SambaNova-provided sn_inventory_check and sn_inventory_update (a sketch follows below).
The same sn_inventory_check
script serves as both prolog and epilog and uses the context environment variables set by Slurm to vary behavior appropriately.
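A minimal sketch of such a wrapper for TaskProlog, assuming the existing site prolog lives at /etc/slurm/task_prolog.local (an illustrative path); make the wrapper executable and point TaskProlog= in slurm.conf at it. The same pattern applies to TaskEpilog and EpilogSlurmctld, substituting sn_inventory_check or sn_inventory_update as appropriate.
#!/bin/bash
# Wrapper TaskProlog: run the existing site prolog first, then the
# SambaNova inventory check; fail the task if either reports an error.
/etc/slurm/task_prolog.local || exit $?
exec /opt/sambaflow/slurm/python/sn_inventory_check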
Configure SambaNova GRES (Generic Resources)
Generic Resources (GRES) are resources associated with a specific node that can be allocated to jobs and steps. SambaNova has defined some resources to enable Slurm to manage SambaNova RDUs.
Manage SambaNova RDUs with the rdu GRES
We have defined a new Slurm generic resource (GRES) type so that Slurm can manage SambaNova RDUs. On each compute node, Slurm GRES resources can be specified:
- either in the shared slurm.conf file
- or in a gres.conf file.
The specification depends on the hardware you are using. For example, here are some specifications for SN10 and SN30:
SN10: Gres=rdu:8,rdu_tile:32,rdu_mem
SN30: Gres=rdu:16,rdu_tile:64,rdu_mem
How you specify the rdu GRES
For each SambaNova host in the cluster, do one of the following:
- Add a line to slurm.conf specifying the CPU and RDU resources of the system.
- Or specify the CPU resources in slurm.conf and the RDU GRES resources in the gres.conf file.
For example, to add a SambaNova 10-8 system named sn101
to the cluster, add the following line to the slurm.conf
file:
NodeName=sn101 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu:8
The Gres=rdu:8
specification tells Slurm that this node expects 8 available RDUs.
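If you prefer the second approach, a gres.conf entry on the sn101 node could look like the sketch below (a count-only definition with no device files; check the Slurm gres.conf documentation for your version to confirm whether the slurm.conf node line must also carry the Gres= specification):
# /etc/slurm/gres.conf on sn101 (the path may differ in your installation)
Name=rdu Count=8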
How slurmd uses the rdu GRES
When the slurmd(8) daemon on a node starts up, it uses the RDU GRES plugin to validate the inventory of healthy RDUs on the system against the specification in slurm.conf.
- If the configured count is equal to or less than the system-detected count, the configured count is used.
- If the configured count is greater than the system-detected count, the system-detected count is used.
- If the system-detected count is 0, slurmd fails.
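To verify what the controller actually registered for a node after slurmd starts, standard Slurm queries work (the node name is illustrative):
$ scontrol show node sn101 | grep -i gres
$ sinfo -N -o "%N %G"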
Manage RDU tiles with the rdu_tile GRES
If you specify the rdu_tile GRES, then Slurm can manage SambaNova resources at the tile level.
On any node, you can configure either rdu or rdu_tile. If you want to manage RDU resources at the tile level, do not specify the rdu GRES.
How you specify the rdu_tile GRES
Adding rdu_tile to a compute node is similar to adding rdu, for example in slurm.conf:
NodeName=sn101 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu_tile:32
In the example above, the Gres=rdu_tile:32
specification tells Slurm that this node expects 32 available RDU tiles.
How slurmd uses the rdu_tile GRES
When the slurmd(8) daemon on a node starts up, it uses the RDU_TILE GRES plugin to validate the inventory of healthy RDU tiles on the system against the specification in slurm.conf or gres.conf.
- If the configured count is equal to or less than the system-detected count, the configured count is used.
- If the configured count is greater than the system-detected count, the system-detected count is used.
- If the system-detected count is 0, slurmd fails.
Best practices for using multiple GRES definitions
To use Slurm with SambaNova effectively, follow these best practices:
- Separate the compute nodes that manage at the rdu level and the nodes that manage at the rdu_tile level into different partitions.
- When submitting a job, ensure that only one GRES type (rdu or rdu_tile) is specified so that Slurm can find the right node with the available resources to dispatch the job to.
- For jobs that require 2 tiles, create separate partitions for each resource requirement (2V and 2H are not identical).
Configure jobs that require the same resource to run in the same partition. For example, 1T jobs should run on a partition managed at the rdu_tile level; 2V jobs should run on a different partition managed at the rdu_tile level; and 4T jobs should run on another partition managed at the rdu level. Otherwise you can encounter random resource allocation failures that cause some jobs to fail occasionally. When submitting jobs to these partitions, always specify the GRES requirements: --gres=rdu:n or --gres=rdu_tile:n. A sketch of such a partition layout follows.
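A minimal sketch of what such a layout could look like in slurm.conf, using illustrative node and partition names and the node attributes from the earlier examples:
# Nodes managed at the rdu level
NodeName=sn-rdu[1-2] CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:100G
# Nodes managed at the rdu_tile level
NodeName=sn-tile[1-2] CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu_tile:32,rdu_mem:no_consume:100G
PartitionName=rdu_part Nodes=sn-rdu[1-2] MaxTime=INFINITE State=UP
PartitionName=tile_part Nodes=sn-tile[1-2] MaxTime=INFINITE State=UP
Jobs would then be submitted to the matching partition with the matching GRES request, for example:
$ sbatch --partition=rdu_part --gres=rdu:4 job.sh
$ srun --partition=tile_part --gres=rdu_tile:2 <command>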
Manage RDU memory with the rdu_mem GRES
The rdu_mem GRES allows you to specify the required amount of device memory for each RDU on a compute node. For each job, you can specify how much RDU memory that job requires to run, and Slurm ensures that the job runs only on hosts with sufficient memory.
Currently we assume that every RDU on a compute node has the same amount of device memory.
How you specify the rdu_mem GRES
Declare the rdu_mem GRES as no_consume so that the rdu_mem resource remains unchanged when jobs that require rdu_mem are scheduled to run. RDU memory cannot be shared between RDUs.
Here’s an example for adding rdu_mem
to a compute node:
NodeName=sn101 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN Gres=rdu:8,rdu_mem:no_consume:64G
In the example above, rdu_mem:no_consume:64G
tells Slurm that this node expects 64G of device memory available.
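For completeness, a job that needs a minimum amount of RDU device memory could combine the two GRES types in its request. This is a sketch only; the counts and the 32G figure are illustrative, and developers should consult the developer guide for the supported request syntax on your Slurm version:
$ srun --gres=rdu:4,rdu_mem:32G <command>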
When the slurmd(8) daemon on that node starts, it uses the RDU_MEM GRES plugin to validate the minimum RDU device memory on the system against the specification in slurm.conf or gres.conf.
- If the configured amount is equal to or less than the system-detected amount, the configured amount is used.
- If the configured amount is greater than the system-detected amount, the system-detected amount is used.
- If the system-detected amount is 0, slurmd fails.