How SambaNova developers can use Slurm

SambaNova developers can specify RDU, RDU tile, and RDU memory requirements for jobs submitted through Slurm.

If you have to call scancel on RDU jobs, use the --signal=TERM option to allow the SambaFlow runtime to gracefully terminate applications.
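If you cancel jobs from a script, the same option can be passed programmatically. The following is a minimal sketch, not part of SambaFlow: the helper names and the job ID are hypothetical, and it assumes the Slurm client tools are on the PATH.

```python
import subprocess
from typing import List


def build_scancel_command(job_id: int) -> List[str]:
    """Return a scancel invocation that sends SIGTERM to the given job."""
    return ["scancel", "--signal=TERM", str(job_id)]


def cancel_rdu_job(job_id: int) -> None:
    """Run scancel; raises CalledProcessError if scancel reports a failure."""
    subprocess.run(build_scancel_command(job_id), check=True)


# 12345 is a placeholder job ID used only for illustration.
print(" ".join(build_scancel_command(12345)))
```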

Specify RDU resource requirements in batch scripts

Jobs that are submitted to Slurm should specify the resources they require so that Slurm can ensure these resources are available when the job is executed.

You can:

Specify CPU, GPU, and other resources using standard Slurm commands and their arguments. See Quick Start User Guide for an introduction to Slurm.
Add SambaNova resource requirements in a similar manner.

Use sinfo -o %G to list the generic resources (GRES) configured in your local setup. See https://slurm.schedmd.com/sinfo.html#OPT_%G.
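The GRES field that sinfo prints for the %G format specifier can be split into names and counts for use in scripts. The sketch below assumes the usual Slurm form name[:type]:count, with an optional parenthesized suffix; the function name and the sample values are illustrative and do not come from a real cluster.

```python
import re
from typing import Dict


def parse_gres(gres_field: str) -> Dict[str, str]:
    """Map each GRES name to its count string, for example {'rdu': '8'}."""
    field = gres_field.strip()
    if field in ("", "(null)"):  # sinfo prints (null) when no GRES is configured
        return {}
    # Drop parenthesized suffixes such as "(S:0-1)" before splitting on commas.
    field = re.sub(r"\([^)]*\)", "", field)
    result = {}
    for entry in field.split(","):
        parts = entry.split(":")
        name = parts[0]
        # A GRES listed without a count defaults to 1.
        count = parts[-1] if len(parts) > 1 else "1"
        result[name] = count
    return result


# Sample string only; real output depends on the local configuration.
print(parse_gres("rdu:8,rdu_mem:512G"))
```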

Specify RDU requirements

Specify RDU requirements like any other generic resources. For example, to indicate that an application submitted with a job batch script via sbatch requires four RDUs, add the following line to the batch script:

#SBATCH --gres=rdu:4

Before dispatching the job, Slurm will ensure that any host chosen to execute this job has at least four RDUs available.
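When batch scripts are generated rather than written by hand, the directive can be emitted from the required RDU count. This is a minimal sketch; render_batch_script and the application command are hypothetical names used only for illustration.

```python
def render_batch_script(rdu_count: int, command: str) -> str:
    """Return the text of a batch script that requests rdu_count RDUs."""
    if rdu_count < 1:
        raise ValueError("rdu_count must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --gres=rdu:{rdu_count}",  # the GRES directive described above
        command,
    ]
    return "\n".join(lines) + "\n"


# The application command is a placeholder.
print(render_batch_script(4, "python app.py run -p app.pef"))
```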

Specify RDU tile requirements

To indicate that an application submitted via sbatch requires one RDU tile, add the following line to the batch script:

#SBATCH --gres=rdu_tile:1

Before dispatching the job, Slurm will ensure that any host chosen to execute this job has at least one RDU tile available.

A single job can specify only one of the GRES types rdu and rdu_tile. Submit only 1T, 2V, or 2H jobs to compute nodes that are managed at the rdu_tile level, and use a separate partition for each GRES type. When submitting jobs to a partition that is managed at the tile level, always specify --gres=rdu_tile:n.
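A submission wrapper can enforce the rule that rdu and rdu_tile are mutually exclusive before the job reaches Slurm. The following sketch uses a hypothetical helper, gres_option, and is not part of the SambaNova tooling.

```python
def gres_option(rdu: int = 0, rdu_tile: int = 0) -> str:
    """Return a --gres option for either RDUs or RDU tiles, never both."""
    if rdu and rdu_tile:
        raise ValueError("specify only one of rdu and rdu_tile for a job")
    if rdu_tile:
        return f"--gres=rdu_tile:{rdu_tile}"
    if rdu:
        return f"--gres=rdu:{rdu}"
    raise ValueError("request at least one rdu or rdu_tile")


print(gres_option(rdu_tile=1))
```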

Specify RDU memory requirements

Use --gres=rdu_mem to specify the RDU memory requirements of an application. To indicate that an application requires 200GiB of device memory, add the following line to the batch script:

#SBATCH --gres=rdu_mem:200G

Before dispatching the job, Slurm will ensure that any host chosen to execute this job has at least 200GiB of device memory per RDU configured.

Python library

Computation graphs that are executed on RDUs are stored in PEF files, which specify the required resources. SambaNova provides a Python library to read these resource requirements and some example Slurm job submission code in Python to show how to use this library to automate the composition of the correct GRES specifications.
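The API of the SambaNova library is not shown here, so the sketch below uses a hypothetical PefRequirements record as a stand-in for the values read from a PEF file. It only illustrates how such values might be turned into a GRES specification. Combining rdu and rdu_mem in one comma-separated --gres option is an assumption based on standard Slurm syntax; confirm it against your site configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PefRequirements:
    """Hypothetical stand-in for resource requirements read from a PEF file."""
    rdus: int
    rdu_mem_gib: Optional[int] = None


def compose_gres(req: PefRequirements) -> str:
    """Build a --gres option from the requirements record."""
    entries = [f"rdu:{req.rdus}"]
    if req.rdu_mem_gib is not None:
        # Slurm accepts a G suffix on GRES counts.
        entries.append(f"rdu_mem:{req.rdu_mem_gib}G")
    return "--gres=" + ",".join(entries)


print(compose_gres(PefRequirements(rdus=4, rdu_mem_gib=200)))
```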

Slurm feeder script

Before you can use the Slurm feeder script, you have to install Slurm.

If Slurm is not yet installed in your environment, see Install and configure the SambaNova Slurm build.
If Slurm is already installed in your environment, see Add the Slurm plugin to a Slurm installation.

The slurm_feeder script checks resource requirements and compares them to what’s available. The script reads the requirements from the PEF file, constructs a batch script that specifies the resources, and submits the job.

If no resource that meets the requirements is currently available, the job stays in the ‘PENDING’ state until it can be scheduled.
If the resource requirements exceed the resources of every node, the job fails immediately.
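The two outcomes above can be expressed as a small decision function. This is an illustrative sketch of the described behavior, not the actual slurm_feeder implementation; the function name, the status strings other than PENDING, and the per-node RDU counts are assumptions.

```python
from typing import List


def classify_request(required: int, node_totals: List[int], node_free: List[int]) -> str:
    """Classify a request given each node's total and currently free RDUs."""
    # No node is large enough even when idle: the job can never run.
    if all(required > total for total in node_totals):
        return "FAILED"
    # At least one node has enough free RDUs right now.
    if any(required <= free for free in node_free):
        return "RUNNABLE"
    # A node is large enough, but its resources are busy at the moment.
    return "PENDING"


print(classify_request(4, node_totals=[8, 8], node_free=[2, 3]))
```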

Prerequisites

Before you run the feeder script, make sure your environment meets these prerequisites:

Python 3.7 is required.
A supported version of SambaFlow is installed.
/path (used in the script below) must be accessible on the compute node.

Script

$ python3 /opt/sambaflow/slurm/python/slurm_feeder -c sbatch -p /path/logreg_00_03_25.pef -a /path/logreg.py --python-arg="run"

Arguments

To see all the arguments, run slurm_feeder -h on your system. A Python application or a shell script and its corresponding PEF file are required arguments:

-p PEF_FILE, --pef-file PEF_FILE: Path to the PEF file to execute.
-c {srun,sbatch,print_resource}, --command {srun,sbatch,print_resource}: Slurm command to use (srun, sbatch, or print_resource). You can:
- Use the srun command directly.
- Use the sbatch command to queue the job.
  
  To make better use of Slurm scheduling, use sbatch.
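If you call the feeder script from your own tooling, the command line can be assembled from these arguments. The sketch below uses only options that appear on this page; -a and --python-arg are taken from the example invocation above, and build_feeder_command is a hypothetical helper name.

```python
from typing import List

FEEDER = "/opt/sambaflow/slurm/python/slurm_feeder"
COMMANDS = ("srun", "sbatch", "print_resource")


def build_feeder_command(pef_file: str, app: str, command: str = "sbatch",
                         python_arg: str = "run") -> List[str]:
    """Return a slurm_feeder invocation as an argument list."""
    if command not in COMMANDS:
        raise ValueError(f"command must be one of {COMMANDS}")
    return ["python3", FEEDER, "-c", command, "-p", pef_file,
            "-a", app, f"--python-arg={python_arg}"]


# /path is the same placeholder directory used in the example above.
print(" ".join(build_feeder_command("/path/logreg_00_03_25.pef", "/path/logreg.py")))
```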

Running MPI jobs under Slurm

We use MPICH 3.3+ to run data-parallel applications.

The sambaflow package installs all the necessary dependencies to run data-parallel applications. The key additional packages are:

sambanova-deps-pytorch
sambanova-deps-mpich
sambaflow-deps-venv

If any of these packages are missing, remove and reinstall SambaFlow, as follows:

For Ubuntu:
```
$ sudo apt install sambaflow
```
For Red Hat:
```
$ sudo dnf install sambaflow
```

For example, to run under Slurm with sbatch, create a batch file that runs the application with mpirun, but omit the number of processes, the host list, and the host file. Slurm provides this information. Do not invoke mpirun with srun.

Example batch file:

#!/bin/bash
#SBATCH --gres=rdu_tile:1
source /opt/sambaflow/venv/bin/activate
cd /path/to/app
/opt/mpich-3.4.3/bin/mpirun python ffn_mnist_ms.py run -p ffn_mnist_ms.pef --data-parallel -n 1000 -e 1

Submit with the sbatch command as usual. You can include a list of hosts to use with -w and the number of tasks (processes) with -n:

$ sbatch -w sn101 -n 2 ~/batch_dp.sh
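To drive this submission from Python, you can build the same command and read the job ID from the confirmation line that sbatch prints on success. This is a minimal sketch with hypothetical helper names; the sample output line uses a made-up job ID.

```python
import re
from typing import List


def build_sbatch_command(script: str, hosts: str, ntasks: int) -> List[str]:
    """Return an sbatch invocation with a host list (-w) and a task count (-n)."""
    return ["sbatch", "-w", hosts, "-n", str(ntasks), script]


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the job ID from the 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError("no job ID found in sbatch output")
    return int(match.group(1))


print(build_sbatch_command("batch_dp.sh", "sn101", 2))
print(parse_job_id("Submitted batch job 4242\n"))  # 4242 is an illustrative ID
```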

Or use the slurm_feeder script:

$ python /opt/sambaflow/slurm/python/slurm_feeder -c sbatch --single-tile --mpirun -p ffn_mnist_ms.pef -a ffn_mnist_ms.py -w sn101 --python-arg=run --data-parallel -n 1000 -e 1 -b 2 -mb 1

To ensure higher availability of RDU resources with data-parallel applications, set the KillOnBadExit parameter in the Slurm configuration. With this setting, a job step is terminated immediately if any of its tasks exits with a non-zero exit code, so the job no longer holds on to resources while it waits for the timeout. Alternatively, if you use srun directly, you can pass the -K argument. See the Slurm srun documentation for details.
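To confirm that the setting is in place, you can scan the text of your slurm.conf file. The sketch below is a simplified check under stated assumptions: it looks only at lines of the form KillOnBadExit=value, treats the value 1 as enabled, and ignores included files. The function name and the sample configuration are illustrative.

```python
def kill_on_bad_exit_enabled(conf_text: str) -> bool:
    """Return True if the configuration text sets KillOnBadExit=1."""
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Parameter names in slurm.conf are case insensitive.
        if key.strip().lower() == "killonbadexit":
            return value.strip() == "1"
    return False  # not set: Slurm's default is 0


sample = "# sample fragment\nClusterName=demo\nKillOnBadExit=1\n"
print(kill_on_bad_exit_enabled(sample))
```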