Compiler argument reference

The SambaNova workflow includes a compilation step. Compilation generates a dataflow graph of the model, similar to a PyTorch computational graph, and encapsulates it in a PEF file. You submit the PEF file when you do training and inference runs. See Workflows.

This page is a reference for commonly used compiler arguments. You can experiment with most of these arguments yourself, but some are used only when you’re working with SambaNova Support.

The --help output includes commonly used arguments by default. If your model includes the dev_mode=TRUE argument or if you pass in --debug from the command line, the --help output includes experimental arguments that are currently supported but are subject to change without notice.

General compiler arguments

--pef-name <filename>

Name of the PEF file that the compiler generates and of the subdirectory for other compilation artifacts. By default the compiler uses the compilation timestamp and the process ID to name the subdirectory.

Use this argument to give your PEF file a meaningful name. For example, use today’s date in the name, or use an inf extension when you compile for inference. If you experiment with different sets of hyperparameters, consider including them in the name. For example, if you are changing batch sizes in different training runs, add -b32 or -b16 to the PEF name to note the batch size. For example:

python $HOME/sambaflow-apps/starters/ compile --pef-name logreg-0923

--inference


Specify --inference to compile and generate a PEF for inference. If you specify this argument, the compiler performs only a forward pass and doesn’t perform certain optimizations. See How model compilation works.

Just as with compilation for a training run, you can specify a PEF file and other compiler arguments. For example:

python compile --inference --pef-name=lenet-compile

--output-folder <folder-name>

Optional output folder. The compiler places the PEF file and log files in the output folder, which defaults to ./out/<pef-name>. To set the folder explicitly, run a command like this:

python compile --pef-name=lenet --output-folder=out_test7

-b <size>, --batch-size <size>

Informs the compiler which batch size will be used later during training. Set batch-size to 4, 8, 16, 32, or even higher to support more efficient training. The highest value you can use depends on the model and on available hardware resources. Different batch sizes can make training faster or slower to reach the target accuracy. It usually takes some experimentation to find the right batch size for a model.

python compile --pef-name=lenet-1023 --batch-size 4

Log management arguments

You cannot currently set the log level when running the compile command. You can only switch verbose or debug logging on or off.

-v, --verbose

Shows verbose log output, similar to the --debug output.

When you run the compiler with --verbose, specify an output folder: with verbose output, the message that reports the location of the generated PEF file is no longer at the end of the output. For example:

python compile --pef-name=logreg-1023 --output-folder=out-1023 -v

--debug


When you work on a model with SambaNova Support, they might ask you to start the model in debug mode. In debug mode, the compiler sends more messages to stdout (and to the logs). In addition, the --help output shows some arguments that are customarily used only with Customer Support.

python compile --pef-name=logreg-1023 --debug

--log-dir <directory-name>

Specifies a non-default directory for log files that contain warnings.

Only log files with warnings are sent to the log-dir directory. All other logs are still sent to the output folder.

python compile --pef-name=logreg-1023 --log-dir=logs-1023

Compiler optimization arguments

You can specify an optimization level for the compiler. SambaFlow compiler overview explains the effect of each level.

-o0


When you specify -o0, each PyTorch operator is compiled independently.

This is the safest option, but because the compiler doesn’t perform optimizations, training and inference take longer than with o3. See Compiler optimization modes for details.

Here’s a very simple example. We’re working on additional examples.

python compile --pef-name=lenet-1123 -o0

-o1


With -o1, PyTorch operators are fused together into subgraphs. Each subgraph is compiled independently.

See Compiler optimization modes for details.

If you don’t specify --optimization-rules, the compiler behavior is the same as with -o0.

python compile --pef-name=lenet-1123 -o1 --optimization-rules /opt/sambaflow/apps/nlp/my-custom-rule.yaml

--optimization-rules <path-to-rules-file>.yaml

Optimization rules .yaml file to use with o1.

This argument enables compiler optimizations and is currently required when you compile with o1. We expect to remove this requirement soon.

-o3


With -o3, the compiler has a global view of the entire graph. With this release (1.17), -o3 is the default.

This option usually has the longest compile time but fast runtime performance when used with model-specific HD files. Because the compiler attempts to optimize the whole graph, compilation might fail in some cases.

With -o3, you can optionally annotate subgraphs with --enable-hypersection. In that case, each annotated subgraph is compiled independently. If there are duplicate subgraphs, only one is compiled and reused.
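
For example, a hypothetical invocation on a model whose subgraphs are already annotated might look like this (the PEF name is illustrative):

```shell
# Compile in o3 mode and reuse one compiled copy of duplicate annotated subgraphs
python compile --pef-name=lenet-1123 -o3 --enable-hypersection
```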

--compiler-mode <name>

Specifies the compiler mode. Using this flag with the right model type improves performance.

nlp is currently the only supported option.
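
For example, a sketch of compiling an NLP model with this flag (the PEF name is hypothetical):

```shell
# nlp is currently the only supported compiler mode
python compile --pef-name=transformer-0124 --compiler-mode nlp
```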


--enable-hypersection

If you’re running in o3 mode (the default), you can annotate your model’s Python code to tell the compiler about duplicate subgraphs and get the performance improvements. Use this option to enable that optimization.

This option is used only in conjunction with o3 compiler mode, and usually when working with SambaNova Support. In the future, expect to use o1 mode and operator fusion rule yaml files instead.

--resources-scaling-factors <factors>

Sometimes the compiler underestimates or overestimates the RDU resources that are needed for some decisions. Overestimation can result in compilation failures, and underestimation can result in poor performance. If compilation fails, you can use this flag to force the compiler to assume it has fewer resources available than it actually has.

Specify 3 or 4 floats. A value of 1.0 means that the compiler can see all available resources.

  • Three floats: scaling factor for forward, backward and optimizer graphs

  • Four floats: scaling factor for forward, backward, gradient normalization and optimizer graphs

For example:

python compile --pef-name=lenet1223 --resources-scaling-factors 1 0.8 0.8

The compiler assumes that it can use all available resources for forward graphs and 80% for backward and optimizer graphs.
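
Similarly, a hypothetical four-float variant also scales the gradient normalization graph (the values here are illustrative):

```shell
# Scaling factors, in order: forward, backward, gradient normalization, optimizer
python compile --pef-name=lenet1223 --resources-scaling-factors 1 0.8 0.9 0.8
```

Here the compiler assumes full resources for the forward graph, 80% for the backward and optimizer graphs, and 90% for gradient normalization.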


--mac-v2

Use this option if you have a legacy model that relies on an earlier version of the compiler and you see suboptimal compilation.

python compile --pef-name=1223 --mac-v2

Hardware configuration arguments

--arch [native|sn10|sn20|sn30]

Allows you to compile for a different target architecture. For example, if you’re compiling on an SN30 system but expect to run the model on an SN20 system, you can use this flag.

Default is native, that is, the compiler targets the architecture of the hardware that you’re running on.

The options are lowercase: sn20, sn30, and so on. You cannot use SN20, SN30, and so on.

python compile --pef-name=logreg-0923 --arch=sn20

This command compiles the PEF to run on an SN20 system even if you’re compiling on an SN30 system or on a CPU-only node.

Tensor parallel arguments

The following arguments let you control tensor parallel behavior. See How to use tensor parallel mode (Beta) for details, including an example of an operator fusion yaml file to use together with tensor parallel.

--tensor-parallel batch|weight

Instructs the compiler to run in tensor parallel mode.

  • Batch mode splits tensors on the batch dimension. If the data tensors are larger than weight tensors, then batch mode has better performance.

  • Weight mode splits tensors on the dimension of the weight tensor. If the weight tensors are larger than data tensors, weight mode has better performance.

--n-chips [1|2|8]

Specifies the number of chips, that is, RDUs. Tensor parallel works only on 1 node, that is, up to 8 RDUs on SN10 2PDU (aka SN20) and SN30 systems. Defaults to 1 RDU.
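
For example, a hypothetical sketch that compiles in weight mode across two RDUs (the PEF name is illustrative):

```shell
# Split tensors on the weight dimension and target 2 RDUs
python compile --pef-name=gpt-tp-0124 --tensor-parallel weight --n-chips 2
```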

Data parallel arguments


--data-parallel

Causes the compiler to add the gather and reduce sections and buffers to the dataflow graph to support data parallel operation. See How to use data parallel mode for some prerequisites and best practices.

python compile --data-parallel -ws 2 --pef-name=logreg-1223

--world-size <integer>, -ws <integer>

Defines the minimum number of application replicas to be launched when the model is trained in data parallel mode. For compilation, set the value to 2. The actual number of replicas to be launched is defined at runtime.

python compile --data-parallel -ws 2 --pef-name=logreg-1223

For use with Customer Support

The following options are included in the compile --help output by default, but are reserved for use with SambaNova Support.

--compiler-configs-file COMPILER_CONFIGS_FILE

--mac-human-decision MAC_HUMAN_DECISION

--grad-accumulation-steps GRAD_ACCUMULATION_STEPS

--num-spatial-batches NUM_SPATIAL_BATCHES

--model-parallel (requires a human decision file)

--n-chips <integer> (use only with model-parallel, which is for use with Customer Support only)

Deprecated argument

The following argument is deprecated and will not be supported in future releases.

  • --o1-rule-list <yaml-file>. Starting with 1.17, this argument and related options are deprecated. Use the new --optimization-rules argument, discussed in Compiler optimization modes, instead.
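
For example, a hypothetical migration from the deprecated argument (the rules file and PEF names are illustrative):

```shell
# Deprecated form:
#   python compile -o1 --o1-rule-list my-rules.yaml --pef-name=lenet-0124
# Use the new argument instead:
python compile -o1 --optimization-rules my-rules.yaml --pef-name=lenet-0124
```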