How to use tensor parallel mode (Beta)

Internally, SambaFlow supports several types of model parallelization, which make running a model (training, fine-tuning, and so on) more efficient. You control parallelization in these ways:

  • Data parallel mode: You can compile and run the model in data parallel mode. Some applications run much faster in data parallel mode. See How to use data parallel mode.

  • Tensor parallel mode: You can instruct the compiler to perform tensor parallel optimization, which also results in more efficient execution when you run the model. Enhancements to tensor parallel compilation in release 1.18 automate and simplify compiling a model to run on multiple RDUs. Using tensor parallel during compilation:

    • Speeds up performance during training, fine tuning, and inference.

    • Helps in situations where large models exceed the memory limit of a single RDU.

What is tensor parallel compilation?

With tensor parallel compilation, some dimensions of tensors are split across RDUs. Users can control how many RDUs to use, and whether to use batch mode or weight mode.

Consider the following example, which starts with this graph:

Figure: computation for 1 RDU before tensor parallel is applied

When the compiler performs a split on dimension M:

  • RDU 0 and RDU 1 both have half of the computation workload.

  • The X-0/Y-0 and X-1/Y-1 tensors are each half of the original X and Y tensors.

  • The compiler inserts an additional Add operation that accumulates the results from both RDUs to produce the final output.

  • There might be p2p communication between the two RDUs:

    • At the beginning (for the input)

    • Right before the Add operation, where RDU 1 might need to send its result to RDU 0.

Figure: computation for 2 RDUs
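
To see why the extra Add is needed, here is a minimal NumPy sketch of the same split. It is only an illustration of the arithmetic, not SambaFlow code, and it assumes that M is the inner (reduction) dimension of the GEMM, which is why each RDU produces a partial result.

    import numpy as np

    # Illustration only: the math of splitting a GEMM across two RDUs.
    B, M, N = 4, 8, 6           # batch size, split dimension, output dimension
    X = np.random.rand(B, M)    # input activations
    Y = np.random.rand(M, N)    # weights

    # Single-RDU reference result.
    out_full = X @ Y

    # Split X and Y in half along M: each "RDU" holds half the weights
    # and does half of the multiply-accumulate work.
    X0, X1 = X[:, : M // 2], X[:, M // 2 :]
    Y0, Y1 = Y[: M // 2, :], Y[M // 2 :, :]
    partial_rdu0 = X0 @ Y0      # computed on RDU 0
    partial_rdu1 = X1 @ Y1      # computed on RDU 1

    # The extra Add accumulates the partial results (RDU 1 sends its
    # result to RDU 0 over p2p before this step).
    out_split = partial_rdu0 + partial_rdu1

    assert np.allclose(out_full, out_split)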

How to use tensor parallel compilation

When you compile your model, you can use command-line arguments to specify and customize tensor parallel mode.

Turn on tensor parallel mode

To compile using tensor parallel mode, you specify --tensor-parallel on the command line. You must specify batch or weight.

If you’re not compiling in tensor parallel mode, the default is batch. In tensor parallel mode, you must explicitly specify either batch or weight.
$ python <> compile --pef-name <pef_name> --tensor-parallel [batch|weight]
  • Batch mode splits on the batch dimension.

  • Weight mode splits on the weight dimension.

Which mode to choose depends on the relative size of the data tensors and the weight tensors; a rough sketch of this comparison follows the list below.

  • If the data tensors are larger than the weight tensors, then batch mode results in better performance.

  • If the weight tensors are larger than data tensors, then weight mode results in better performance.
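
As a rough illustration of that comparison, the sketch below picks a mode from total element counts. The function name and inputs are hypothetical; this is not part of the SambaFlow API.

    def suggest_tp_mode(weight_elems: int, data_elems: int) -> str:
        """Rough rule of thumb: split on whichever kind of tensor dominates.

        weight_elems / data_elems are total element counts for the model's
        weight tensors and data (activation) tensors.
        """
        return "batch" if data_elems > weight_elems else "weight"

    # Example: activations dominate, so batch mode is suggested.
    print(suggest_tp_mode(weight_elems=10_000_000, data_elems=50_000_000))  # -> batch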

Tensor parallel mode applies only at compile time. The generated PEF file encapsulates the properties that are specified by the arguments. During training and inference, the information that’s encapsulated in the PEF is then used to run the model.

Specify the number of RDUs (chips)

By default, tensor parallel compilation uses one RDU. You can use --n-chips to change the number of RDUs.

$ python <> compile --pef-name <pef_name> --tensor-parallel [batch|weight] --n-chips [1|2|8]
Tensor parallel mode works only within one node. On currently shipping hardware, the maximum number of RDUs is 8 per node.

How to use tensor parallel mode with an operator fusion rule

If you are compiling with o1, you can customize the compiler’s behavior with an operator fusion rule YAML file. In that YAML file, you can also customize the behavior of tensor parallel weight mode.

Operator fusion rules work only with weight mode. Batch mode ignores any tensor parallel dimension settings in the YAML file.

If you’re using the GEMM heuristic, you can specify the dimension to be either M or K, as follows:

        name: SAFE_GEMM
        tp_dim: [M|K]

The M and K values specify whether to split the M or the K dimension of the weight tensors of the GEMM operation. You can change the tensor parallel dimension for better performance. For NLP models, the following settings usually result in the best performance:

  • The tensor parallel dimension of the QKV GEMMs is M.

  • The attention projection GEMM is K.

  • FFN0 is M.

  • FFN1 is K.

Here is an example of part of a Llama fusion rule (one node name is a placeholder, as marked in the comment):

Operator fusion rule example

    priority: 4
        name: SAFE_GEMM
        tp_dim: [M|K]
        pre_ln_cast:            # placeholder node name
            op_type: to
            child: ln_pow_9
            required: false
        ln_pow_9:
            op_type: pow
            child: ln_mean_10
        ln_mean_10:
            op_type: mean
            child: ln_add_11
        ln_add_11:
            op_type: add
            child: ln_rsqrt_12
        ln_rsqrt_12:
            op_type: rsqrt
            child: ln_mul_13
        ln_mul_13:
            op_type: mul
            child: ln_mul_15
        ln_mul_15:
            op_type: mul
            child:
              - q_cast
              - k_cast
              - v_cast
        linear_q_16:
            op_type: linear
        linear_k_17:
            op_type: linear
        linear_v_18:
            op_type: linear
        q_cast:
            op_type: type_
            child: linear_q_16
            required: false
        k_cast:
            op_type: type_
            child: linear_k_17
            required: false
        v_cast:
            op_type: type_
            child: linear_v_18
            required: false

    priority: 3
        name: SAFE_GEMM
        tp_dim: K
        attn_output_cast:
            op_type: type_
            required: false
            child: linear_48
        linear_48:
            op_type: linear
            child: post_linear_cast
        post_linear_cast:
            op_type: type_
            required: false
            child: add_49
        add_49:
            op_type: add
            child: pre_rmsnorm_cast
        pre_rmsnorm_cast:
            op_type: to
            required: false
            child:
              - ln_pow_50
              - ln_mul_54
        ln_pow_50:
            op_type: pow
            child: ln_mean_51
        ln_mean_51:
            op_type: mean
            child: ln_add_52
        ln_add_52:
            op_type: add
            child: ln_rsqrt_53
        ln_rsqrt_53:
            op_type: rsqrt
            child: ln_mul_54
        ln_mul_54:
            op_type: mul
            child: post_rmsnorm_cast
        post_rmsnorm_cast:
            op_type: to
            required: false
            child: ln_mul_56
        ln_mul_56:
            op_type: mul

Sizing guidelines

The shape of your model determines:

  • Whether it’s a good idea to compile your model with tensor parallel turned on.

  • Which mode (batch mode or weight mode) gives the best results.

Here are some general guidelines:

  • For large models, more RDUs ensure that the required memory does not exceed what is available. In addition, the compiler can derive a better solution for large models with the flexibility of more RDUs.

  • For small models, allocating more RDUs than required can result in low utilization with much of the RDU capacity being idle. In addition, RDU communication overhead can exceed the gains from the increased compute power of additional RDUs.

The following guidelines help you make the right decision.

These guidelines are for SN-10 2PDU (aka SN20) and SN30 systems. We’ll add guidelines for SN40L at a later date.

Model-specific size guidance

You must set the tensor parallel size correctly for each model. The following table shows suggested values:

Model size      Number of RDUs, mode (SN20)     Number of RDUs, mode (SN30)

< 20B                                           1, batch mode

20B - 40B       2, weight mode                  1, weight mode

40B - 80B       4, weight mode                  2, weight mode

80B - 176B      8, weight mode                  4 or 8, weight mode
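
If you script your compile jobs, you can encode this table as a simple lookup. The helper below is hypothetical (it is not a SambaFlow API); the values come straight from the table, and the size boundaries are treated as inclusive upper bounds.

    # Hypothetical lookup that mirrors the table above; not a SambaFlow API.
    TP_SUGGESTIONS = [
        # (upper bound in billions of parameters, SN20 suggestion, SN30 suggestion)
        (20,  None,             "1, batch mode"),      # SN20 value not given in the table
        (40,  "2, weight mode", "1, weight mode"),
        (80,  "4, weight mode", "2, weight mode"),
        (176, "8, weight mode", "4 or 8, weight mode"),
    ]

    def suggested_tp(model_params_billions: float, system: str = "SN30"):
        """Return the suggested number of RDUs and mode for a given model size."""
        for upper_bound, sn20, sn30 in TP_SUGGESTIONS:
            if model_params_billions <= upper_bound:
                return sn20 if system == "SN20" else sn30
        raise ValueError("model size exceeds the guidance in the table")

    print(suggested_tp(13, system="SN30"))   # 1, batch mode
    print(suggested_tp(70, system="SN20"))   # 4, weight mode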

Guidelines for other model parameters

Hidden dimension size

Model size is related to both the hidden dimension size and the number of layers. In module tests where only the hidden dimension size is known, we suggest trying the following tensor parallel sizes for different hidden dimension sizes.

Hidden dimension size     SN20                      SN30

<= 4096                   1 or 2, weight mode       1, batch mode or weight mode

4096 - 8192               2 or 4, weight mode       1 or 2, weight mode

>= 8192                   4 or 8, weight mode       2 or 4, weight mode

Number of attention heads

The number of attention heads does not affect the tensor parallel size. However, ensure that the number of attention heads is divisible by the tensor parallel size.
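
For example, a quick sanity check before you compile (the variable names here are illustrative, not SambaFlow arguments):

    num_attention_heads = 32   # from your model configuration
    tp_size = 8                # tensor parallel size (number of RDUs)

    # The number of attention heads must be divisible by the tensor parallel size.
    assert num_attention_heads % tp_size == 0, (
        f"{num_attention_heads} heads is not divisible by tensor parallel size {tp_size}"
    )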

Sequence Length

The suggestions in the tables above assume a regular sequence length of no more than 4K. For long sequence lengths (>= 32K), if you hit the memory limit:

  1. First reduce the batch size.

  2. If compilation still hits the memory limit for batch size 1, increase the tensor parallel size.

Use flash attention for sequence lengths of 8K or more.


The suggestions above are based on BF16 precision. If you’re using FP32 or mixed precision and you hit the memory limit, increase the number of RDUs.

Cached Inference Multigraph

For cached inference, the utilization bottleneck is weight loading rather than computation. SambaNova recommends using more RDUs with tensor parallel weight mode for cached inference multigraph. For example, for Llama 7B or 13B, you could use weight mode with 8 RDUs (SN30) or 4 RDUs (SN20).