SambaStack Model Bundle Deployment

SambaStack serves models by deploying bundles: packaged groups of one or more models together with their deployment configurations, such as batch sizes and sequence lengths. Because several models are loaded at once, you can switch between them almost instantly without reloading weights. This improves efficiency, flexibility, and throughput compared to traditional GPU deployments that load a single static model. For example, a single bundle can contain both Llama-3.3-70B-Instruct and Llama-3.1-8B-Instruct.

Every configuration of a model in a bundle takes up space, so different bundles contain different sets of configurations. A copy of Llama-3.3-70B-Instruct with a batch size of 4 and a sequence length of 16K is one configuration. Choose the bundle that best fits your use case; see the supported models and bundles page for the current list.

A configuration defines the runtime settings for a model deployment, including batch size and sequence length. A single deployment can include multiple configurations, enabling instant switching between setups for optimized performance.

Concepts

Bundle templates

A bundle template defines which models and configurations can be deployed together on a single node. Each bundle template contains one or more model templates, which specify the supported sequence lengths and batch sizes. For a detailed example of how model templates are defined inside a BundleTemplate, see Custom Bundle Deployment → BundleTemplate Structure.

Example model templates:

Model Template	Supported Sequence Lengths	Supported Batch Sizes
`DeepSeek-R1-0528-Template`	4k, 8k	1, 2, 4, 8
`DeepSeek-V3-0324-Template`	4k, 8k	1, 2, 4, 8

Example bundle template: deepseek-r1-v3-fp8-32k-Template combines both DeepSeek-R1-0528-Template and DeepSeek-V3-0324-Template into a single deployable package. For a full YAML example of a BundleTemplate definition, see the Custom Bundle Deployment → BundleTemplate Structure section.

Bundles

A bundle associates trained checkpoints (model weights) with the model templates defined in a bundle template. For example, the bundle deepseek-r1-v3-fp8-32k links the R1-0528 and V3-0324 checkpoints to their corresponding model templates within deepseek-r1-v3-fp8-32k-Template.

Key terminology

Term	Definition
Bundle	A deployable package combining models, checkpoints, and configurations
Expert	A model instance bound to a specific sequence length (e.g., `DeepSeek-R1-Distill-Llama-70B-4K`)
Expert config	An expert with a specific batch size (e.g., `DeepSeek-R1-Distill-Llama-70B-4k-BS2`)
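
The relationship between experts and expert configs can be sketched in a few lines of Python. This is only an illustration: the function name expert_configs is hypothetical, the naming pattern `<model>-<sequence length>-BS<batch size>` is inferred from the single example in the table above, and the sequence lengths and batch sizes are borrowed from the example model templates rather than taken from a real template for this model.

```python
from itertools import product


def expert_configs(model: str, seq_lengths: list[str], batch_sizes: list[int]) -> list[str]:
    """Enumerate expert-config names for one model.

    Assumption: names follow <model>-<seq_len>-BS<batch_size>, as in the
    terminology table. Real names are defined by the bundle template.
    """
    return [
        f"{model}-{seq}-BS{bs}"
        for seq, bs in product(seq_lengths, batch_sizes)
    ]


# Illustrative values only: two sequence lengths and four batch sizes.
configs = expert_configs("DeepSeek-R1-Distill-Llama-70B", ["4k", "8k"], [1, 2, 4, 8])
print(len(configs))   # one expert config per (sequence length, batch size) pair
print(configs[1])
```

Each (sequence length, batch size) pair is a separate configuration that takes up space in the bundle, which is why two sequence lengths and four batch sizes already yield eight expert configs.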

Configure bundles

The bundles section in your sambastack.yaml configuration file has two subsections:

Subsection	Purpose
`bundleSpecs`	Declares which bundles are available for deployment. This registers bundles with the system but does not deploy them.
`bundleDeploymentSpecs`	Defines how bundles are deployed across the cluster, including replica counts and QoS levels.

For custom bundles, see Custom Bundle Deployment. Example configuration:

bundles:
  bundleSpecs:
    - name: gpt-oss-120b-8-32-64-128k

  bundleDeploymentSpecs:
    - name: gpt-oss-120b-8-32-64-128k
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"

See the SambaStack.yaml Reference for a full example.

Switch bundles

To switch to a different bundle, update both bundleSpecs and bundleDeploymentSpecs with the new bundle name. See the Supported Models and Bundles page for available bundles. To request new bundle templates, contact SambaNova support.

Example:

bundles:
  bundleSpecs:
    - name: qwen3-32b-whisper

  bundleDeploymentSpecs:
    - name: qwen3-32b-whisper
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"

See the SambaStack.yaml Reference for a full example.

Apply the changes:

kubectl apply -f sambastack.yaml

A successful update returns:

configmap/sambastack configured

Deploy multiple bundles

To deploy multiple bundles simultaneously, list each bundle in both bundleSpecs and bundleDeploymentSpecs:

bundles:
  bundleSpecs:
    - name: gpt-oss-120b-8-32-64-128k
    - name: qwen3-32b-whisper

  bundleDeploymentSpecs:
    - name: gpt-oss-120b-8-32-64-128k
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"

    - name: qwen3-32b-whisper
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"

See the SambaStack.yaml Reference for a full example.

SambaStack supports only one bundle per node. When deploying multiple bundles, assign each bundle to a separate node to avoid resource conflicts.
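
Because each bundle name has to appear in both subsections, a small pre-flight check can catch mismatches before you run kubectl apply. The following is a minimal sketch, not part of SambaStack: check_bundles is a hypothetical helper, it operates on a plain dictionary that mirrors the bundles section shown above (load it with whatever YAML parser you prefer), and it assumes that a deployment entry without a matching bundleSpecs entry is a mistake.

```python
def check_bundles(config: dict) -> dict:
    """Compare bundleSpecs and bundleDeploymentSpecs by bundle name.

    Hypothetical helper. `config` mirrors the structure of sambastack.yaml.
    """
    bundles = config.get("bundles") or {}
    declared = [b["name"] for b in bundles.get("bundleSpecs") or []]
    deployed = [d["name"] for d in bundles.get("bundleDeploymentSpecs") or []]
    return {
        # Assumed to be an error: deployed without being registered.
        "undeclared": [n for n in deployed if n not in declared],
        # Allowed (registration does not deploy), reported for information.
        "not_deployed": [n for n in declared if n not in deployed],
    }


group = {"name": "default", "minReplicas": 1, "qosList": ["web", "free"]}
config = {
    "bundles": {
        "bundleSpecs": [
            {"name": "gpt-oss-120b-8-32-64-128k"},
            {"name": "qwen3-32b-whisper"},
        ],
        "bundleDeploymentSpecs": [
            {"name": "gpt-oss-120b-8-32-64-128k", "groups": [group]},
            {"name": "qwen3-32b-whisper", "groups": [group]},
        ],
    }
}
print(check_bundles(config))
```

Both lists are empty for the configuration above. A bundle listed only under bundleSpecs is reported separately, since registering a bundle without deploying it is valid.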

Verify deployment status

Check that pods reflect the updated bundle configuration:

kubectl get pods

The legalizer validates the bundle’s resource requirements before deployment. To inspect its results, including resource utilization, run:

kubectl get bundle <bundle-name> -o yaml

The status.legalizerInfo section of the output reports the legalizer outcome. The three possible results look like this:

# Legalizer passed
status:
  legalizerInfo:
    errors: []
    status: Legalizer passed
    utilization:
      ddr: '0.75'
      hbm_resident: '0.65'
      host: '0.30'
    warnings: []

# Legalizer failed
status:
  legalizerInfo:
    errors:
    - 'PEF pef1 and pef2 are not checkpoint compatible (checkpoint #0)'
    status: Legalizer failed
    utilization:
      ddr: '0.75'
      hbm_resident: '0.65'
      host: '0.30'
    warnings: []

# Legalizer skipped (skip_legalizer: true set on a previously-legalized bundle)
status:
  legalizerInfo:
    errors: []
    status: Legalizer was skipped
    utilization:
      ddr: N/A
      hbm_resident: N/A
      host: N/A
    warnings: []

Field	Description
`status.legalizerInfo.status`	Human-readable legalizer result: `Legalizer passed`, `Legalizer failed`, or `Legalizer was skipped`; absent if legalizer output could not be processed
`status.legalizerInfo.errors`	List of validation errors from the legalizer
`status.legalizerInfo.warnings`	List of validation warnings from the legalizer
`status.legalizerInfo.utilization.ddr`	DDR memory utilization as a quoted string (e.g., `'0.75'` represents 75%)
`status.legalizerInfo.utilization.hbm_resident`	HBM resident memory utilization as a quoted string (e.g., `'0.65'` represents 65%; values above `'1.0'` indicate over-allocation)
`status.legalizerInfo.utilization.host`	Host memory utilization as a quoted string (e.g., `'0.30'` represents 30%)

The utilization fields show N/A when skip_legalizer: true is set on a bundle that has been legalized at least once. If the bundle has never been legalized (for example, skip_legalizer: true was set from initial deployment), legalizerInfo is absent entirely. The utilization field may also be absent if the legalizer output could not be parsed.
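
If you want to check these fields from a script, the cases above can be handled roughly as follows. This is a sketch under stated assumptions: summarize_legalizer is a hypothetical helper, and its input is the bundle object already parsed into a dictionary (for example from kubectl get bundle <bundle-name> -o json, using kubectl's standard JSON output). It only relies on the field names documented in the table above.

```python
def summarize_legalizer(bundle: dict) -> dict:
    """Summarize status.legalizerInfo for a parsed Bundle object.

    Hypothetical helper; handles the absent, skipped ('N/A'), and
    over-allocated cases described in this section.
    """
    info = (bundle.get("status") or {}).get("legalizerInfo")
    if info is None:
        # Bundle has never been legalized: legalizerInfo is absent entirely.
        return {"status": "not legalized", "utilization": {},
                "over_allocated": False, "errors": []}

    utilization = {}
    for key, raw in (info.get("utilization") or {}).items():
        try:
            utilization[key] = float(raw)   # quoted strings such as '0.75'
        except (TypeError, ValueError):
            utilization[key] = None         # 'N/A' when the legalizer was skipped

    hbm = utilization.get("hbm_resident")
    return {
        "status": info.get("status", "unknown"),
        "utilization": utilization,
        "over_allocated": hbm is not None and hbm > 1.0,
        "errors": list(info.get("errors") or []),
    }


passed = {"status": {"legalizerInfo": {
    "errors": [], "status": "Legalizer passed", "warnings": [],
    "utilization": {"ddr": "0.75", "hbm_resident": "0.65", "host": "0.30"},
}}}
print(summarize_legalizer(passed))
```

The helper treats a missing legalizerInfo section and a missing utilization field as normal outcomes rather than errors, matching the behavior described above.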

Deploy custom checkpoints

Custom checkpoints use the same deployment process as SambaNova-provided checkpoints but require conversion first. Prerequisites:

A converted checkpoint (any team member with appropriate access can perform the conversion)
The Google Cloud Storage path to the converted checkpoint

For conversion instructions, see the Custom Checkpoint Deployment Guide.

Create a bundle configuration

Create a bundle configuration that references your custom checkpoint’s storage location:

apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://your-bucket/path/to/converted/checkpoint

Replace LLAMA3D2_1B_CKPT with the checkpoint key expected by your bundle template, and update the source path to your checkpoint location. The checkpoint keys used in this configuration must match the keys defined in the corresponding BundleTemplate. For an example of how these keys are defined, see the Custom Bundle Deployment → BundleTemplate Structure.
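
A mismatch between the checkpoint keys in a Bundle and the keys in its BundleTemplate is easy to introduce by typo, so it can be worth comparing them before deployment. The sketch below is an illustration only: check_checkpoint_keys is a hypothetical helper, the bundle is a plain dictionary mirroring the YAML above, and the set of template keys is supplied by hand because the BundleTemplate structure is documented elsewhere. Whether every template key must be overridden is not stated here, so missing keys are reported for information rather than as errors.

```python
def check_checkpoint_keys(bundle: dict, template_keys: set[str]) -> dict:
    """Compare a Bundle's checkpoint keys with the keys its template defines.

    Hypothetical helper. `template_keys` is assumed to be copied from the
    corresponding BundleTemplate.
    """
    provided = set((bundle.get("spec") or {}).get("checkpoints") or {})
    return {
        "unknown": sorted(provided - template_keys),   # not defined by the template
        "missing": sorted(template_keys - provided),   # informational only
    }


bundle = {
    "apiVersion": "sambanova.ai/v1alpha1",
    "kind": "Bundle",
    "metadata": {"name": "70b-3dot3-ss-4-8-64-128k"},
    "spec": {"checkpoints": {
        "LLAMA3D2_1B_CKPT": {"source": "gs://your-bucket/path/to/converted/checkpoint"},
    }},
}
# Keys assumed for illustration; take the real ones from your BundleTemplate.
template_keys = {"LLAMA3D2_1B_CKPT", "LLAMA3_70B_3_3_CKPT"}
print(check_checkpoint_keys(bundle, template_keys))
```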

Deploy with speculative decoding

Speculative decoding uses a smaller draft model to predict tokens that a larger target model then verifies, potentially improving throughput.

Requirements

The draft checkpoint must be compatible with your target checkpoint
For custom models, use a fine-tuned draft checkpoint that matches your target model’s domain

Using speculative decoding without a properly tuned draft checkpoint can degrade performance. If you don’t have a compatible draft checkpoint, use a bundle template without speculative decoding.

Validate compatibility

Use the SN Conversion Library to validate draft-target checkpoint compatibility before deployment. See the Speculative Decoding guide for details.

Configure draft and target checkpoints

Specify both checkpoints in your bundle configuration:

apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://your-bucket/path/to/draft/checkpoint
    LLAMA3_70B_3_3_CKPT:
      source: gs://your-bucket/path/to/target/checkpoint

Performance considerations

Speculative decoding does not affect output accuracy—the target model always makes the final token decisions. However, performance gains depend on the acceptance rate (how often the target model accepts the draft model’s predictions). A well-matched draft checkpoint typically yields higher acceptance rates and better throughput.
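
To get a feel for how strongly the acceptance rate matters, consider a simplified textbook model of speculative decoding. It is not a description of how SambaStack implements or tunes the feature. Assume the draft model proposes a fixed number of tokens per step (draft_len, a hypothetical parameter not exposed in the configuration above) and that each proposed token is accepted independently with the same probability. The expected number of tokens produced per target-model verification step is then (1 - a^(k+1)) / (1 - a), where a is the acceptance rate and k is the draft length.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification step.

    Simplified model: each draft token is accepted independently with
    probability `acceptance_rate`; the target always contributes one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)   # every draft token accepted
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_step(rate, draft_len=4), 2))
```

Under these assumptions, an acceptance rate of 0.8 with four draft tokens gives about 3.36 tokens per step, while a rate of 0.5 gives fewer than two. This is why a poorly matched draft checkpoint can erase the benefit once the cost of running the draft model is counted.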
