SambaStack deploys models using bundles—packaged groups of one or more models with their deployment configurations, including batch sizes, sequence lengths, and precision settings. For example, deploying Llama-3.3-70B with a batch size of 4 and a sequence length of 16k represents one configuration. A single bundle can contain multiple configurations across different models. SambaNova’s RDU architecture supports loading multiple models and configurations in a single deployment, enabling instant switching between them without reloading weights. This approach increases efficiency, flexibility, and throughput compared to traditional GPU deployments that load a single static model.
A configuration defines the runtime settings for a model deployment: batch size, sequence length, and precision. A single deployment can include multiple configurations, enabling instant switching between setups for optimized performance.

Concepts

Bundle templates

A bundle template defines which models and configurations can be deployed together on a single node. Each bundle template contains one or more model templates, which specify the supported sequence lengths and batch sizes. Example model templates:
Model Template               Supported Sequence Lengths   Supported Batch Sizes
DeepSeek-R1-0528-Template    4k, 8k                       1, 2, 4, 8
DeepSeek-V3-0324-Template    4k, 8k                       1, 2, 4, 8
Example bundle template: deepseek-r1-v3-fp8-32k-Template combines both DeepSeek-R1-0528-Template and DeepSeek-V3-0324-Template into a single deployable package.

Bundles

A bundle associates trained checkpoints (model weights) with the model templates defined in a bundle template. For example, the bundle deepseek-r1-v3-fp8-32k links the R1-0528 and V3-0324 checkpoints to their corresponding model templates within deepseek-r1-v3-fp8-32k-Template.

Key terminology

Term            Definition
Bundle          A deployable package combining models, checkpoints, and configurations
Expert          A model instance bound to a specific sequence length (e.g., DeepSeek-R1-Distill-Llama-70B-4k)
Expert config   An expert with a specific batch size (e.g., DeepSeek-R1-Distill-Llama-70B-4k-BS2)
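The relationship between these terms is a simple cross product: each supported sequence length of a model template yields an expert, and each (sequence length, batch size) pair yields an expert config. A minimal Python illustration (the "<model>-<seq_len>-BS<batch>" naming is inferred from the examples above, not an official format):

```python
from itertools import product

def expand_expert_configs(model: str, seq_lens: list[str], batch_sizes: list[int]) -> list[str]:
    """Enumerate expert configs as (sequence length, batch size) pairs.

    The "<model>-<seq_len>-BS<batch>" naming mirrors the examples in the
    terminology table and is illustrative only.
    """
    return [f"{model}-{s}-BS{b}" for s, b in product(seq_lens, batch_sizes)]

# DeepSeek-R1-0528-Template supports 4k/8k and batch sizes 1, 2, 4, 8,
# so it expands to 2 x 4 = 8 expert configs.
configs = expand_expert_configs("DeepSeek-R1-0528", ["4k", "8k"], [1, 2, 4, 8])
```

Because the RDU architecture keeps all of these configs loaded at once, each entry in this list is instantly switchable at serving time.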

Configure bundles

The bundles section in your sambastack.yaml configuration file has two subsections:
Subsection              Purpose
bundleSpecs             Declares which bundles are available for deployment. This registers bundles with the system but does not deploy them.
bundleDeploymentSpecs   Defines how bundles are deployed across the cluster, including replica counts and QoS levels.
For custom bundles, see Custom Bundle Deployment. Example configuration:
bundles:
  bundleSpecs:
    - name: llama-4-medium

  bundleDeploymentSpecs:
    - name: llama-4-medium
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"

Switch bundles

To switch to a different bundle, update both bundleSpecs and bundleDeploymentSpecs with the new bundle name. See the Models page for available bundles. To request new bundle templates, contact SambaNova support. Example:
bundles:
  bundleSpecs:
    - name: qwen3-32b-whisper

  bundleDeploymentSpecs:
    - name: qwen3-32b-whisper
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"
Apply the changes:
kubectl apply -f sambastack.yaml
A successful update returns:
configmap/sambastack configured
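Because a bundle must be listed in both subsections, a name appearing in one but not the other is an easy mistake when switching. A quick sanity check over the parsed configuration (a sketch on plain Python dicts; it assumes you have already loaded sambastack.yaml with a YAML parser):

```python
def find_bundle_mismatches(config: dict) -> list[str]:
    """Report bundle names that appear in only one of the two subsections."""
    bundles = config.get("bundles", {})
    declared = {b["name"] for b in bundles.get("bundleSpecs", [])}
    deployed = {b["name"] for b in bundles.get("bundleDeploymentSpecs", [])}
    problems = []
    for name in sorted(deployed - declared):
        problems.append(f"{name}: deployed but not declared in bundleSpecs")
    for name in sorted(declared - deployed):
        problems.append(f"{name}: declared but missing from bundleDeploymentSpecs")
    return problems

# Mirrors the example above: both subsections name qwen3-32b-whisper.
config = {
    "bundles": {
        "bundleSpecs": [{"name": "qwen3-32b-whisper"}],
        "bundleDeploymentSpecs": [{"name": "qwen3-32b-whisper"}],
    }
}
```

An empty result means the two subsections agree; running a check like this before `kubectl apply` catches the mismatch earlier than a failed deployment would.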

Deploy multiple bundles

To deploy multiple bundles simultaneously, list each bundle in both bundleSpecs and bundleDeploymentSpecs:
bundles:
  bundleSpecs:
    - name: llama-4-medium
    - name: qwen3-32b-whisper

  bundleDeploymentSpecs:
    - name: llama-4-medium
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"

    - name: qwen3-32b-whisper
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"
SambaStack supports only one bundle per node. When deploying multiple bundles, assign each bundle to separate nodes to avoid resource conflicts.

Verify deployment status

Check that pods reflect the updated bundle configuration:
kubectl get pods

Deploy custom checkpoints

Custom checkpoints use the same deployment process as SambaNova-provided checkpoints but require conversion first. Prerequisites:
  • A converted checkpoint (any team member with appropriate access can perform the conversion)
  • The Google Cloud Storage path to the converted checkpoint
For conversion instructions, see the Custom Checkpoint Deployment Guide.

Create a bundle configuration

Create a bundle configuration that references your custom checkpoint’s storage location:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://your-bucket/path/to/converted/checkpoint
Replace LLAMA3D2_1B_CKPT with the checkpoint key expected by your bundle template, and update the source path to your checkpoint location.
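A malformed source path is a common cause of failed checkpoint loads, so it can be worth confirming that every entry is a well-formed gs:// URI before applying the manifest. A minimal stdlib sketch (the checkpoint keys and paths are placeholders, as in the example above):

```python
from urllib.parse import urlparse

def invalid_checkpoint_sources(checkpoints: dict) -> list[str]:
    """Return checkpoint keys whose source is not a gs://bucket/path URI."""
    bad = []
    for key, spec in checkpoints.items():
        parsed = urlparse(spec.get("source", ""))
        # Require the gs scheme, a bucket, and a non-empty object path.
        if parsed.scheme != "gs" or not parsed.netloc or not parsed.path.strip("/"):
            bad.append(key)
    return bad
```

This only checks the URI shape, not whether the object exists or is readable by the cluster's service account.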

Deploy with speculative decoding

Speculative decoding uses a smaller draft model to predict tokens that a larger target model then verifies, potentially improving throughput.

Requirements

  • The draft checkpoint must be compatible with your target checkpoint
  • For custom models, use a fine-tuned draft checkpoint that matches your target model’s domain
Using speculative decoding without a properly tuned draft checkpoint can degrade performance. If you don’t have a compatible draft checkpoint, use a bundle template without speculative decoding.

Validate compatibility

Use the SN Conversion Library to validate draft-target checkpoint compatibility before deployment. See the Speculative Decoding guide for details.

Configure draft and target checkpoints

Specify both checkpoints in your bundle configuration:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://your-bucket/path/to/draft/checkpoint
    LLAMA3_70B_3_3_CKPT:
      source: gs://your-bucket/path/to/target/checkpoint

Performance considerations

Speculative decoding does not affect output accuracy—the target model always makes the final token decisions. However, performance gains depend on the acceptance rate (how often the target model accepts the draft model’s predictions). A well-matched draft checkpoint typically yields higher acceptance rates and better throughput.
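The link between acceptance rate and throughput can be made concrete with the standard speculative-sampling estimate: if the draft model proposes γ tokens per step and each is accepted independently with probability α, the target model emits (1 − α^(γ+1)) / (1 − α) tokens per forward pass on average. A sketch of that estimate (an idealized model with i.i.d. acceptance, not a SambaStack metric):

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass under speculative
    decoding with draft length gamma and i.i.d. acceptance probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

At α = 0 this degenerates to one token per pass (no benefit over plain decoding), while α = 0.8 with γ = 4 yields roughly 3.4 tokens per pass, which is why a well-matched draft checkpoint matters.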