SambaStack deploys models using bundles: packaged groups of one or more models together with their deployment configurations.
A configuration defines the runtime settings for a model deployment: batch size, sequence length, and precision. For example, deploying Llama-3.3-70B with a batch size of 4 and a sequence length of 16k is one configuration. A single bundle can contain multiple configurations across different models.
SambaNova’s RDU architecture supports loading multiple models and configurations in a single deployment and switching between them without reloading weights. Compared to traditional GPU deployments that load a single static model, this increases efficiency, flexibility, and throughput.
Concepts
Bundle templates
A bundle template defines which models and configurations can be deployed together on a single node. Each bundle template contains one or more model templates, which specify the supported sequence lengths and batch sizes.
Example model templates:
| Model Template | Supported Sequence Lengths | Supported Batch Sizes |
|---|---|---|
| DeepSeek-R1-0528-Template | 4k, 8k | 1, 2, 4, 8 |
| DeepSeek-V3-0324-Template | 4k, 8k | 1, 2, 4, 8 |
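As an illustration only, the two example model templates can be encoded as plain Python data, along with a check for whether a requested sequence length and batch size pair is supported. `MODEL_TEMPLATES` and `is_supported` are hypothetical names for this sketch, not part of SambaStack, and sequence lengths are kept as the labels used in the table.

```python
# Hypothetical encoding of the example model templates (not a SambaStack API).
MODEL_TEMPLATES = {
    "DeepSeek-R1-0528-Template": {"seq_lengths": {"4k", "8k"}, "batch_sizes": {1, 2, 4, 8}},
    "DeepSeek-V3-0324-Template": {"seq_lengths": {"4k", "8k"}, "batch_sizes": {1, 2, 4, 8}},
}


def is_supported(template: str, seq_length: str, batch_size: int) -> bool:
    """True if the template lists both the sequence length and the batch size."""
    spec = MODEL_TEMPLATES.get(template)
    if spec is None:
        return False  # unknown template: no configuration can be deployed from it
    return seq_length in spec["seq_lengths"] and batch_size in spec["batch_sizes"]


print(is_supported("DeepSeek-R1-0528-Template", "8k", 4))   # True
print(is_supported("DeepSeek-R1-0528-Template", "16k", 4))  # False: 16k is not listed
```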
Example bundle template:
deepseek-r1-v3-fp8-32k-Template combines both DeepSeek-R1-0528-Template and DeepSeek-V3-0324-Template into a single deployable package.
Bundles
A bundle associates trained checkpoints (model weights) with the model templates defined in a bundle template.
For example, the bundle deepseek-r1-v3-fp8-32k links the R1-0528 and V3-0324 checkpoints to their corresponding model templates within deepseek-r1-v3-fp8-32k-Template.
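A minimal sketch of this relationship, assuming a bundle can be modeled as a mapping from model template names to checkpoints: it reports any model template in the bundle template that has no checkpoint linked. The `BundleTemplate` and `Bundle` classes and the checkpoint strings are hypothetical placeholders, not SambaStack objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BundleTemplate:
    name: str
    model_templates: List[str]


@dataclass
class Bundle:
    name: str
    template: BundleTemplate
    # Model template name -> checkpoint that supplies its weights (placeholder strings here).
    checkpoints: Dict[str, str] = field(default_factory=dict)

    def missing_checkpoints(self) -> List[str]:
        """Model templates in the bundle template with no checkpoint linked."""
        return [t for t in self.template.model_templates if t not in self.checkpoints]


template = BundleTemplate(
    "deepseek-r1-v3-fp8-32k-Template",
    ["DeepSeek-R1-0528-Template", "DeepSeek-V3-0324-Template"],
)
bundle = Bundle(
    "deepseek-r1-v3-fp8-32k",
    template,
    {
        "DeepSeek-R1-0528-Template": "<R1-0528 checkpoint>",
        "DeepSeek-V3-0324-Template": "<V3-0324 checkpoint>",
    },
)
print(bundle.missing_checkpoints())  # []
```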
Key terminology
| Term | Definition |
|---|---|
| Bundle | A deployable package combining models, checkpoints, and configurations |
| Expert | A model instance bound to a specific sequence length (e.g., DeepSeek-R1-Distill-Llama-70B-4K) |
| Expert config | An expert with a specific batch size (e.g., DeepSeek-R1-Distill-Llama-70B-4k-BS2) |
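The three terms nest: each supported sequence length of a model yields one expert, and each supported batch size of that expert yields one expert config. The sketch below enumerates them; the `<model>-<seq>` and `<model>-<seq>-BS<n>` naming is an assumption modeled on the examples in the table (written in lowercase `k` here), and the 8k length and batch size 1 are illustrative values, not a statement about what this model supports.

```python
from typing import Dict, List


def expert_configs(model: str, seq_lengths: List[str], batch_sizes: List[int]) -> Dict[str, List[str]]:
    """Map each expert (model + sequence length) to its expert configs (+ batch size).

    The "<model>-<seq>" and "<model>-<seq>-BS<n>" naming is assumed from the table examples.
    """
    result: Dict[str, List[str]] = {}
    for seq in seq_lengths:
        expert = f"{model}-{seq}"
        result[expert] = [f"{expert}-BS{bs}" for bs in batch_sizes]
    return result


configs = expert_configs("DeepSeek-R1-Distill-Llama-70B", ["4k", "8k"], [1, 2])
for expert, names in configs.items():
    print(expert, "->", names)
```

Two sequence lengths and two batch sizes give two experts and four expert configs.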
The bundles section in your sambastack.yaml configuration file has two subsections:
| Subsection | Purpose |
|---|---|
| bundleSpecs | Declares which bundles are available for deployment. This registers bundles with the system but does not deploy them. |
| bundleDeploymentSpecs | Defines how bundles are deployed across the cluster, including replica counts and QoS levels. |
For custom bundles, see Custom Bundle Deployment.
Example configuration:
bundles:
  bundleSpecs:
    - name: llama-4-medium
  bundleDeploymentSpecs:
    - name: llama-4-medium
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"
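Since bundleSpecs registers bundles and bundleDeploymentSpecs deploys them, one sanity check you might run before applying a change is that every deployed bundle is also declared. The sketch below assumes the configuration has already been parsed into Python dicts (for example with PyYAML's `yaml.safe_load`); `check_bundles` is a hypothetical helper, not a SambaStack tool.

```python
from typing import List


def check_bundles(config: dict) -> List[str]:
    """Report bundles deployed in bundleDeploymentSpecs but not declared in bundleSpecs."""
    bundles = config.get("bundles", {})
    declared = {spec["name"] for spec in bundles.get("bundleSpecs", [])}
    return [
        f"{dep['name']}: deployed in bundleDeploymentSpecs but not declared in bundleSpecs"
        for dep in bundles.get("bundleDeploymentSpecs", [])
        if dep["name"] not in declared
    ]


# The example configuration above, as yaml.safe_load would return it.
config = {
    "bundles": {
        "bundleSpecs": [{"name": "llama-4-medium"}],
        "bundleDeploymentSpecs": [
            {
                "name": "llama-4-medium",
                "groups": [{"name": "default", "minReplicas": 1, "qosList": ["web", "free"]}],
            }
        ],
    }
}
print(check_bundles(config))  # []
```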
Switch bundles
To switch to a different bundle, update both bundleSpecs and bundleDeploymentSpecs with the new bundle name.
See the Models page for available bundles. To request new bundle templates, contact SambaNova support.
Example:
bundles:
  bundleSpecs:
    - name: qwen3-32b-whisper
  bundleDeploymentSpecs:
    - name: qwen3-32b-whisper
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"
Apply the changes:
kubectl apply -f sambastack.yaml
A successful update returns:
configmap/sambastack configured
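If you edit sambastack.yaml with a script, the key point of a bundle switch is that the name changes in both subsections. A minimal sketch, assuming the `bundles` section has been parsed into Python dicts; `switch_bundle` is a hypothetical helper, and it keeps the existing groups, replica counts, and QoS levels unchanged.

```python
import copy


def switch_bundle(config: dict, old: str, new: str) -> dict:
    """Return a copy of config with bundle `old` renamed to `new` in both subsections."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    for section in ("bundleSpecs", "bundleDeploymentSpecs"):
        for entry in updated["bundles"].get(section, []):
            if entry["name"] == old:
                entry["name"] = new  # groups, replicas, and QoS settings are kept as-is
    return updated


config = {
    "bundles": {
        "bundleSpecs": [{"name": "llama-4-medium"}],
        "bundleDeploymentSpecs": [
            {
                "name": "llama-4-medium",
                "groups": [{"name": "default", "minReplicas": 1, "qosList": ["web", "free"]}],
            }
        ],
    }
}
switched = switch_bundle(config, "llama-4-medium", "qwen3-32b-whisper")
print(switched["bundles"]["bundleSpecs"])  # [{'name': 'qwen3-32b-whisper'}]
```

After writing the result back to sambastack.yaml, apply it with `kubectl apply -f sambastack.yaml` as before.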
Deploy multiple bundles
To deploy multiple bundles simultaneously, list each bundle in both bundleSpecs and bundleDeploymentSpecs:
bundles:
  bundleSpecs:
    - name: llama-4-medium
    - name: qwen3-32b-whisper
  bundleDeploymentSpecs:
    - name: llama-4-medium
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"
    - name: qwen3-32b-whisper
      groups:
        - name: "default"
          minReplicas: 1
          qosList:
            - "web"
            - "free"
SambaStack supports only one bundle per node. When deploying multiple bundles, assign each bundle to separate nodes to avoid resource conflicts.
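Because every bundle must appear in both subsections and bundles cannot share a node, a script that generates the section can also report a minimum node count. A minimal sketch: `bundles_section` and `min_nodes` are hypothetical helpers, every bundle gets the same single "default" group with minReplicas 1 as in the example above, and the node count is only a lower bound that ignores how many nodes each replica may need, which this page does not specify. The output is printed with `json.dumps` because YAML 1.2 is designed as a superset of JSON; PyYAML's `yaml.safe_dump` would produce block-style YAML instead.

```python
import json
from typing import List


def bundles_section(names: List[str], qos: List[str]) -> dict:
    """Declare and deploy every bundle in `names` with one "default" group each."""
    return {
        "bundles": {
            "bundleSpecs": [{"name": n} for n in names],
            "bundleDeploymentSpecs": [
                {"name": n, "groups": [{"name": "default", "minReplicas": 1, "qosList": list(qos)}]}
                for n in names
            ],
        }
    }


def min_nodes(section: dict) -> int:
    """Lower bound on nodes: bundles cannot share a node, so each deployed bundle needs its own."""
    return len({d["name"] for d in section["bundles"]["bundleDeploymentSpecs"]})


section = bundles_section(["llama-4-medium", "qwen3-32b-whisper"], ["web", "free"])
print(json.dumps(section, indent=2))
print("nodes needed (at least):", min_nodes(section))
```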
Verify deployment status
Check that the pods reflect the updated bundle configuration.
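One way to inspect pod status from a script is to parse the JSON output of `kubectl get pods -o json`, which lists pods under `items`, each with `metadata.name` and `status.phase`. This page does not specify the namespace or naming of SambaStack bundle pods, so the sketch below simply lists every pod with its phase; `pod_phases` and `fetch_pods` are hypothetical helpers, and `fetch_pods` assumes `kubectl` is installed and configured for your cluster.

```python
import json
import subprocess
from typing import Dict, Optional


def pod_phases(pods: dict) -> Dict[str, str]:
    """Map each pod name to status.phase, given parsed `kubectl get pods -o json` output."""
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in pods.get("items", [])
    }


def fetch_pods(namespace: Optional[str] = None) -> dict:
    """Run `kubectl get pods -o json` (kubectl must be on PATH and configured)."""
    cmd = ["kubectl", "get", "pods", "-o", "json"]
    if namespace:
        cmd += ["-n", namespace]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)
```

For example, `pod_phases(fetch_pods("<namespace>"))` returns every pod with its phase; pods serving the updated bundle should eventually report `Running`.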
Deploy custom checkpoints
Custom checkpoints use the same deployment process as SambaNova-provided checkpoints but require conversion first.
Prerequisites:
- A converted checkpoint (any team member with appropriate access can perform the conversion)
- The Google Cloud Storage path to the converted checkpoint
For conversion instructions, see the Custom Checkpoint Deployment Guide.
Create a bundle configuration
Create a bundle configuration that references your custom checkpoint’s storage location:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://your-bucket/path/to/converted/checkpoint
Replace LLAMA3D2_1B_CKPT with the checkpoint key expected by your bundle template, and update the source path to your checkpoint location.
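If you generate Bundle manifests from a script, a minimal sketch follows. It reproduces only the fields shown in the example above (any other fields a real Bundle resource accepts are not covered), and it rejects sources that are not `gs://` paths, reflecting the Google Cloud Storage prerequisite. `bundle_manifest` is a hypothetical helper; passing two checkpoint keys produces the layout used in the speculative decoding example.

```python
import json
from typing import Dict


def bundle_manifest(name: str, checkpoints: Dict[str, str]) -> dict:
    """Build a Bundle manifest dict mapping checkpoint keys to GCS sources."""
    for key, source in checkpoints.items():
        if not source.startswith("gs://"):
            raise ValueError(f"{key}: expected a gs:// path, got {source!r}")
    return {
        "apiVersion": "sambanova.ai/v1alpha1",
        "kind": "Bundle",
        "metadata": {"name": name},
        "spec": {"checkpoints": {key: {"source": src} for key, src in checkpoints.items()}},
    }


manifest = bundle_manifest(
    "70b-3dot3-ss-4-8-64-128k",
    {"LLAMA3D2_1B_CKPT": "gs://your-bucket/path/to/converted/checkpoint"},
)
print(json.dumps(manifest, indent=2))
```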
Deploy with speculative decoding
Speculative decoding uses a smaller draft model to predict tokens that a larger target model then verifies, potentially improving throughput.
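The draft-then-verify loop can be illustrated with a toy, greedy version of speculative decoding. The two "models" are hypothetical deterministic functions rather than real networks, verification runs one position at a time for readability (a real system scores all draft tokens in one target pass), and this is not how SambaStack implements the feature. The assertion shows that the output matches plain greedy decoding with the target model alone.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function of a toy model


def target_model(seq: List[int]) -> int:
    # Toy stand-in for the large model: any deterministic function of the prefix.
    return (sum(seq) * 7 + len(seq)) % 10


def draft_model(seq: List[int]) -> int:
    # Toy stand-in for the draft model: agrees with the target except when len(seq) % 3 == 0.
    guess = target_model(seq)
    return (guess + 1) % 10 if len(seq) % 3 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], float]:
    seq, limit = list(prompt), len(prompt) + n
    proposed = accepted = 0
    while len(seq) < limit:
        # 1. The draft model proposes k tokens, each conditioned on its own earlier guesses.
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(seq + guesses))
        # 2. The target checks the guesses in order and keeps the longest matching prefix.
        for g in guesses:
            expected = target(seq)
            proposed += 1
            if g != expected:
                seq.append(expected)  # first rejected guess is replaced by the target's token
                break
            accepted += 1
            seq.append(g)
            if len(seq) == limit:
                break
        else:
            seq.append(target(seq))  # all k guesses accepted: one extra token from the target
    return seq, (accepted / proposed if proposed else 0.0)


prompt = [1, 2, 3]
spec_out, rate = speculative_decode(target_model, draft_model, prompt, n=12)
assert spec_out == greedy_decode(target_model, prompt, n=12)  # identical to target-only output
print(f"acceptance rate: {rate:.2f}")  # acceptance rate: 0.67
```

This greedy variant reproduces the target's greedy output exactly; sampling-based variants instead use a probabilistic acceptance rule that preserves the target model's output distribution.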
Requirements
- The draft checkpoint must be compatible with your target checkpoint
- For custom models, use a fine-tuned draft checkpoint that matches your target model’s domain
Using speculative decoding without a properly tuned draft checkpoint can degrade performance. If you don’t have a compatible draft checkpoint, use a bundle template without speculative decoding.
Validate compatibility
Use the SN Conversion Library to validate draft-target checkpoint compatibility before deployment. See the Speculative Decoding guide for details.
Specify both checkpoints in your bundle configuration:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://your-bucket/path/to/draft/checkpoint
    LLAMA3_70B_3_3_CKPT:
      source: gs://your-bucket/path/to/target/checkpoint
Speculative decoding does not affect output accuracy, because the target model always makes the final token decisions. However, performance gains depend on the acceptance rate: how often the target model accepts the draft model’s predictions. A well-matched draft checkpoint typically yields a higher acceptance rate and better throughput.
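To see why the acceptance rate matters, consider a commonly used simplified model: each of the k draft tokens is accepted independently with the same probability alpha, and every verification step also emits one target token (the correction or a bonus token). The expected number of tokens per target verification step is then (1 - alpha**(k + 1)) / (1 - alpha). Real acceptance varies by position and context, so treat this as a rough sketch; `expected_tokens_per_step` is a hypothetical helper.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step, assuming each of the
    k draft tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus one target token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, k=4):.2f} tokens per step")
```

This counts tokens per verification step only; actual throughput also depends on the cost of running the draft model and on how the two models are scheduled, which this page does not describe.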