SambaStack supports two deployment configurations for supported models: high-interactivity and high-throughput. Both use the same model weights and the same API — they differ only in how the system handles requests. Use high-interactivity for low-latency, user-facing applications. Use high-throughput for batch or high-concurrency workloads where aggregate output matters more than per-user latency. The rest of this page covers the trade-offs, the PEF configurations for each, and how to build a bundle.
High-throughput and high-interactivity configurations require dedicated systems. Models deployed in either configuration cannot be bundled with other models. If you are unfamiliar with bundles or bundle templates, see Deploying model bundles.
Deployment configurations
| Configuration | Profile | Best when |
|---|
| High-throughput | Aggregate token throughput across concurrent requests | Batch processing, asynchronous workloads, or large user volumes where total system output matters more than per-user latency |
| High-interactivity | Per-request latency and time-to-first-token | Real-time, user-facing applications |
Both configurations use the same model name in API calls. The same request works against either configuration:{ "model": "DeepSeek-R1", "messages": [...] }
The configuration controls request handling on the server side; no client-side changes are required.
When to use each configuration
High-throughput
Use the high-throughput configuration when:
- You are serving many concurrent users and aggregate throughput matters more than per-user latency
- Your workload is asynchronous or batch-oriented (for example, document processing or offline inference pipelines)
- End-to-end latency per request is not a constraint
High-interactivity
Use the high-interactivity configuration when:
- You are building real-time, user-facing applications
- Per-user time-to-first-token and tokens-per-second are the primary metrics
- Your deployment has fewer nodes, or your users have tight latency budgets
Supported models
Both configurations are available for the following models:
- DeepSeek-R1
- DeepSeek-V3-0324
- DeepSeek-V3.1
- DeepSeek-V3.1-Terminus
- DeepSeek-V3.2
Architecture
The high-throughput configuration uses continuous batching, separating the prefill and decode phases into a dedicated pipeline. Two modes are available:
- Aggregated (ACB): Prefill and decode run collocated on the same nodes.
- Disaggregated (DCB): Prefill and decode run on separate dedicated nodes, so each phase can be sized independently. The recommended node split is more prefill nodes than decode nodes — for example, three prefill nodes and one decode node.
DCB has not yet been internally validated on SambaStack. Use ACB for SambaStack deployments until DCB validation is published.
- Prefill nodes process the input prompt
- Decode nodes generate output tokens
The high-throughput configuration requires a minimum of 4 nodes in disaggregated mode. For single-node or small deployments, use the high-interactivity configuration instead.
Requirements and limitations
| Constraint | Details |
|---|
| Minimum nodes (high-throughput) | 4 nodes in a multinode setup |
| Dedicated systems | High-throughput and high-interactivity configurations require dedicated systems — bundling with other models is not supported |
| Node configuration (disaggregated mode) | More prefill nodes than decode nodes required (for example, 3 prefill : 1 decode) |
| Checkpoint version | Use the latest checkpoint version listed in the Model CR — older versions are not compatible with high-throughput PEFs |
PEF configurations
Use the following PEF CR identifiers when building your bundles. See Custom Bundle Deployment for the full bundle-building procedure.
Custom Resource (CR): A Kubernetes extension object. Model CRs and PEF CRs define model and PEF configurations in the cluster.
When referencing a PEF CR in a BundleTemplate, append the version number: for example, deepseek-ss8192-bs1:1. Use version 1 unless kubectl describe pef <pef-name> shows a higher stable version is available.
High-throughput PEFs
| PEF CR | Sequence length | Batch size |
|---|
deepseek-ss32768-bs1-cb2-64 | 32768 (32K) | 64 |
deepseek-ss16384-bs1-cb2-128 | 16384 (16K) | 128 |
deepseek-ss8192-bs1-cb2-256 | 8192 (8K) | 256 |
Picking a high-throughput PEF:
- Choose the sequence length (
ss) that fits your longest prompt plus expected output tokens.
- Higher batch sizes serve more concurrent decode requests per node but require more RDU memory. The table lists the supported combinations.
High-interactivity PEFs
| PEF CR | Sequence length | Batch size |
|---|
deepseek-ss4096-bs1 | 4096 (4K) | 1 |
deepseek-ss4096-bs4 | 4096 (4K) | 4 |
deepseek-ss8192-bs1 | 8192 (8K) | 1 |
deepseek-ss8192-bs4 | 8192 (8K) | 4 |
deepseek-ss16384-bs1 | 16384 (16K) | 1 |
deepseek-ss32768-bs1 | 32768 (32K) | 1 |
deepseek-ss131072-bs1 | 131072 (128K) | 1 |
Picking a high-interactivity PEF:
- Match the sequence length to your prompt plus expected output budget.
bs1 minimizes per-user latency. bs4 trades a small latency increase for higher per-node throughput when you have multiple concurrent users.
Build a bundle
No prebuilt bundles ship for these configurations — you create a custom bundle using the PEF CRs listed above. Follow the Custom Bundle Deployment guide and reference the relevant PEF CR when defining your BundleTemplate.
When using a high-throughput PEF in your BundleTemplate, set continuous_batching: true in the expert definition:
DeepSeek-V3-0324:
experts:
8k:
configs:
- continuous_batching: true
pef: deepseek-ss8192-bs1-cb2-256:1
Then configure your BundleDeployment for the appropriate mode.
Aggregated mode (ACB):
groups:
- name: default
continuous_batching:
mode: aggregate
minReplicas: 1
qosList:
- web
- free
Disaggregated mode (DCB):
groups:
- name: default
continuous_batching:
use_mpi: true
prefill:
minReplicas: 3
decode:
minReplicas: 1
minReplicas: 1
qosList:
- web
- free
Verify your deployment
After deploying the bundle, confirm the configuration is active:
kubectl describe bundledeployment <bundle-deployment-name>
In the output, look for continuous_batching.mode set to aggregate (ACB) or disaggregate (DCB), and confirm the replica counts under prefill and decode match what you configured.
Switch between configurations
To switch between high-throughput and high-interactivity, redeploy the bundle with the appropriate PEF and BundleTemplate settings. When switching to high-interactivity, remove continuous_batching: true from the expert and remove the continuous_batching block from the BundleDeployment. The model name in API calls does not change.
Monitor your deployment
The SambaStack logging system emits per-request metrics relevant to these deployments:
| Metric | Log key | What it tells you |
|---|
| Decode queue time | decode_queue_time | Time spent waiting in continuous batching queues — high values indicate decode saturation |
| Time to first token | time_to_first_token | Prefill latency per request — key indicator for high-interactivity deployments |
| Completion tokens/sec | completion_tokens_per_sec | Aggregate throughput — key indicator for high-throughput deployments |
See Logs for the full list of available metrics and example queries.