SambaStack supports two deployment configurations for supported models: high-interactivity and high-throughput. Both use the same model weights and the same API — they differ only in how the system handles requests. Use high-interactivity for low-latency, user-facing applications. Use high-throughput for batch or high-concurrency workloads where aggregate output matters more than per-user latency. The rest of this page covers the trade-offs, the PEF configurations for each, and how to build a bundle.Documentation Index
Fetch the complete documentation index at: https://sambanova-systems.mintlify.dev/docs/llms.txt
Use this file to discover all available pages before exploring further.
High-throughput and high-interactivity configurations require dedicated systems. Models deployed in either configuration cannot be bundled with other models. If you are unfamiliar with bundles or bundle templates, see Deploying model bundles.
Deployment configurations
| Configuration | Profile | Best when |
|---|---|---|
| High-throughput | Aggregate token throughput across concurrent requests | Batch processing, asynchronous workloads, or large user volumes where total system output matters more than per-user latency |
| High-interactivity | Per-request latency and time-to-first-token | Real-time, user-facing applications |
Both configurations use the same model name in API calls. The same request works against either configuration:The configuration controls request handling on the server side; no client-side changes are required.
When to use each configuration
High-throughput
Use the high-throughput configuration when:- You are serving many concurrent users and aggregate throughput matters more than per-user latency
- Your workload is asynchronous or batch-oriented (for example, document processing or offline inference pipelines)
- End-to-end latency per request is not a constraint
High-interactivity
Use the high-interactivity configuration when:- You are building real-time, user-facing applications
- Per-user time-to-first-token and tokens-per-second are the primary metrics
- Your deployment has fewer nodes, or your users have tight latency budgets
Supported models
Both configurations are available for the following models:- DeepSeek-R1
- DeepSeek-V3-0324
- DeepSeek-V3.1
- DeepSeek-V3.1-Terminus
- DeepSeek-V3.2
Architecture
The high-throughput configuration uses continuous batching, separating the prefill and decode phases into a dedicated pipeline. Two modes are available:- Aggregated (ACB): Prefill and decode run collocated on the same nodes.
- Disaggregated (DCB): Prefill and decode run on separate dedicated nodes, so each phase can be sized independently. The recommended node split is more prefill nodes than decode nodes — for example, three prefill nodes and one decode node.
- Prefill nodes process the input prompt
- Decode nodes generate output tokens
Requirements and limitations
| Constraint | Details |
|---|---|
| Minimum nodes (high-throughput) | 4 nodes in a multinode setup |
| Dedicated systems | High-throughput and high-interactivity configurations require dedicated systems — bundling with other models is not supported |
| Node configuration (disaggregated mode) | More prefill nodes than decode nodes required (for example, 3 prefill : 1 decode) |
| Checkpoint version | Use the latest checkpoint version listed in the Model CR — older versions are not compatible with high-throughput PEFs |
PEF configurations
Use the following PEF CR identifiers when building your bundles. See Custom Bundle Deployment for the full bundle-building procedure.Custom Resource (CR): A Kubernetes extension object. Model CRs and PEF CRs define model and PEF configurations in the cluster.
When referencing a PEF CR in a BundleTemplate, append the version number: for example,
deepseek-ss8192-bs1:1. Use version 1 unless kubectl describe pef <pef-name> shows a higher stable version is available.High-throughput PEFs
| PEF CR | Sequence length | Batch size |
|---|---|---|
deepseek-ss32768-bs1-cb2-64 | 32768 (32K) | 64 |
deepseek-ss16384-bs1-cb2-128 | 16384 (16K) | 128 |
deepseek-ss8192-bs1-cb2-256 | 8192 (8K) | 256 |
- Choose the sequence length (
ss) that fits your longest prompt plus expected output tokens. - Higher batch sizes serve more concurrent decode requests per node but require more RDU memory. The table lists the supported combinations.
High-interactivity PEFs
| PEF CR | Sequence length | Batch size |
|---|---|---|
deepseek-ss4096-bs1 | 4096 (4K) | 1 |
deepseek-ss4096-bs4 | 4096 (4K) | 4 |
deepseek-ss8192-bs1 | 8192 (8K) | 1 |
deepseek-ss8192-bs4 | 8192 (8K) | 4 |
deepseek-ss16384-bs1 | 16384 (16K) | 1 |
deepseek-ss32768-bs1 | 32768 (32K) | 1 |
deepseek-ss131072-bs1 | 131072 (128K) | 1 |
- Match the sequence length to your prompt plus expected output budget.
bs1minimizes per-user latency.bs4trades a small latency increase for higher per-node throughput when you have multiple concurrent users.
Build a bundle
No prebuilt bundles ship for these configurations — you create a custom bundle using the PEF CRs listed above. Follow the Custom Bundle Deployment guide and reference the relevant PEF CR when defining your BundleTemplate. When using a high-throughput PEF in your BundleTemplate, setcontinuous_batching: true in the expert definition:
Verify your deployment
After deploying the bundle, confirm the configuration is active:continuous_batching.mode set to aggregate (ACB) or disaggregate (DCB), and confirm the replica counts under prefill and decode match what you configured.
Switch between configurations
To switch between high-throughput and high-interactivity, redeploy the bundle with the appropriate PEF and BundleTemplate settings. When switching to high-interactivity, removecontinuous_batching: true from the expert and remove the continuous_batching block from the BundleDeployment. The model name in API calls does not change.
Monitor your deployment
The SambaStack logging system emits per-request metrics relevant to these deployments:| Metric | Log key | What it tells you |
|---|---|---|
| Decode queue time | decode_queue_time | Time spent waiting in continuous batching queues — high values indicate decode saturation |
| Time to first token | time_to_first_token | Prefill latency per request — key indicator for high-interactivity deployments |
| Completion tokens/sec | completion_tokens_per_sec | Aggregate throughput — key indicator for high-throughput deployments |

