SambaStack Deployment Configurations - High-Throughput and High-Interactivity

SambaStack supports two deployment configurations for supported models: high-interactivity and high-throughput. Both use the same model weights and the same API — they differ only in how the system handles requests. Use high-interactivity for low-latency, user-facing applications. Use high-throughput for batch or high-concurrency workloads where aggregate output matters more than per-user latency. The rest of this page covers the trade-offs, the PEF configurations for each, and how to build a bundle.

High-throughput and high-interactivity configurations require dedicated systems. Models deployed in either configuration cannot be bundled with other models. If you are unfamiliar with bundles or bundle templates, see Deploying model bundles.

Deployment configurations

Configuration	Profile	Best when
High-throughput	Aggregate token throughput across concurrent requests	Batch processing, asynchronous workloads, or large user volumes where total system output matters more than per-user latency
High-interactivity	Per-request latency and time-to-first-token	Real-time, user-facing applications

Both configurations use the same model name in API calls. The same request works against either configuration:

{ "model": "DeepSeek-R1", "messages": [...] }

The configuration controls request handling on the server side; no client-side changes are required.

When to use each configuration

High-throughput

Use the high-throughput configuration when:

You are serving many concurrent users and aggregate throughput matters more than per-user latency
Your workload is asynchronous or batch-oriented (for example, document processing or offline inference pipelines)
End-to-end latency per request is not a constraint

High-interactivity

Use the high-interactivity configuration when:

You are building real-time, user-facing applications
Per-user time-to-first-token and tokens-per-second are the primary metrics
Your deployment has fewer nodes, or your users have tight latency budgets

Supported models

Both configurations are available for the following models:

DeepSeek-R1
DeepSeek-V3-0324
DeepSeek-V3.1
DeepSeek-V3.1-Terminus
DeepSeek-V3.2

Architecture

The high-throughput configuration uses continuous batching, separating the prefill and decode phases into a dedicated pipeline. Two modes are available:

Aggregated (ACB): Prefill and decode run collocated on the same nodes.
Disaggregated (DCB): Prefill and decode run on separate dedicated nodes, so each phase can be sized independently. The recommended node split is more prefill nodes than decode nodes — for example, three prefill nodes and one decode node.

DCB has not yet been internally validated on SambaStack. Use ACB for SambaStack deployments until DCB validation is published.

Prefill nodes process the input prompt
Decode nodes generate output tokens

The high-throughput configuration requires a minimum of 4 nodes in disaggregated mode. For single-node or small deployments, use the high-interactivity configuration instead.

Requirements and limitations

Constraint	Details
Minimum nodes (high-throughput)	4 nodes in a multinode setup
Dedicated systems	High-throughput and high-interactivity configurations require dedicated systems — bundling with other models is not supported
Node configuration (disaggregated mode)	More prefill nodes than decode nodes required (for example, 3 prefill : 1 decode)
Checkpoint version	Use the latest checkpoint version listed in the Model CR — older versions are not compatible with high-throughput PEFs

PEF configurations

Use the following PEF CR identifiers when building your bundles. See Custom Bundle Deployment for the full bundle-building procedure.

Custom Resource (CR): A Kubernetes extension object. Model CRs and PEF CRs define model and PEF configurations in the cluster.

When referencing a PEF CR in a BundleTemplate, append the version number: for example, deepseek-ss8192-bs1:1. Use version 1 unless kubectl describe pef <pef-name> shows a higher stable version is available.

High-throughput PEFs

PEF CR	Sequence length	Batch size
`deepseek-ss32768-bs1-cb2-64`	32768 (32K)	64
`deepseek-ss16384-bs1-cb2-128`	16384 (16K)	128
`deepseek-ss8192-bs1-cb2-256`	8192 (8K)	256

Picking a high-throughput PEF:

Choose the sequence length (ss) that fits your longest prompt plus expected output tokens.
Higher batch sizes serve more concurrent decode requests per node but require more RDU memory. The table lists the supported combinations.

High-interactivity PEFs

PEF CR	Sequence length	Batch size
`deepseek-ss4096-bs1`	4096 (4K)	1
`deepseek-ss4096-bs4`	4096 (4K)	4
`deepseek-ss8192-bs1`	8192 (8K)	1
`deepseek-ss8192-bs4`	8192 (8K)	4
`deepseek-ss16384-bs1`	16384 (16K)	1
`deepseek-ss32768-bs1`	32768 (32K)	1
`deepseek-ss131072-bs1`	131072 (128K)	1

Picking a high-interactivity PEF:

Match the sequence length to your prompt plus expected output budget.
bs1 minimizes per-user latency. bs4 trades a small latency increase for higher per-node throughput when you have multiple concurrent users.

Build a bundle

No prebuilt bundles ship for these configurations — you create a custom bundle using the PEF CRs listed above. Follow the Custom Bundle Deployment guide and reference the relevant PEF CR when defining your BundleTemplate. When using a high-throughput PEF in your BundleTemplate, set continuous_batching: true in the expert definition:

DeepSeek-V3-0324:
  experts:
    8k:
      configs:
      - continuous_batching: true
        pef: deepseek-ss8192-bs1-cb2-256:1

Then configure your BundleDeployment for the appropriate mode. Aggregated mode (ACB):

groups:
  - name: default
    continuous_batching:
      mode: aggregate
    minReplicas: 1
    qosList:
    - web
    - free

Disaggregated mode (DCB):

groups:
  - name: default
    continuous_batching:
      use_mpi: true
      prefill:
        minReplicas: 3
      decode:
        minReplicas: 1
    minReplicas: 1
    qosList:
    - web
    - free

Verify your deployment

After deploying the bundle, confirm the configuration is active:

kubectl describe bundledeployment <bundle-deployment-name>

In the output, look for continuous_batching.mode set to aggregate (ACB) or disaggregate (DCB), and confirm the replica counts under prefill and decode match what you configured.

Switch between configurations

To switch between high-throughput and high-interactivity, redeploy the bundle with the appropriate PEF and BundleTemplate settings. When switching to high-interactivity, remove continuous_batching: true from the expert and remove the continuous_batching block from the BundleDeployment. The model name in API calls does not change.

Monitor your deployment

The SambaStack logging system emits per-request metrics relevant to these deployments:

Metric	Log key	What it tells you
Decode queue time	`decode_queue_time`	Time spent waiting in continuous batching queues — high values indicate decode saturation
Time to first token	`time_to_first_token`	Prefill latency per request — key indicator for high-interactivity deployments
Completion tokens/sec	`completion_tokens_per_sec`	Aggregate throughput — key indicator for high-throughput deployments

See Logs for the full list of available metrics and example queries.

​Deployment configurations

​When to use each configuration

​High-throughput

​High-interactivity

​Supported models

​Architecture

​Requirements and limitations

​PEF configurations

​High-throughput PEFs

​High-interactivity PEFs

​Build a bundle

​Verify your deployment

​Switch between configurations

​Monitor your deployment

Deployment configurations

When to use each configuration

High-throughput

High-interactivity

Supported models

Architecture

Requirements and limitations

PEF configurations

High-throughput PEFs

High-interactivity PEFs

Build a bundle

Verify your deployment

Switch between configurations

Monitor your deployment