## Deployment options
When deploying models in SambaStack, administrators can select from various context length and batch size combinations:

- Smaller batch sizes provide higher per-request token throughput (tokens/second).
- Larger batch sizes provide better concurrency for multiple users.
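The tradeoff between the two settings can be sketched with a toy model. The numbers and the latency curve below are illustrative assumptions for demonstration only, not SambaStack benchmarks:

```python
# Illustrative sketch only: toy numbers, not SambaStack measurements.
# Models the usual batching tradeoff: a larger batch raises aggregate
# throughput (more users served at once) while lowering the
# tokens/second that each individual request observes.

def per_request_tps(batch_size: int, peak_tps: float = 200.0) -> float:
    """Hypothetical per-request decode speed; shrinks as the batch grows."""
    return peak_tps / (1 + 0.15 * (batch_size - 1))

def aggregate_tps(batch_size: int) -> float:
    """Total tokens/second summed across all requests in the batch."""
    return batch_size * per_request_tps(batch_size)

for bs in (1, 4, 16):
    print(f"batch={bs:2d}  per-request={per_request_tps(bs):6.1f} tok/s"
          f"  aggregate={aggregate_tps(bs):7.1f} tok/s")
```

Under this toy curve, batch size 1 maximizes the speed a single user sees, while batch size 16 serves far more total tokens per second across users.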
## Supported models
You can run the following command to discover available models in your cluster:

| Developer/Model ID | Type | Suggested Use | Context length (batch size) | Features and optimizations | View on Hugging Face |
|---|---|---|---|---|---|
| Meta | | | | | |
| Meta-Llama-3.3-70B-Instruct | Text | | | | Model card |
| Meta-Llama-3.1-8B-Instruct | Text | | | | Model card |
| Meta-Llama-3.1-405B-Instruct | Text | | | | Model card |
| Llama-4-Maverick-17B-128E-Instruct | Image, Text | | | | Model card |
| MiniMax | | | | | |
| MiniMax-M2.5 | Text | | | | Model card |
| DeepSeek | | | | | |
| DeepSeek-R1-0528 | Reasoning, Text | | | | Model card |
| DeepSeek-R1-Distill-Llama-70B | Reasoning, Text | | | | Model card |
| DeepSeek-V3-0324 | Text | | | | Model card |
| DeepSeek-V3.1 | Reasoning, Text | | | | Model card |
| OpenAI | | | | | |
| gpt-oss-120b | Text | | | | Model card |
| Whisper-Large-v3 | Audio | | | | Model card |
| Qwen | | | | | |
| Qwen3-32B | Reasoning, Text | | | | Model card |
| Tokyotech-llm | | | | | |
| Llama-3.3-Swallow-70B-Instruct-v0.4 | Text | | | | Model card |
| Other | | | | | |
| E5-Mistral-7B-Instruct | Embedding | | | | Model card |
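The model list can also be consumed programmatically. Assuming the cluster exposes an OpenAI-compatible `GET /v1/models` endpoint (an assumption; check your SambaStack API documentation), the response can be parsed as below. The sample response is hand-written in that format for illustration, not actual cluster output:

```python
import json

# Hypothetical example: assumes an OpenAI-compatible /v1/models endpoint.
# The response below is a hand-written sample in that format, not real
# SambaStack output.
sample_response = json.dumps({
    "object": "list",
    "data": [
        {"id": "Meta-Llama-3.3-70B-Instruct", "object": "model"},
        {"id": "DeepSeek-V3.1", "object": "model"},
        {"id": "E5-Mistral-7B-Instruct", "object": "model"},
    ],
})

def list_model_ids(raw: str) -> list[str]:
    """Extract model IDs from an OpenAI-style model-list response."""
    return [m["id"] for m in json.loads(raw)["data"]]

print(list_model_ids(sample_response))
# Against a live cluster, the same parsing would apply to the body of an
# authenticated HTTP GET to your endpoint's /v1/models path.
```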
## Recommended model bundles
In SambaStack, models are not deployed individually; they are deployed as bundles. A bundle is a packaged deployment that groups one or more models together with their associated configurations, such as batch size, sequence length, and precision settings. For example, deploying the Meta‑Llama‑3.3‑70B model with a batch size of 4 and a sequence length of 16K tokens constitutes a single configuration. A bundle, however, can contain multiple such configurations, either for the same model or for different models.
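As a mental model, a bundle can be pictured as a named list of (model, configuration) pairs. The structure below is an illustrative sketch only; the field names are assumptions, not SambaStack's actual bundle schema:

```python
# Illustrative only: field names and values are assumptions, not
# SambaStack's real bundle schema.
bundle = {
    "name": "example-llama-bundle",  # hypothetical bundle name
    "configurations": [
        # One configuration: a model plus its serving parameters.
        {"model": "Meta-Llama-3.3-70B-Instruct",
         "batch_size": 4, "sequence_length": 16384, "precision": "bf16"},
        # The same bundle may carry another profile of the same model...
        {"model": "Meta-Llama-3.3-70B-Instruct",
         "batch_size": 16, "sequence_length": 4096, "precision": "bf16"},
        # ...or a configuration for an entirely different model.
        {"model": "Meta-Llama-3.1-8B-Instruct",
         "batch_size": 32, "sequence_length": 8192, "precision": "bf16"},
    ],
}

# A single deployment of this bundle covers two distinct models and
# three batch/sequence profiles.
models_in_bundle = {c["model"] for c in bundle["configurations"]}
print(sorted(models_in_bundle))
```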
SambaNova’s RDU technology enables several models and configurations to be loaded simultaneously in a single deployment. This allows you to switch instantly between models and between batch‑/sequence‑size profiles as needed. In contrast to traditional GPU systems—where deployments are typically single‑model and static—SambaStack supports multi‑model, multi‑configuration bundles. This approach delivers higher efficiency, greater flexibility, and increased throughput while preserving low latency.
You can run the following command to discover available bundles in your cluster:
If the bundles listed below do not satisfy your inference requirements, you can create custom bundles that combine any mix of models and configurations so long as they fit in DDR memory.
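Whether a custom bundle fits in DDR can be roughly estimated from model weight sizes. The sketch below shows only the arithmetic; the parameter counts, precisions, and DDR capacity are illustrative assumptions, and real sizing must also account for KV cache and runtime overhead:

```python
# Illustrative arithmetic only: parameter counts, precisions, and the
# DDR capacity are assumed values chosen to demonstrate the fit check.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

def weights_gib(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GiB (weights only; ignores
    KV cache and runtime overhead)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 2**30

# Hypothetical candidate bundle: (model, params in billions, precision).
candidate_bundle = [
    ("Meta-Llama-3.3-70B-Instruct", 70, "bf16"),
    ("Meta-Llama-3.1-8B-Instruct", 8, "bf16"),
]

total = sum(weights_gib(p, prec) for _, p, prec in candidate_bundle)
DDR_CAPACITY_GIB = 768  # hypothetical per-deployment DDR budget
print(f"total weights = {total:.0f} GiB, fits: {total <= DDR_CAPACITY_GIB}")
```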
| Model name | Bundle template | Bundle description | Bundle configuration |
|---|---|---|---|
| MiniMax-M2.5 | | | |
| Meta-Llama-3.3-70B-Instruct | 70b-3dot3-ss-4-8-16-32-64-128k | | Target models; draft models |
| gpt-oss-120b | dyt-gpt-oss-120b-32-64-128k | | |
| DeepSeek-R1-0528 / DeepSeek-V3.1 | deepseek-r1-v31-fp8-16k | | Models |
| DeepSeek-V3-0324 | deepseek-r1-v3-fp8-16k | | Models |
| Llama-4-Maverick-17B-128E-Instruct | llama-4-medium-8-16-32-64-128k | | |
| Whisper-Large-v3 / Qwen3-32B | qwen3-32b-whisper-e5-mistral | | |
| E5-Mistral-7B-Instruct / Meta-Llama-3.1-8B-Instruct | us-agentic-rag-1-1 | | |
