## Deployment options
When deploying models in SambaStack, administrators can select from various context length and batch size combinations:

- Smaller batch sizes provide higher per-request token throughput (tokens/second).
- Larger batch sizes provide better concurrency for multiple users.
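The tradeoff between the two settings can be sketched with a toy model. The numbers and the latency curve below are illustrative assumptions for demonstration only, not SambaStack benchmarks:

```python
# Illustrative sketch only: toy numbers, not SambaStack measurements.
# Models the usual batching tradeoff: a larger batch raises aggregate
# throughput (more users served at once) while lowering the
# tokens/second that each individual request observes.

def per_request_tps(batch_size: int, peak_tps: float = 200.0) -> float:
    """Hypothetical per-request decode speed; shrinks as the batch grows."""
    return peak_tps / (1 + 0.15 * (batch_size - 1))

def aggregate_tps(batch_size: int) -> float:
    """Total tokens/second summed across all requests in the batch."""
    return batch_size * per_request_tps(batch_size)

for bs in (1, 4, 16):
    print(f"batch={bs:2d}  per-request={per_request_tps(bs):6.1f} tok/s"
          f"  aggregate={aggregate_tps(bs):7.1f} tok/s")
```

Under this toy curve, batch size 1 maximizes the speed a single user sees, while batch size 16 serves far more total tokens per second across users.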
## Supported models
You can run the following command to discover available models in your cluster:

| Developer/Model ID | Type | Suggested Use | Context length (batch size) | Features and optimizations | View on Hugging Face |
|---|---|---|---|---|---|
| Meta | | | | | |
| Meta-Llama-3.3-70B-Instruct | Text | | | | Model card |
| Meta-Llama-3.1-8B-Instruct | Text | | | | Model card |
| Meta-Llama-3.1-405B-Instruct | Text | | | | Model card |
| Llama-4-Maverick-17B-128E-Instruct | Image, Text | | | | Model card |
| MiniMax | | | | | |
| MiniMax-M2.5 | Text | | | | Model card |
| DeepSeek | | | | | |
| DeepSeek-R1-0528 | Reasoning, Text | | | | Model card |
| DeepSeek-R1-Distill-Llama-70B | Reasoning, Text | | | | Model card |
| DeepSeek-V3-0324 | Text | | | | Model card |
| DeepSeek-V3.1 | Reasoning, Text | | | | Model card |
| OpenAI | | | | | |
| gpt-oss-120b | Text | | | | Model card |
| Whisper-Large-v3 | Audio | | | | Model card |
| Qwen | | | | | |
| Qwen3-32B | Reasoning, Text | | | | Model card |
| Tokyotech-llm | | | | | |
| Llama-3.3-Swallow-70B-Instruct-v0.4 | Text | | | | Model card |
| Other | | | | | |
| E5-Mistral-7B-Instruct | Embedding | | | | Model card |
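The model list can also be consumed programmatically. Assuming the cluster exposes an OpenAI-compatible `GET /v1/models` endpoint (an assumption; check your SambaStack API documentation), the response can be parsed as below. The sample response is hand-written in that format for illustration, not actual cluster output:

```python
import json

# Hypothetical example: assumes an OpenAI-compatible /v1/models endpoint.
# The response below is a hand-written sample in that format, not real
# SambaStack output.
sample_response = json.dumps({
    "object": "list",
    "data": [
        {"id": "Meta-Llama-3.3-70B-Instruct", "object": "model"},
        {"id": "DeepSeek-V3.1", "object": "model"},
        {"id": "E5-Mistral-7B-Instruct", "object": "model"},
    ],
})

def list_model_ids(raw: str) -> list[str]:
    """Extract model IDs from an OpenAI-style model-list response."""
    return [m["id"] for m in json.loads(raw)["data"]]

print(list_model_ids(sample_response))
# Against a live cluster, the same parsing would apply to the body of an
# authenticated HTTP GET to your endpoint's /v1/models path.
```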
## Recommended model bundles
In SambaStack, models are not deployed individually; they are deployed as bundles. A bundle is a packaged deployment that groups one or more models together with their associated configurations, such as batch size, sequence length, and precision settings. For example, deploying the Meta‑Llama‑3.3‑70B model with a batch size of 4 and a sequence length of 16K tokens constitutes a single configuration. A bundle, however, can contain multiple such configurations, either for the same model or for different models.
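As a mental model, a bundle can be pictured as a named list of (model, configuration) pairs. The structure below is an illustrative sketch only; the field names are assumptions, not SambaStack's actual bundle schema:

```python
# Illustrative only: field names and values are assumptions, not
# SambaStack's real bundle schema.
bundle = {
    "name": "example-llama-bundle",  # hypothetical bundle name
    "configurations": [
        # One configuration: a model plus its serving parameters.
        {"model": "Meta-Llama-3.3-70B-Instruct",
         "batch_size": 4, "sequence_length": 16384, "precision": "bf16"},
        # The same bundle may carry another profile of the same model...
        {"model": "Meta-Llama-3.3-70B-Instruct",
         "batch_size": 16, "sequence_length": 4096, "precision": "bf16"},
        # ...or a configuration for an entirely different model.
        {"model": "Meta-Llama-3.1-8B-Instruct",
         "batch_size": 32, "sequence_length": 8192, "precision": "bf16"},
    ],
}

# A single deployment of this bundle covers two distinct models and
# three batch/sequence profiles.
models_in_bundle = {c["model"] for c in bundle["configurations"]}
print(sorted(models_in_bundle))
```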
SambaNova’s RDU technology enables several models and configurations to be loaded simultaneously in a single deployment. This allows you to switch instantly between models and between batch‑/sequence‑size profiles as needed. In contrast to traditional GPU systems—where deployments are typically single‑model and static—SambaStack supports multi‑model, multi‑configuration bundles. This approach delivers higher efficiency, greater flexibility, and increased throughput while preserving low latency.
You can run the following command to discover available bundles in your cluster:
If the bundles listed below do not satisfy your inference requirements, you can create custom bundles that combine any mix of models and configurations so long as they fit in DDR memory.
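Whether a custom bundle fits in DDR can be roughly estimated from model weight sizes. The sketch below shows only the arithmetic; the parameter counts, precisions, and DDR capacity are illustrative assumptions, and real sizing must also account for KV cache and runtime overhead:

```python
# Illustrative arithmetic only: parameter counts, precisions, and the
# DDR capacity are assumed values chosen to demonstrate the fit check.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

def weights_gib(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GiB (weights only; ignores
    KV cache and runtime overhead)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 2**30

# Hypothetical candidate bundle: (model, params in billions, precision).
candidate_bundle = [
    ("Meta-Llama-3.3-70B-Instruct", 70, "bf16"),
    ("Meta-Llama-3.1-8B-Instruct", 8, "bf16"),
]

total = sum(weights_gib(p, prec) for _, p, prec in candidate_bundle)
DDR_CAPACITY_GIB = 768  # hypothetical per-deployment DDR budget
print(f"total weights = {total:.0f} GiB, fits: {total <= DDR_CAPACITY_GIB}")
```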
| Model name | Bundle template | Bundle description | Bundle configuration |
|---|---|---|---|
| MiniMax-M2.5 | | | |
| Meta-Llama-3.3-70B-Instruct | 70b-3dot3-ss-4-8-16-32-64-128k | | Target models; draft models |
| gpt-oss-120b | dyt-gpt-oss-120b-32-64-128k | | |
| DeepSeek-R1-0528 / DeepSeek-V3.1 | deepseek-r1-v31-fp8-16k | | Models |
| DeepSeek-V3-0324 | deepseek-r1-v3-fp8-16k | | Models |
| Llama-4-Maverick-17B-128E-Instruct | llama-4-medium-8-16-32-64-128k | | |
| Whisper-Large-v3 / Qwen3-32B | qwen3-32b-whisper-e5-mistral | | |
| E5-Mistral-7B-Instruct / Meta-Llama-3.1-8B-Instruct | us-agentic-rag-1-1 | | |
