SambaStack Release Notes - SambaNova Documentation

Release notes for SambaStack, including new features, enhancements, and fixes.

Using SambaWiz to build bundles? Always pull the latest SambaWiz version with each release so you have access to the models and features it introduces.

SambaStack v1.2.0 release

Release Date: July 7, 2026 This release introduces the following key updates:

Text-to-Speech / Audio API – New /v1/audio/speech endpoint
Prompt Caching – Reduced cost and latency for requests sharing a common prefix
Mistral Large 3 675B – Function calling support, context length extended to 32K
MiniMax M2.7 – New prompt-caching bundles
Gemma 4 31B – Video understanding up to 128K context, or text-only up to 256K context (separate bundles, not combined)
A substantial set of security, authentication, and API improvements

For full deployment details, bundle configurations, and context length options for all models and bundles mentioned below, see Supported models and bundles.

New features and enhancements

Text-to-Speech / Audio API SambaStack now supports audio speech generation via a new /v1/audio/speech endpoint, backed by the Qwen3-TTS model family (talker + vocoder).

Generate speech from text following the OpenAI audio speech API shape.
Optional language parameter with validation – unsupported languages return a 400 with the list of supported options.
Optional sampling_params object (temperature, top_p, top_k, seed) to control generation style.

Prompt Caching Prompt caching reduces cost and latency for requests that share a common prefix (long system prompts, document Q&A, multi-turn conversations).

The pricing object in /v1/models now includes input_cache_read and input_cache_write fields showing per-token cache pricing.
The usage object in responses reports cached_tokens separately from input_tokens.
MiniMax M2.7 prompt-caching bundles are now available in SambaStack.

Model updates Added support for the following models in SambaStack 1.2.0:

Mistral Large 3 675B (Mistral-Large-3-675B-Instruct-2512) – Function calling support added and context length support extended to 32K.
MiniMax M2.7 (MiniMax-M2.7) – New prompt-caching bundles with default sampling enabled.
Gemma 4 31B (gemma-4-31B-it) – Now supports video understanding up to 128K context via the gemma-4-31b-32-128k bundle, or text-only context extended to 256K via the separate cd-gemma-4-31b-32-128-256k bundle. The two are not combined – the 256K bundle is text-only and does not support image or video input.
Qwen3-TTS (qwen3-tts) – Talker and vocoder models for the new /v1/audio/speech endpoint.

Model CRs now public All Model Custom Resources are now rendered with public: true in the SambaStack chart, making them visible via API discovery without requiring operator intervention.

Model-level fields Added support for model-level fields defined in Model Custom Resources. These fields apply automatically to every deployment of that model, so you no longer need to set them in bundles.

Standalone authentication service The authentication service can now be deployed independently from the main backend, enabling finer-grained scaling and upgrade control.

API improvements and fixes

Streaming fix: tool-call content Tool-call requests were fully buffered instead of streamed, affecting all deployed model families. This release restores true incremental streaming for tool-call requests.

Messages API (/v1/messages) fixes

stop_reason – Now correctly returns "tool_use" when the model invokes a tool.
disable_parallel_tool_use – Now correctly enforced.
role: "system" – The messages[] array now accepts role: "system" for MCP client compatibility.
Auth error shape – Authentication errors now return the Anthropic-standard error envelope.

ignore_eos parameter Requests can now pass ignore_eos: true to continue generation past the model’s end-of-sequence token. That feature is disabled by default for security reasons and needs to be enabled in the bundle deployment.

response_format fix for reasoning models Structured output parsing now uses reasoning-stripped completion text, preventing <think> tokens from corrupting JSON extraction on Gemma 4 and other reasoning models.

Bug fixes

Service-tier defaults – Fixed an issue where some models weren’t included in any service tier by default, requiring manual setup before they could serve requests. All models are now included by default.
Vision model path substitution – Fixed an issue with path substitutions for vision models.
Gemma 4 tool calling in custom bundles – Fixed an issue causing Gemma 4 to return malformed tool calling responses in custom bundles.
Multimodal pipeline – Fixed the reasoning field being dropped during multimodal message preprocessing.
Thinking prefix – Fixed spurious thinking prefixes being emitted on responses that don’t require them.
Liveness probe – Liveness probes are now kept alive during long single-request generation runs, preventing false-positive restarts when the model is actively generating a response.
Qwen3-TTS audio artifacts – Upgraded PyTorch 2.4 → 2.7 to eliminate blast noise on generated audio.
Invalid bundle resources – Fixed default bundle resource specs where hugepages were specified without cpu/memory limits.
Tiktoken cache for airgap – Tiktoken vocabulary files are now cached for airgapped GPT-OSS deployments.

Known issues

Constrained decoding on Gemma 4 31B – Only available with text-only input.
Qwen3-TTS mid-stream audio artifacts – Intermittent audio glitches occur in clips longer than approximately 60 seconds. For lengthy synthesis jobs, artifact frequency increases with clip length. Clips under 60 seconds are generally unaffected.
Qwen3-TTS long-input audio degradation – For text inputs of approximately 1,000 words or more, audio quality may degrade mid-generation, in some cases collapsing to near-silence or garbled output. Keep individual TTS requests under 800 words for best results; split longer content into multiple requests.
Qwen3-TTS missing voice validation – Requests with an invalid or missing voice field return HTTP 200 with audio output instead of a 4xx error.
DeepSeek-V3.2 response_format – Structured output requests (response_format: json_object or json_schema) may fail for DeepSeek-V3.2. This is an ongoing regression from the previous release.
70b-3dot3-ss-full-whisper fails to deploy – This bundle fails to start due to a recurring startup probe timeout. Redeploy attempts do not resolve the issue.

SambaStack v1.1.1 release

Release Date: May 27, 2026 This release introduces the Anthropic Messages API (/v1/messages), four new SambaStack models (Mistral-Large-3-675B-Instruct-2512, gpt-oss-20b, gemma-4-31B-it, MiniMax-M2.7), migrates bundle configuration into PEF Custom Resources, and remediates several outstanding CVEs.

For full deployment details, bundle configurations, and context length options for all models and bundles mentioned below, see Supported models and bundles.

Backwards-incompatible changes.This release removes the resubmit bundle mechanism and renames model Qwen3-235B to its full name Qwen3-235B-A22B-Instruct-2507. See the inline warnings in New models and Bundle configuration changes before upgrading.

New features and enhancements

Anthropic Messages API (/v1/messages) A new Anthropic-compatible Messages endpoint is now available. SambaStack operators can now serve Anthropic-formatted traffic directly, without requiring translation to the OpenAI API format.

New models Added support for the following models in SambaStack 1.1.1:

Mistral Large 3 675B (Mistral-Large-3-675B-Instruct-2512)
gpt-oss 20B (gpt-oss-20b)
Gemma 4 31B (gemma-4-31B-it) - with function calling and reasoning-parser support
MiniMax M2.7 (MiniMax-M2.7)

Qwen3-235B renamed to Qwen3-235B-A22B-Instruct-2507. Update client model names from Qwen3-235B* to Qwen3-235B-A22B-Instruct-2507.1B draft model experts marked private: true. Confirm no client code addresses these experts as public models.

New PEFs

PEF-1914 (gpt-oss-fp8-ss131072-bs8-dyt-1-cd:2) for the existing gpt-oss-120b model - more performant version of gpt-oss-fp8-ss131072-bs8-dyt-1-cd with added support for logit_bias.
PEF-1910 (minimax-m2p5-ss196608-bs2-dyt-1-cd:1) provides up to 192k context length with structured output support for both MiniMax-M2.5 and MiniMax-M2.7.

Bundle configuration changes This release continues moving deployment configuration from bundles into PEF Custom Resources.

Checkpoint sharing is now configured automatically based on PEF CRs. checkpoint_sharing_uuid is always sourced from the corresponding PEF CR; prebuilt bundles no longer override it, and user-defined bundles should not override it either.
continuous_batching and constrained_decoding are now configured on PEF CRs rather than bundles.
apply_default_sampling_params added as an expert-level field in BundleTemplates. When set to True, sampling-parameter values are populated from the model’s generation_config.json.

Resubmit mechanism removed from bundles. Remove resubmit_to, resubmit_tool, and enableResubmit from custom bundles.

API improvements and fixes

logit_bias parameter The logit_bias parameter is now supported in chat completions.

Accepts a JSON object mapping token IDs to bias values in the range -100 to 100, matching the OpenAI specification.
Applies to text-generation models only - not supported for high throughput models or multimodal models.

response_format accepts null The API-server validator now accepts null for response_format, matching OpenAI’s looser shape.

Invalid request error formats improved Error messages on bad input are clearer, and error responses now align with the OpenAI standard. http error was renamed to api http error for cross-platform consistency. Example response for an out-of-range logit_bias token ID:

{
  "error": {
    "message": "Token id 99999999 is out of the vocabulary range.",
    "type": "invalid_request_error",
    "param": "logit_bias",
    "code": "invalid_value"
  },
  "request_id": "d7tfhubndjoecjv2ao50"
}

Bug fixes

modelSpecs overrides handling modelSpecs overrides now correctly apply zero-value fields (empty lists, scalar false, empty strings, zero ints). Previously these were silently skipped during the helm merge. For example:

configs:
  modelSpecs:
  - "*"
  - name: example-model
    metadata:
      category: []          # empty list now applied
      provider: ""          # empty string now applied
    price:
      input_tokens: 0       # zero int now applied
      output_tokens: 0
    public: false           # scalar bool now applied

mm_token_type_ids handling mm_token_type_ids added by transformers 5.5.3 is now handled correctly.

CVE remediation This release includes fixes for multiple critical CVEs and security hardening improvements. Specific CVE references will be added after vulnerability-scan review.

Known issues

Mistral Large 3 675B (Mistral-Large-3-675B-Instruct-2512) currently has issues with function calling accuracy.
gpt-oss-20b and gpt-oss-120b will fail to deploy in airgapped evironments. Please contact your SambaNova admin for assistance with deploying these models in airgapped environments.

SambaStack v1.0.57 release

Release Date: April 30, 2026 This release introduces new model support (Gemma 3 12B, Gemma 3 27B, DeepSeek V3.2), high-throughput configurations for DeepSeek models, constrained decoding support for GPT-OSS, more accurate and informative bundle legalizer responses, and major API additions including the OpenAI Responses API, the n and seed parameters, improved TTFT measurement, and OpenAI-conformant error responses.

For full deployment details, bundle configurations, and context length options for all models and bundles mentioned below, see Supported models and bundles.

Disruptive release - downtime required to upgrade.Allow approximately 1 hour per node when planning your maintenance window.

Single SambaRack SN40L-16: Full system downtime required for the duration of the upgrade.
Multiple SambaRack SN40L-16 nodes running the same bundle: Rolling upgrade supported - apply nodes sequentially to maintain service at reduced capacity.

New features and enhancements

New models Added support for the following models in SambaStack 1.0.57:

Gemma 3 12B (gemma-3-12b-it) - image understanding
- gemma3-v3: 128K context, BS 2/4/6/8
Gemma 3 27B (gemma-3-27b-it) - image understanding
- gemma3-27b-32-128k: 32K and 128K context, BS 2/4/6/8
Qwen3 235B (Qwen3-235B-A22B-Instruct-2507) - the legacy Qwen3-235B model name is preserved for backward compatibility
- dyt-qwen3-235b-32-128k: 32K context (BS 2/4/6/8), 128K context (BS 2)
- qwen3-235b-16-32-64k: 16K, 32K, 64K context
- qwen3-235b-128k: 128K context
DeepSeek V3.2 - available in high-interactivity configurations (up to 128k context) and high-throughput configurations (up to 32k context)
GPT-OSS 120B - adds constrained decoding capability; the previous standard DYT bundle is replaced by two bundles with constrained decoding support enabled
- cd-dyt-gpt-oss-120b-32-64-128k: 32K, 64K, 128K context
- cd-dyt-gpt-oss-120b-8-32-64-128k: 8K, 32K, 64K, 128K context

High-throughput configurations SambaStack 1.0.57 introduces high-throughput configuration options for running on SambaRack nodes.

Optimized for large-scale serving of a single model, prioritizing total system throughput over per-user latency to support high volumes of concurrent users
Suited for use cases that do not require low end-to-end latency or interactivity
Requires a minimum of 4 SambaRack nodes dedicated to this configuration
Cannot be bundled with other models
Transparent to users - no API or client code changes required
Uses a disaggregated prefill-decode architecture with a configurable node ratio - for example, 3 nodes running prefill and 1 running decode

Supported models: DeepSeek-R1, DeepSeek-V3-0324, DeepSeek-V3.1, DeepSeek-V3.1-Terminus, and DeepSeek-V3.2. See Supported models and bundles. Use the following log metrics to monitor high-throughput deployments:

decode_queue_time - time requests wait in the decode queue before processing begins (new in 1.0.57)
time_to_first_token - latency from request receipt to first output token
completion_tokens_per_sec - decode throughput

Constrained decoding (structured output) Added constrained decode mask sampling on a per-token schedule. Models that declare supports_constrained_decoding in their PEF CRs can now use structured output generation with JSON schema enforcement.

Set constrained_decoding: true in the BundleTemplate to enable this feature for the bundle.

TTFT measurement improvement process_request() processing time is now included in TTFT and end-to-end latency measurements for more accurate reporting.

Bundle legalizer: accuracy and validation improvements Improved legalizer accuracy for memory accounting; bundles that exceed available memory or host segment size are now rejected at validation time rather than failing at runtime.

Bundle memory utilization in bundle CR status The Bundle CR status now includes a legalizerInfo block with memory utilization data from the bundle legalizer. Use kubectl get bundle <bundle-name> -o yaml to inspect the block.

Field	Description
`status`	Legalizer validation result: `passed`, `failed`, or `skipped`
`errors`	List of validation errors (present when `status: failed`)
`warnings`	Non-fatal warnings (present when `status: passed` with warnings)
`ddr`	DDR memory utilization ratio
`hbm_resident`	HBM resident memory utilization; can exceed 1.0 under over-allocation
`host`	Host memory utilization ratio

The status field is absent if the legalizer output could not be processed. skipped only appears when skip_legalizer: true is set on a PEF CR.

PEF and checkpoint lifecycle status PEF CRs now include a pef_status field, and checkpoint CR version entries include a checkpoint_status field, giving operators visibility into artifact lifecycle state without needing to inspect pod logs. Status values for both fields:

preview - newly available configurations which may not be fully tested and / or may not have full feature support; not recommended for production use cases
stable - well-tested, production-ready configurations
deprecated - still functional; scheduled for removal in a future release
removed - no longer available; must be replaced before deployment

Check pef_status when reviewing PEF CR versions before deploying a bundle to confirm the version is stable.

API improvements and fixes

Enhancements to improve OpenAI compatibility and new API capabilities across the chat/completions endpoint and a new responses endpoint. Responses API The OpenAI-compatible Responses API (POST /v1/responses) is now supported.

Currently supported model: gpt-oss-120b
Supported capabilities: text generation (streaming and non-streaming), function calling (2-step), structured output (JSON schema), multi-turn conversations (client-managed state), and reasoning output
Uses stateless request semantics - conversation history is passed by the client on each request

reasoning_tokens in usage response Reasoning models now include a reasoning_tokens field in the usage object, reporting the number of tokens consumed by the model’s internal reasoning step.

tool_choice support for GPT-OSS 120B tool_choice is now supported for gpt-oss-120b. Accepted values: auto, none, required, and {"type": "function", "function": {"name": "..."}}.

Only gpt-oss-120b supports tool_choice in Release 1.0.57
The forced function call format follows the Chat Completions API structure - the inner "function" key is required; this differs from the Responses API format which omits it
allowed_tools is not supported

tool_choice is supported only for gpt-oss-120b bundles with constrained decoding enabled: cd-dyt-gpt-oss-120b-32-64-128k and cd-dyt-gpt-oss-120b-8-32-64-128k.

n parameter - multiple completions The n parameter is now supported in chat completions.

Valid range: 1–16 (default: 1)
Implemented via API-level decomposition - n parallel single-completion requests are issued and combined before returning to the client
Not supported when tools or functions are present in the request

seed parameter The seed parameter is now supported for reproducible outputs.

Accepts any integer, including negative values
Applies to text generation models only - not supported for multi-modal or continuous batching models
system_fingerprint is not returned (unlike OpenAI)

OpenAI-conformant error responses All API error responses now use the OpenAI-standard error format. A new top-level request_id field is included in every error response.

{
  "request_id": "abc-123",
  "error": {
    "message": "...",
    "type": "...",
    "param": null,
    "code": "..."
  }
}

Provide request_id to SambaNova support when reporting an issue.

Structured output parser fixes Fixed bugs in structured output that produced incorrect output for certain schema patterns.

Better error message for tool call truncation Improved the error message when a tool call exceeds the maximum token length.

Empty streaming chunk removed Removed a spurious empty chunk emitted during streaming responses, improving conformance with OpenAI streaming behavior.

Whisper rate limit error code fix Fixed the Whisper transcription endpoint to return the correct 429 status code when the rate limit is exceeded.

Bug fixes

PEF cache fix Fixed a bug with PEF cache when migrating Bundles to use PEF CRs

Auth provider validation fix Fixed a Helm auth provider validation issue that incorrectly rejected custom secrets when Keycloak was also enabled.

CVE dependency updates Updated multiple dependency versions across the inference operator, global model router, and supporting libraries to address known CVEs.

Known issues

Gemma 3: function calling not supported Gemma 3 12B and 27B do not support native function calling. The toolSupport: true flag in bundles using Gemma models indicates support for JSON output schema, rather than general native function calling support.

Impact: Requests using the tools parameter with gemma-3-12b-it will not produce function call outputs.
Workaround: Function calling behavior can be approximated by implementing tool-use logic via user prompts.

Gemma 3 27B: vision mode instability at 104k–112k context length Vision requests to gemma-3-27b-it may experience intermittent errors.

Impact: Vision requests to gemma-3-27b-it at 104k–112k token context lengths may experience intermittent HTTP 524 timeout errors with elevated latency. Text and function calling modes are not affected.
Workaround: Avoid vision requests in the 104k–112k context window range for now. We are looking in to a fix.

SambaStack v0.5.17 release

Release Date: April 8, 2026 This release introduces support for SambaRack SN40L-16 configuration with 4TB of DDR memory.

New features and enhancements

Cluster-level memory management Adds support for declaring cluster-wide DDR memory limits for SambaRack SN40L-16 via an environment variable in sambastack.yaml, enforced by the bundle validation tool.

The default memory limit is 12TB. Update this value to support SambaRack SN40L-16 with 4TB of DDR.
Set the DDR_PER_RDU_GB environment variable: default is 768 (12TB per-node), set to 256 for 4TB per-node. For more details, see the SambaStack.yaml reference
The memory limit applies to all SambaRack nodes in the cluster.
The bundle validation tool enforces the limit at runtime. Configurations exceeding the limit fail with an informative error and must be refactored by removing model configurations.

Known issues

Inventory check shows degraded status for 4TB memory configurations Running snfadm inventory shows “degraded” status for RDUs that are in nodes with 4TB memory configurations.

Impact: Cosmetic only. Does not affect operation.
Resolution: Expected to be resolved in a future release.

SambaStack v0.5.14 release

Release Date: April 1, 2026 This release introduces simplified checkpoint discovery, new model support (MiniMax-M2.5, Agentic RAG bundle), enhanced installation verification tools, and multiple API enhancements for improved OpenAI compatibility.

For full deployment details, bundle configurations, and context length options for all models and bundles mentioned below, see Supported models and bundles.

New features and enhancements

Checkpoint path discovery via model CRs Checkpoint paths are now discoverable through the Model Custom Resource (CR), eliminating the need for customers to manually locate checkpoint paths in configuration files. The following Kubernetes command can be used to view Model CRs, which now contain checkpoint paths:

kubectl -n <namespace> describe model <model-name>

For example:

kubectl describe model minimax-m2.5

Model CRs now include checkpoint path information for all supported models.
Supports multiple checkpoints for different model configurations.
Backwards compatible with existing bundle configurations - checkpoint paths in bundle CRs override Model CR paths if specified.
Works with on-prem and air-gapped deployments.

Improved discovery of available models and bundles All available models and bundles are now applied automatically by the SambaStack helm chart. No additional configuration is required.

Models can be discovered using kubectl -n <namespace> get models.
Bundles can be discovered using kubectl -n <namespace> get bundles.

Agentic RAG bundle Added the us-agentic-rag-1-1 bundle, a multi-model bundle optimized for retrieval-augmented generation (RAG) workflows. It contains the following model configs:

gpt-oss-120b
- Seq Length: 32K, BS: 4
- Seq Length: 64K, BS: 2
- Seq Length: 128K, BS: 2
Llama-4-Maverick-17B-128E-Instruct
- Seq Length: 8K, BS: 1
- Seq Length: 16K, BS: 1
Meta-Llama-3.3-70B (Target) / Meta-Llama-3.2-1B (Draft)
- Seq Length: 4K, BS: 1, 4, 8, 16, 32
- Seq Length: 8K, BS: 1, 4, 8
- Seq Length: 16K, BS: 1, 4
- Seq Length: 32K, BS: 1, 4
- Seq Length: 64K, BS: 1
- Seq Length: 128K, BS: 1
Meta-Llama-3.1-8B-Instruct
- Seq Length: 4K, BS: 1, 4, 16, 32
- Seq Length: 8K, BS: 1, 4, 16, 32
- Seq Length: 16K, BS: 1, 4, 8
E5-Mistral-7B-Instruct
- Seq Length: 4K, BS: 1, 4, 8, 16, 32

MiniMax-M2.5 model support Added support for MiniMax-M2.5 model on SambaStack.

Checkpoint accessible via your artifact reader service account.
Customers can include MiniMax-M2.5 in bundles that pass bundle validation and deploy successfully.
Includes reasoning support.

Pre/post-install verification scripts New verification scripts help customers validate their SambaStack environment before and after installation.

Pre-install script: Validates all hardware, connectivity, and software prerequisites.
Post-install script: Confirms all SambaStack components are installed and running correctly.
Clear pass/fail reporting with actionable guidance on failures.
Scripts maintained and validated against the current SambaStack release.
Distributed via the sambastack-tools public GitHub repository with README instructions.

Per-model queue depth configuration Added support for configuring different queue depths based on context length, enabling optimized uptime for high-context-length requests.

Queue depths can now be configured per context length group using contextGroups in the Service Tier configuration.
Queue depth controls how many concurrent requests can be queued for a model configuration.
Lower queue depths for higher context lengths help prevent memory exhaustion and improve overall service stability.
SambaStack now validates queue depth configuration at request time. Misconfigured models with missing queue depth definitions will surface a clear error instead of failing silently.
The empty string "" in contextLengths matches requests to the base model name without a context length suffix (e.g., DeepSeek-R1-0528). Requests with explicit suffixes like -8k or -128k match their corresponding contextLengths values.

Context length suffixes (8k, 16k, 32k, etc.) are case-sensitive. Use lowercase k in all configurations. This applies to all models supported by SambaNova.

The contextGroups field is a sub-component of a model grouping within a service tier. Example configuration:

free:
  - models:
      - DeepSeek-V3-0324
      - DeepSeek-R1-0528
    queueDepth: 10
    qos: free
    rates:
      - allowedRequests: 100
        periodSeconds: 60
example_service_tier:
  inherits: free
  overrides:
    - models:
        - DeepSeek-V3-0324
      queueDepth: 10
      qos: example
      rates:
        - allowedRequests: 100
          periodSeconds: 60
    - models:
        - DeepSeek-R1-0528
      queueDepth: 10
      qos: example
      contextGroups:
        - contextLengths: ["", "8k"]
          queueDepth: 2
        - contextLengths: ["16k", "32k", "64k", "128k"]
          queueDepth: 1
      rates:
        - allowedRequests: 100
          periodSeconds: 60

Helm chart configuration update The substitutions field has moved from bundles to global in the SambaStack Helm chart.

This is a breaking change that affects air-gapped and NFS customers. Update your Helm values file before upgrading.

Before:

bundles:
  substitutions:
    gs://<SAMBASTACK_ARTIFACTS_BUCKET>: nfs:///nfsdata

After:

global:
  substitutions:
    gs://<SAMBASTACK_ARTIFACTS_BUCKET>: nfs:///nfsdata

API improvements and fixes

Enhancements to improve OpenAI compatibility across the chat/completions endpoint and a new, non-standard feature to track usage in streaming chunks. Text object support in user message content

Expanded support for text objects in content arrays, matching OpenAI ChatCompletionsContentPartText specification.
Enabled for: DeepSeek-R1-0528, Llama-3.3-Swallow-70B-Instruct-v0.4, and MiniMax-M2.5.

Log probabilities

Added logprobs field that, when set to true, returns the log probabilities for each generated token.
Added top_logprobs field that, when set to an integer n, returns the top n log probabilities for each generation.

Usage in streaming chunks

Added a non-standard feature to allow users to obtain partial usage statistics in chunks returned in streaming responses.
This feature is enabled by setting STREAM_USAGE_IN_CHUNKS: true in the replica group section of your custom bundle deployment.

Tool choices Supported only for models that support function calling.

tool_choice: none ensures that the model will not see available tools.

Invalid message role validation

The chat/completions endpoint now rejects invalid message roles. Only user, assistant, system, and tool are accepted.

Whisper audio error response

The Whisper transcription endpoint now returns descriptive error messages when audio file processing fails, instead of a bare HTTP 400 status code.

Bug fixes

Function calling routing fix

Fixed an issue where function calling routing did not apply the model name prefix check correctly, causing some models to skip tool routing.

Air-gap inventory fix

Fixed the air-gap inventory to include the correct cloudnative-pg image configuration, preventing missing image errors during offline installation.

Known issues

SambaRack Manager does not support 2 PDU configurations SambaRack Manager does not currently support configurations with 2 PDUs. Customers using 2 PDU setups should contact SambaNova Support for guidance on alternative configurations.

SambaStack v0.4.8 release

Release Date: March 10, 2026 This release introduces air-gapped deployment support, custom checkpoint management with NFS storage, swappable model configurations, and multiple API enhancements for improved OpenAI compatibility.

New features and enhancements

SambaStack air-gapped support Added support for air-gapped mode of operation, enabling secure, isolated deployments.

Install, upgrade, and setup for air-gapped configurations is performed in conjunction with SambaNova support.
Ongoing administration (Auth, User Management, Custom DB) is designed for self-service and follows the same workflows as on-prem deployments.

Install, setup, port forwarding to access Keycloak UI, and upgrade steps are not documented for air-gapped deployments due to varying customer network configurations. Please work with SambaNova support for these workflows.

Custom checkpoints with NFS storage Added the ability to reference custom checkpoints from customer-provided NFS storage in deployments. Swappable models in bundles Added configurable model swapping behavior to optimize high-bandwidth memory (HBM) utilization.

By default, all models in bundles can be swapped out of HBM and replaced with other models in DDR memory.
Use the swappable: <boolean> field in the bundle YAML definition to enable or disable this behavior.
Default value is true. When set to false, the model remains in HBM and cannot be swapped out, ensuring zero switching time for requests to that model.

API improvements and fixes

Enhancements to improve OpenAI compatibility across the chat/completions endpoint. Text object support in user message content

Added support for text objects in content arrays, matching OpenAI ChatCompletionsContentPartText specification.
Enabled for: gpt-oss-120b, DeepSeek-V3.1, DeepSeek-V3.1-Terminus, DeepSeek-V3.2, DeepSeek-V3-0324, Qwen3-32B, Qwen3-235B.

Response format text option

Fixed an issue where response_format=text would throw an error.
The endpoint now supports all OpenAI formats: text, json_object, json_schema.

Extended temperature range

Expanded temperature range from 0.0–1.0 to 0.0–2.0, matching OpenAI specification.

Tool calling number type fix

Tools with number-type arguments were always returned as floats.
Now integers are preserved as integers, matching JSON Schema number specification.

Parallel tool calls support

Added parallel_tool_calls parameter support.
When set to false, the model will make at most one tool call per response, matching OpenAI specification.

Streaming token usage reporting

Added support for token usage reporting in each chunk of stream.

Known issues

Parallel Tool Calls with Constrained Decoding.
The following models return null for logprobs even when logprobs=true or top_logprobs is set. The parameters are accepted without error but have no effect:
- Llama-4-Maverick-17B-128E-Instruct
- Whisper-Large-v3

SambaStack initial release

Release Date: September 19, 2025 This release introduces the comprehensive SambaStack documentation suite.

New features and enhancements

SambaStack guide Added the SambaStack Guide, providing step-by-step instructions for deploying, configuring, and managing SambaStack.

Setup, installation, and environment configuration.
User and authentication management (Keycloak, OIDC).
Monitoring, logging, and artifact management.
Bundle and model deployment workflows.
Common command reference.

SambaStack models Added the SambaStack models and bundles page to help customers understand which models are available on SambaStack and how to configure them.

Lists all supported models (e.g., Llama 3.3, Llama 4 Maverick, DeepSeek).
Shows context length, batch size options, and supported features.
Instructions for using the Model list API to check availability in your environment.

​SambaStack v1.2.0 release

​New features and enhancements

​API improvements and fixes

​Bug fixes

​Known issues

​SambaStack v1.1.1 release

​New features and enhancements

​API improvements and fixes

​Bug fixes

​Known issues

​SambaStack v1.0.57 release

​New features and enhancements

​API improvements and fixes

​Bug fixes

​Known issues

​SambaStack v0.5.17 release

​New features and enhancements

​Known issues

​SambaStack v0.5.14 release

​New features and enhancements

​API improvements and fixes

​Bug fixes

​Known issues

​SambaStack v0.4.8 release

​New features and enhancements

​API improvements and fixes

​Known issues

​SambaStack initial release

​New features and enhancements

SambaStack v1.2.0 release

New features and enhancements

API improvements and fixes

Bug fixes

Known issues

SambaStack v1.1.1 release

New features and enhancements

API improvements and fixes

Bug fixes

Known issues

SambaStack v1.0.57 release

New features and enhancements

API improvements and fixes

Bug fixes

Known issues

SambaStack v0.5.17 release

New features and enhancements

Known issues

SambaStack v0.5.14 release

New features and enhancements

API improvements and fixes

Bug fixes

Known issues

SambaStack v0.4.8 release

New features and enhancements

API improvements and fixes

Known issues

SambaStack initial release

New features and enhancements