SambaStack glossary - SambaNova Documentation

Hardware

BMC (Baseboard Management Controller)
A specialized service processor that remotely monitors and manages the physical state of a system, such as an XRDU or host.

Core Services Rack (CSR)
A rack containing switching and management infrastructure for a SambaRack deployment. Exact contents depend on the site specifics and the product.

RDU (Reconfigurable Dataflow Unit)
SambaNova’s AI processor. RDUs use dataflow to create an assembly line of operations that eliminates the memory bottlenecks faced by other solutions, making AI faster, more efficient, and more scalable. RDUs are dynamically reconfigurable, so a single SambaRack system can run multiple models and configurations and switch quickly between them. Learn more: Dataflow Architecture.

RDU-C (RDU Controller)
A management component within each XRDU module that controls RDU operations alongside the BMC.

SambaRack node
A single SambaNova system built with Reconfigurable Dataflow Units (RDUs).

SambaRack SN40L-16
The fourth-generation SambaNova system, built with 16 SN40L RDUs across eight XRDU modules.

SN40L-H host module
An x86-based server running Red Hat Enterprise Linux that manages SambaRack node operations and provides the primary interface to the rack’s compute resources.

XRDU
A generation-agnostic term for hardware modules that house RDUs, memory, network interfaces, and other components. Eight XRDU modules make up a single SambaRack SN40L-16 node, for a total of 16 RDUs per node.

Platform and software

SNFM (SambaNova Fault Management)
A framework for reporting, diagnosing, and analyzing system error and fault events on SambaRack systems.

SambaNova Runtime
Software installed on SN40L-H host modules that is responsible for executing inference requests, transferring data, managing hardware components, and handling errors and resource allocation.

SambaRack Manager
A tool for centralized hardware administration of on-premises SambaRack deployments, enabling operations at rack, node, device, and group levels. SambaRack Manager is not used for administering Kubernetes or SambaStack software.

Deployment and models

Bundle deployment
A set of models deployed to a SambaRack node. The same deployment can be made to one or more nodes; however, each node can run only one deployment at a time. A SambaStack instance can contain multiple deployments.

Checkpoint
A saved snapshot of a model’s state, consisting primarily of the model’s weights, along with a config file describing the architecture and tokenizer files for handling text input and output.

Custom checkpoint
User-provided fine-tuned or custom model weights that are converted using the Checkpoint Conversion Tool and then deployed as part of a model bundle.

Inference cache pod
A Kubernetes pod that pre-loads and caches model artifacts (PEFs and checkpoints) to reduce model loading times during inference.

Legalizer
A validation process that verifies that a model bundle configuration fits within SambaRack node DDR memory constraints before deployment proceeds.

Model bundle
A group of model configurations combined to run in a deployment. Bundles consist of specific configurations (PEFs) and checkpoints, and enable fast switching between models and configurations without reloading weights.

Model manifest
A Kubernetes resource definition that registers a checkpoint with SambaStack, including the serving name, owner, aliases, and tokenizer configuration used to identify the model in API requests.

PEF
A compiled model binary optimized to run on RDUs that represents one or more specific model configurations.
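The two constraints described in this section (a bundle must fit in node DDR memory, and a node runs only one deployment at a time) can be pictured with a small sketch. This is illustrative only and is not the Legalizer's actual logic: the `BundleEntry` shape, the footprint fields, the simple summation, and the capacity figure passed in are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BundleEntry:
    """One model configuration in a bundle (hypothetical shape)."""
    name: str
    checkpoint_gib: float  # assumed DDR footprint of the checkpoint weights
    pef_gib: float         # assumed DDR footprint of the compiled PEF


def fits_in_ddr(entries: list[BundleEntry], ddr_capacity_gib: float) -> bool:
    """Return True if the summed (assumed) footprint fits the DDR budget."""
    required = sum(e.checkpoint_gib + e.pef_gib for e in entries)
    return required <= ddr_capacity_gib


def assign_deployment(node_to_deployment: dict[str, str],
                      node: str, deployment: str) -> None:
    """Record a deployment on a node, allowing one deployment per node.

    The same deployment may be assigned to several nodes, but a node that
    already runs a different deployment is rejected.
    """
    current = node_to_deployment.get(node)
    if current is not None and current != deployment:
        raise ValueError(f"{node} already runs {current}")
    node_to_deployment[node] = deployment
```

A real validation would likely account for details this sketch ignores, such as weights shared between configurations of the same model.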

Inference and performance

Batch size
The number of requests processed concurrently in a single inference pass. Smaller batch sizes yield higher per-request token throughput; larger batch sizes improve concurrency for multiple simultaneous users.

Expert
A model instance bound to a specific sequence length profile (such as 4K, 8K, 16K, 32K, or 128K) within a deployed bundle. An expert paired with a specific batch size defines the complete runtime configuration for model execution at a given sequence length and concurrency level.

Gateway
The API gateway microservice that receives incoming user inference requests and routes them to the inference router for processing and load balancing.

Inference engine
A microservice that processes model inference requests for a specific deployed model. An inference engine can be in one of several states: idle, busy, draining, or unhealthy.

Inference operator
A Kubernetes controller that orchestrates the lifecycle of the inference engine and inference cache pods, and serves as the central model database providing model metadata and service tier information to other components.

Inference router
A microservice that manages request batching and routing, distributing inference workloads across available inference pods based on model availability and queue state.

Sequence length
The maximum number of input and output tokens the model can process and generate in a single request. Common values include 4K, 8K, 16K, 32K, 64K, and 128K tokens.
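To make the relationship between expert, batch size, and sequence length concrete, the sketch below picks a runtime configuration for a request. It is a hypothetical selection rule written for illustration, not the inference router's actual behavior: `RuntimeConfig`, `pick_config`, and the "smallest fit" policy are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    """An expert (sequence length profile) paired with a batch size."""
    seq_len: int      # maximum input + output tokens per request
    batch_size: int   # requests processed concurrently in one pass


def pick_config(configs: list[RuntimeConfig], total_tokens: int,
                concurrent_requests: int) -> RuntimeConfig | None:
    """Pick a configuration under an assumed 'smallest fit' policy.

    1. Keep configurations whose sequence length can hold the request.
    2. Prefer the shortest such sequence length.
    3. Among those, prefer the smallest batch size that covers the current
       concurrency; if none covers it, fall back to the largest batch size.
    """
    fitting = [c for c in configs if c.seq_len >= total_tokens]
    if not fitting:
        return None  # the request exceeds every deployed sequence length
    shortest = min(c.seq_len for c in fitting)
    same_len = [c for c in fitting if c.seq_len == shortest]
    covering = [c for c in same_len if c.batch_size >= concurrent_requests]
    if covering:
        return min(covering, key=lambda c: c.batch_size)
    return max(same_len, key=lambda c: c.batch_size)
```

With configurations at 4K and 16K sequence lengths, a 5,000-token request would fall through to a 16K expert, and the batch size would then be chosen from the configurations available at that length.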

Speculative decoding

An inference acceleration technique that pairs a fast draft model with a larger target model. The draft model proposes multiple tokens in parallel; the target model verifies them, reducing latency and increasing throughput compared to standard autoregressive decoding.

Acceptance rate
In speculative decoding, the percentage of tokens proposed by the draft model that the target model accepts. A higher acceptance rate indicates better alignment between the two models and greater inference speedup.

Draft model
The fast, lightweight model used in speculative decoding. It proposes multiple candidate tokens in parallel, which the target model then verifies.

Speculative decoding pair
A draft model and target model configured together to enable speculative decoding. Both models must be deployed together, and both checkpoints can be customized.

Target model
The primary large model in a speculative decoding deployment. It accepts or rejects token proposals from the draft model and produces the final output.
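The propose-and-verify loop and the acceptance rate can be sketched with toy models. This is a minimal greedy variant under stated assumptions, not SambaStack's implementation: both "models" are plain Python functions that map a token list to the next token, the draft proposes its `k` tokens one after another, and the target is called once per position for readability (a real target model would typically verify all proposals in a single pass). The names `speculative_step` and `generate` are hypothetical.

```python
from typing import Callable, List, Tuple

# A toy "model": given the tokens so far, return the next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(draft: NextToken, target: NextToken,
                     context: List[int], k: int) -> Tuple[List[int], int]:
    """Run one propose/verify round; return (new_tokens, accepted_count)."""
    # Draft phase: propose k candidate tokens.
    proposals: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposals.append(tok)
        ctx.append(tok)

    # Verify phase: accept proposals until the first disagreement.
    out: List[int] = []
    ctx = list(context)
    accepted = 0
    for tok in proposals:
        expected = target(ctx)
        if tok != expected:
            out.append(expected)  # the target's token replaces the rejection
            break
        out.append(tok)
        ctx.append(tok)
        accepted += 1
    else:
        out.append(target(ctx))  # all accepted: the target adds one more token
    return out, accepted


def generate(draft: NextToken, target: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], float]:
    """Generate tokens and report acceptance rate = accepted / proposed."""
    tokens = list(prompt)
    proposed = accepted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        new, acc = speculative_step(draft, target, tokens, k)
        tokens.extend(new)
        proposed += k
        accepted += acc
    tokens = tokens[:len(prompt) + max_new_tokens]
    rate = accepted / proposed if proposed else 0.0
    return tokens, rate
```

In this greedy variant the output always matches what the target model would have produced on its own; the draft model only affects how many target rounds are needed. A draft that agrees with the target on every token yields an acceptance rate of 1.0 and `k + 1` tokens per round.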
