SambaStack emits two primary telemetry surfaces to help you observe and operate your deployments:
- Metrics (Prometheus) – Router-level metrics such as traffic, latency, queueing, and worker state. See Metrics.
- Logs (Logging Events / Manifest Events) – Detailed per-request execution events from the model runtime and other services. See Logs.
Telemetry generally falls into three types: metrics (numeric time series for aggregation and alerting), logs (discrete events for debugging and forensics), and traces (request-path records for latency analysis and root-cause investigation). SambaStack's two primary surfaces cover the first two.
Monitoring Stack
SambaStack emits these two telemetry surfaces so that you can observe and troubleshoot AI inference workloads running on SambaNova racks. With a monitoring stack in place, you can collect, store, and visualize:
- System and application logs (control plane and data plane)
- Audit events and access traces
- Usage metrics (QPS, latency, queue time, memory utilization)
- User activity (active users, sessions)
- Health and availability signals (node status, pod status, model health)
Many third-party tools and services can be used to build a monitoring and observability stack that fits your organization. The reference architecture described here is SambaNova's suggested implementation: an example default stack built on widely used open-source tools. Adopting it is completely optional.
Many customers already have mature monitoring solutions. The SambaStack monitoring architecture is modular, so you can:
- Adopt the full stack as provided, or
- Swap individual components with equivalents from your existing observability platform (Splunk, Datadog, Elasticsearch, New Relic, etc.)
Reference architectures use third-party products. There is no guarantee that these architectures will be kept in sync with version or command-syntax changes in those products. Direct any issues not specific to SambaStack to the respective vendor.
Components
SambaStack’s reference monitoring stack uses four primary components:
| Component | Tool | Description |
|---|---|---|
| Log Forwarder | Fluent Bit | Collects logs from Kubernetes pods, nodes, and system services. Parses, enriches, and forwards logs to OpenSearch. |
| Log Storage | OpenSearch | Stores logs and audit trails at scale. Provides search, filtering, and aggregation for log data. |
| Metrics Collection | Prometheus | Scrapes metrics from SambaStack services, Kubernetes components, and node exporters. Stores time-series data for monitoring and alerting. |
| Visualization | Grafana | Connects to Prometheus and OpenSearch. Provides dashboards for metrics and log exploration. |
Architecture
                            VISUALIZATION
+--------------------------------------------------------------------+
|                              Grafana                               |
|               (Dashboards, Alerts, Log Exploration)                |
+--------------------------------------------------------------------+
                |                                     |
                v                                     v
+-------------------------------+      +-----------------------------+
|          Prometheus           |      |         OpenSearch          |
|       (Metrics Storage)       |      |        (Log Storage)        |
+-------------------------------+      +-----------------------------+
                ^                                     ^
                |                                     |
+-------------------------------+      +-----------------------------+
|         Node Exporter         |      |         Fluent Bit          |
|        (Host Metrics)         |      |      (Log Forwarding)       |
+-------------------------------+      +-----------------------------+
                ^                                     ^
                |                                     |
+--------------------------------------------------------------------+
|                             SAMBASTACK                             |
+--------------------------------------------------------------------+
|  Inference Router  |  Model Deployments  |  Kubernetes Pods/Nodes  |
+--------------------------------------------------------------------+
Design principles
- Modular integration — Each component exposes well-defined interfaces (Fluent Bit outputs, Prometheus remote_write, Grafana data sources). You can replace any component with an equivalent.
- Kubernetes-native — All components run on and integrate with the Kubernetes cluster where SambaStack workloads are deployed.
- Bring-your-own stack — Integrate with your existing log platform, metrics system, or visualization layer.
- Security and compliance ready — Logging, metrics, and audit data can integrate with your existing SIEM and compliance tooling.
Component substitution
You can replace any component with an equivalent from your existing observability platform:
| Component | Replaceable with | Requirements |
|---|---|---|
| OpenSearch | Elasticsearch, Splunk, Loki, Datadog Logs | Must accept logs via HTTP, gRPC, or Kafka |
| Fluent Bit | Fluentd, Vector, Logstash, Datadog Agent | Must support Kubernetes log collection |
| Prometheus | Managed Prometheus, Datadog, New Relic | Must scrape /metrics endpoints or accept remote_write |
| Grafana | Datadog dashboards, Kibana, custom tooling | Must integrate with both metrics and log sources |
Prerequisites
Before deploying the monitoring stack, ensure you have:
| Requirement | Description |
|---|---|
| Kubernetes cluster | A running cluster with SambaStack installed |
| kubectl | Configured with access to your target Kubernetes cluster |
| Helm (latest version) | For deploying all Helm charts |
| jq | For parsing JSON output during verification |
| Storage class | A valid storage class for persistent volumes (OpenSearch and Prometheus) |
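The command-line prerequisites above can be checked before you start. The following is a minimal sketch, not part of SambaStack: the function name `missing_tools` is hypothetical, and it only confirms that each tool resolves on your `PATH`, not that its version or cluster access is correct.

```shell
# Preflight sketch: print the name of every tool that is NOT found on PATH.
# missing_tools is an illustrative helper, not a SambaStack command.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only when the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}
```

Running `missing_tools kubectl helm jq` prints one line per tool that could not be found; empty output means all three are available. Once `kubectl` is configured, `kubectl get storageclass` lists the storage classes the cluster offers for persistent volumes.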
Directory structure
Create this directory structure before starting (the brace expansion requires a shell that supports it, such as bash or zsh):
mkdir -p ~/.sambastack-observability/{opensearch,fluentbit,monitoring}
After completing all deployments, your directory should contain:
~/.sambastack-observability/
|-- monitoring-namespace.yaml
|-- opensearch/
|   |-- opensearch-initial-admin-password-secret.yaml
|   +-- opensearch-values.yaml
|-- fluentbit/
|   |-- append_tags.lua
|   |-- fluentbit-conf.conf
|   +-- fluentbit-values.yml
+-- monitoring/
    |-- grafana-initial-admin-credentials-secret.yaml
    |-- inference-router-sm.yaml
    +-- prometheus-grafana-values.yaml
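To confirm that the layout above is complete after the deployments, a small check along these lines may help. It is a sketch under stated assumptions: the helper names `expected_files` and `missing_files` are hypothetical, and the file list is copied from the tree above.

```shell
# Sketch: report which of the expected files are absent under a base directory.
# expected_files and missing_files are illustrative helper names.
expected_files() {
  cat <<'EOF'
monitoring-namespace.yaml
opensearch/opensearch-initial-admin-password-secret.yaml
opensearch/opensearch-values.yaml
fluentbit/append_tags.lua
fluentbit/fluentbit-conf.conf
fluentbit/fluentbit-values.yml
monitoring/grafana-initial-admin-credentials-secret.yaml
monitoring/inference-router-sm.yaml
monitoring/prometheus-grafana-values.yaml
EOF
}

missing_files() {
  base="$1"
  # Print each relative path that is not a regular file under $base
  expected_files | while IFS= read -r rel; do
    [ -f "$base/$rel" ] || printf '%s\n' "$rel"
  done
}
```

For example, `missing_files ~/.sambastack-observability` prints nothing once every file is in place, and otherwise lists the paths still to be created.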
Deployment order
Deploy components in this order:
| Step | Component | Guide |
|---|---|---|
| 1 | OpenSearch | Log Storage |
| 2 | Fluent Bit | Log Forwarding |
| 3 | Prometheus and Grafana | Monitoring |
Resource requirements
| Component | CPU | Memory | Storage |
|---|---|---|---|
| OpenSearch | 2-4 cores | 4-8 GB | 50-100 GB |
| Fluent Bit (per node) | 100m | 128Mi | — |
| Prometheus | 500m | 2Gi | 30Gi |
| Prometheus Operator | 200m | 256Mi | - |
| Grafana | 250m | 512Mi | — |
| Node Exporter (per node) | 100m | 128Mi | — |
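For rough capacity planning, the figures in the table can be added up for a cluster of a given size. The sketch below is only arithmetic on the table values. It assumes one instance each of OpenSearch, Prometheus, Prometheus Operator, and Grafana, plus one Fluent Bit and one Node Exporter pod per node, and it treats 2-4 cores as 2000-4000 millicores. The function names are hypothetical. OpenSearch memory is left out of the memory total because the table states it in GB rather than Mi.

```shell
# Sketch: aggregate the table's figures for a cluster with N nodes.
# Assumption: single instances of the shared components, per-node pods on every node.

# Prints "LOW HIGH" total CPU in millicores (OpenSearch at 2 and at 4 cores).
cpu_millicores_range() {
  nodes="$1"
  fixed=$((500 + 200 + 250))      # Prometheus + Prometheus Operator + Grafana
  per_node=$((100 + 100))         # Fluent Bit + Node Exporter
  low=$((2000 + fixed + per_node * nodes))
  high=$((4000 + fixed + per_node * nodes))
  echo "$low $high"
}

# Prints total memory in Mi for everything except OpenSearch.
memory_mib_without_opensearch() {
  nodes="$1"
  fixed=$((2048 + 256 + 512))     # Prometheus 2Gi + Operator 256Mi + Grafana 512Mi
  per_node=$((128 + 128))         # Fluent Bit + Node Exporter
  echo $((fixed + per_node * nodes))
}
```

With four nodes, `cpu_millicores_range 4` prints `3750 5750` and `memory_mib_without_opensearch 4` prints `3840`; add 4-8 GB for OpenSearch on top of the memory figure.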