- Metrics (Prometheus) – Router-level metrics such as traffic, latency, queueing, and worker state. See Metrics.
- Logs (Logging Events / Manifest Events) – Detailed per-request execution events from the model runtime and other services. See Logs.
Telemetry data generally consists of three types: Metrics (numeric time series for aggregation and alerting), Logs (discrete events for debugging and forensics), and Traces (request path records for latency analysis and root cause investigation).
Monitoring stack overview
SambaStack includes data and metadata designed to help operators observe and troubleshoot all aspects of their AI inference workloads running on SambaNova racks. The monitoring stack is responsible for collecting, storing, and visualizing:
- System and application logs (control plane and data plane)
- Audit events and access traces
- Usage metrics (QPS, latency, queue time, memory utilization)
- User activity (active users, sessions)
- Health and availability signals (node status, pod status, model health)
Reference architecture
The reference architecture described here is SambaNova’s suggested implementation, but is completely optional. SambaNova provides an example of a default monitoring stack based on widely used open-source tools. Many customers already have mature monitoring solutions. The SambaStack monitoring architecture is modular, so you can:
- Adopt the full stack as provided, or
- Swap individual components with equivalents from your existing observability platform (Splunk, Datadog, Elasticsearch, New Relic, etc.).
Reference architectures are constructed using numerous third-party products. There is no guarantee that they will be updated in lockstep with version or command syntax changes in those third-party products. Errors that do not originate in SambaStack itself should be raised with the vendor of the affected component.
Components
SambaStack’s reference monitoring stack uses four primary components:

| Component | Tool | Description |
|---|---|---|
| Log Forwarder and Processor | Fluent Bit | Collects logs from Kubernetes (pods, nodes, system services). Parses, enriches, and forwards logs to a log backend (e.g., OpenSearch or your existing log platform). |
| Log Storage and Search | OpenSearch | Stores logs, audit trails, and structured events at scale. Provides search, filtering, aggregation, and dashboards for log data. Acts as the canonical source of truth for log and audit history in the reference architecture. |
| Metrics Collection and Alerting | Prometheus | Scrapes metrics from SambaStack services, Kubernetes components, and node exporters. Stores time-series metrics for performance, capacity, and health monitoring. Serves as the primary source for alerting rules (through Prometheus or Alertmanager). |
| Dashboards and Visualization | Grafana | Connects to Prometheus and OpenSearch (or your equivalents). Provides pre-built dashboards and can integrate with your SSO/IdP for role-based access to monitoring views. |
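
To illustrate how a dashboard or script can read from the metrics component, the following is a minimal sketch that builds a Prometheus instant-query URL (`GET /api/v1/query`) and parses the JSON response into a dictionary. The host name `prometheus.example`, the `job="router"` label, and the sample response are assumptions made for illustration; they are not SambaStack defaults. The sketch deliberately does no network I/O, so it can be tried against a saved response.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urlencode

# Hypothetical response body, shaped like a Prometheus instant-query result.
SAMPLE_RESPONSE = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"__name__":"up","job":"router"},"value":[1700000000.0,"1"]}]}}'
)


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: str) -> Dict[Tuple[Tuple[str, str], ...], float]:
    """Map each series' sorted label pairs to its sample value."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']!r}")
    samples = {}
    for series in data["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        samples[labels] = float(value)
    return samples
```

In practice the URL would be fetched with an HTTP client of your choice, subject to whatever authentication your Prometheus deployment requires, and the response text passed to `parse_instant_vector`.
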
Design principles
The reference architecture is designed around a few core principles:
- Modular integration – Each component exposes well-defined interfaces (e.g., Fluent Bit outputs, Prometheus remote_write, Grafana data sources). You can replace any component with an equivalent that provides the same interface.
- Kubernetes-native – All components are designed to run on, integrate with, or observe your Kubernetes cluster(s) where SambaStack workloads are deployed.
- Bring-your-own stack friendly – If you already have:
  - A centralized log platform → integrate Fluent Bit outputs with it.
  - A metrics/TSDB system → use Prometheus as a scrape endpoint or replace it with your own collector.
  - An existing visualization layer → connect it directly to OpenSearch/Prometheus or replace Grafana entirely.
- Security and compliance ready – Logging, metrics, and audit data can be integrated with your existing SIEM and compliance tooling.
Component substitution
You can replace any reference architecture component with an equivalent from your existing observability platform.

| Component | Replaceable with | Requirements |
|---|---|---|
| Log Storage and Search (OpenSearch) | Elasticsearch, Splunk, Loki, Datadog Logs, or your SIEM | Must accept logs over a protocol supported by Fluent Bit (HTTP, gRPC, Kafka, etc.) and support your retention and compliance needs. |
| Log Forwarder and Processor (Fluent Bit) | Fluentd, Vector, Logstash, Datadog Agent | Must support Kubernetes log collection and be able to send logs to your selected log storage platform. |
| Metrics Collection (Prometheus) | Managed Prometheus services, Datadog/New Relic agents, internal TSDBs | Must be able to scrape or receive /metrics endpoints or accept Prometheus remote_write. |
| Dashboards and Visualization (Grafana) | Datadog dashboards, Kibana, custom internal tooling | Must integrate with both metrics and log sources to offer equivalent visibility. |
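
The metrics requirement above means a replacement collector has to understand what a `/metrics` endpoint returns, which is the Prometheus text exposition format. The following is a minimal sketch of a parser for that format, intended only to show the shape of the data. The metric names `router_requests_total` and `router_queue_depth` are hypothetical and are not taken from SambaStack. The parser is simplified: it does not unescape label values and assumes label values contain no `}` followed by whitespace.

```python
import re
from typing import Dict, List, Tuple

# name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output used for illustration only.
SAMPLE_SCRAPE = """\
# HELP router_requests_total Hypothetical request counter.
# TYPE router_requests_total counter
router_requests_total{model="demo",code="200"} 1027
router_queue_depth 3
"""


def parse_metrics(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Return (metric name, labels, value) for every sample line."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/comment lines
        match = _SAMPLE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, label_blob, value, _timestamp = match.groups()
        labels = dict(_LABEL.findall(label_blob or ""))
        samples.append((name, labels, float(value)))  # float() accepts NaN, +Inf
    return samples
```

A production collector should rely on a maintained client library or agent rather than a hand-written parser; the sketch only shows what "scrape or receive /metrics endpoints" implies for the data a substitute must handle.
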
