Reference Architecture Note: This setup uses third-party components (Prometheus, Grafana, etc.). Versions, defaults, and command syntax may change over time. Address any issues not specific to SambaStack to the vendor or project that owns that component.
Prerequisites
Before you begin, ensure the following requirements are met:

- kubectl — Configured with access to your target Kubernetes cluster
- Helm (latest version) — Verify with `helm version`
- jq — For parsing JSON output during verification
- Monitoring namespace — If it does not exist, it will be created in Step 1
- Storage class — A valid storage class for Prometheus persistent storage
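A quick sanity check for these prerequisites (a convenience sketch; output varies by version):

```bash
# Confirm the CLI tools are installed
kubectl version --client
helm version
jq --version

# Verify cluster access and that a usable storage class exists
kubectl get nodes
kubectl get storageclass
```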
Optional prerequisites
- OpenSearch — Required only if you want Grafana to visualize logs. If deploying with OpenSearch integration, complete the OpenSearch deployment first. The `opensearch-initial-admin-password` secret must exist.
Deployment Order: If using the full monitoring stack, deploy in this order: OpenSearch → Fluent Bit → Prometheus/Grafana. If you only need metrics (no log visualization), you can deploy Prometheus/Grafana independently.
Resource requirements
The following are recommended minimum resources for the monitoring stack:

| Component | CPU Request | Memory Request | Storage |
|---|---|---|---|
| Prometheus (per replica) | 500m | 2Gi | 30Gi |
| Grafana | 250m | 512Mi | — |
| Node Exporter (per node) | 100m | 128Mi | — |
| Prometheus Operator | 200m | 256Mi | — |
For larger deployments (100+ nodes or high-cardinality metrics), increase Prometheus memory to 4–8Gi and storage to 50–100Gi. Adjust `retentionSize` accordingly in the values file.

Architecture overview
| Component | Purpose |
|---|---|
| Prometheus | Collects and stores time-series metrics from SambaStack services and cluster nodes |
| Prometheus Operator | Manages Prometheus configuration via Kubernetes CRDs (ServiceMonitor, etc.) |
| Node Exporter | Exposes node/rack-level host metrics for Prometheus to scrape |
| Grafana | Visualizes Prometheus metrics and OpenSearch logs in dashboards |
Data flows
Metrics path:

SambaStack services / Node Exporter → /metrics → Prometheus → Grafana dashboards

Logs path (requires OpenSearch):
Pods / system logs → Fluent Bit → OpenSearch → Grafana (log panels)
Deployment steps
Step 1: Create the monitoring namespace
Skip this step if the namespace already exists from an OpenSearch deployment. Create the namespace configuration file at `~/.sambastack-observability/monitoring-namespace.yaml`:
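A minimal manifest for this file might look like the following:

```yaml
# ~/.sambastack-observability/monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```

Apply it with `kubectl apply -f ~/.sambastack-observability/monitoring-namespace.yaml`.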
Step 2: Add Helm repository
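The kube-prometheus-stack chart is published in the prometheus-community Helm repository:

```bash
# Add the chart repository and refresh the local index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```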
Step 3: Create Grafana admin credentials secret
Create a secret with base64-encoded username and password at `~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml`:
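A sketch of the secret manifest, assuming the Grafana chart's default `admin-user`/`admin-password` keys (replace the example values with your own base64-encoded credentials, generated with `echo -n '<value>' | base64`):

```yaml
# ~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-initial-admin-credentials
  namespace: monitoring
type: Opaque
data:
  admin-user: YWRtaW4=          # base64 of "admin"; replace with your username
  admin-password: Y2hhbmdlbWU=  # base64 of "changeme"; replace with a strong password
```

Apply it with `kubectl apply -f ~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml`.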
Step 4: Create values file
Create the following file at `~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml`. Choose the appropriate configuration based on whether you're integrating with OpenSearch:
- With OpenSearch Integration
- Without OpenSearch (Metrics Only)
Use this configuration if you have deployed OpenSearch and want log visualization in Grafana.
View complete values file (with OpenSearch)
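The complete values files are not reproduced here. As a hedged sketch, the key settings of the metrics-only variant, consistent with the configuration reference later in this page, might look like this; the OpenSearch variant would additionally define an OpenSearch datasource (for example via the Grafana chart's `additionalDataSources` list):

```yaml
# ~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml (excerpt)
prometheus:
  prometheusSpec:
    replicas: 2
    retentionSize: 25GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: <your-storage-class>
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 30Gi
grafana:
  admin:
    existingSecret: grafana-initial-admin-credentials
nodeExporter:
  enabled: true
alertmanager:
  enabled: false
```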
Replace `<your-storage-class>` with your cluster's storage class. To find available storage classes: `kubectl get storageclass`.

Component configuration summary
| Component | Status | Purpose |
|---|---|---|
| Prometheus | ✓ Enabled | Metrics storage and querying |
| Prometheus Operator | ✓ Enabled | Manages Prometheus config via CRDs |
| Grafana | ✓ Enabled | Visualization for metrics and logs |
| Node Exporter | ✓ Enabled | Node/rack-level host metrics |
| Alertmanager | ✗ Disabled | Not enabled in this minimal reference |
Step 5: Install kube-prometheus-stack
Run the Helm install command:
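A sketch of the command, assuming the release name `kube-prometheus-stack` (any release name works, but it determines the names of generated services):

```bash
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values ~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml
```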
Step 6: Create ServiceMonitor for inference router

Prometheus uses a ServiceMonitor to discover and scrape metrics from the SambaStack inference router. Create the following file at `~/.sambastack-observability/monitoring/inference-router-sm.yaml`:
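The exact selector depends on how your inference router Service is labeled. The manifest below is a hedged sketch: the target namespace, the `app: inference-router` label, and the `metrics` port name are assumptions to replace with your deployment's actual values.

```yaml
# ~/.sambastack-observability/monitoring/inference-router-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-router-sm
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Operator's ServiceMonitor selector
spec:
  namespaceSelector:
    matchNames:
      - sambastack                   # assumed namespace of the inference router
  selector:
    matchLabels:
      app: inference-router          # assumed label on the router Service
  endpoints:
    - port: metrics                  # assumed name of the port exposing /metrics
      path: /metrics
      interval: 30s
```

Apply it with `kubectl apply -f ~/.sambastack-observability/monitoring/inference-router-sm.yaml`.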
Verification
Check pod status
Check Prometheus pods:
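A typical check (exact pod names depend on the release name):

```bash
# Expect 2 Prometheus replicas, Grafana, the Operator,
# and one Node Exporter pod per node
kubectl get pods -n monitoring
```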
Access the UIs

- Prometheus — http://localhost:9090
- Grafana — http://localhost:8080
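A port-forwarding sketch, assuming the release name `kube-prometheus-stack` from Step 5 (service names derive from the release name; verify with `kubectl get svc -n monitoring`):

```bash
# Prometheus UI -> http://localhost:9090 (run in its own terminal)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090

# Grafana UI -> http://localhost:8080 (run in a second terminal)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 8080:80
```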
Retrieve Grafana credentials
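The credentials come from the secret created in Step 3. A decoding sketch using `jq` (key names assume the `admin-user`/`admin-password` convention from that step):

```bash
# Username
kubectl -n monitoring get secret grafana-initial-admin-credentials -o json \
  | jq -r '.data["admin-user"]' | base64 -d; echo

# Password
kubectl -n monitoring get secret grafana-initial-admin-credentials -o json \
  | jq -r '.data["admin-password"]' | base64 -d; echo
```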
Verify Prometheus targets
- Access Prometheus UI at http://localhost:9090
- Navigate to Status → Targets
- Verify `node-exporter` targets show as UP
- Verify `inference-router-sm` target shows as UP (if SambaStack is running)
Verify Grafana datasources
- Log in to Grafana at http://localhost:8080
- Navigate to Connections → Data sources
- Verify Prometheus datasource shows “Data source is working”
- If using OpenSearch integration, verify OpenSearch-Logs datasource shows “Data source is working”
Success criteria
The installation is complete when:

- Prometheus pods (2 replicas) are running in the `monitoring` namespace
- Grafana pod is running and accessible
- Node Exporter pods are running on all nodes
- Prometheus shows Node Exporter targets as UP
- Grafana accepts the configured admin credentials
- Prometheus datasource in Grafana shows “Data source is working”
- (If configured) OpenSearch datasource in Grafana shows “Data source is working”
Import Node Exporter dashboard
To visualize node/rack-level metrics from Node Exporter, import the dashboard in Grafana: navigate to Dashboards → Import, enter dashboard ID 1860, and select your Prometheus datasource (menu paths may vary slightly by Grafana version).
This dashboard includes: CPU usage, memory usage, disk I/O, network metrics, and node health.
Dashboard ID 1860 is the community “Node Exporter Full” dashboard. This is ideal for per-rack visibility in SambaStack deployments.
Configuration reference
| Parameter | Default | Description |
|---|---|---|
| `prometheus.prometheusSpec.replicas` | 2 | Number of Prometheus replicas for HA |
| `prometheus.prometheusSpec.retentionSize` | 25GB | Maximum storage before the oldest data is deleted |
| `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage` | 30Gi | PVC size for Prometheus data |
| `grafana.admin.existingSecret` | grafana-initial-admin-credentials | Secret containing admin credentials |
| `nodeExporter.enabled` | true | Deploy Node Exporter DaemonSet |
Troubleshooting
Prometheus pods stuck in Pending
Symptom: Prometheus pods remain in `Pending` status.
Cause: PersistentVolumeClaim cannot be fulfilled.
Solution:
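Check the PVCs and the storage class (a diagnostic sketch):

```bash
# Inspect the PVCs created for Prometheus and why they are unbound
kubectl get pvc -n monitoring
kubectl describe pvc -n monitoring | grep -A5 Events

# Confirm the storage class referenced in the values file actually exists
kubectl get storageclass
```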
Grafana shows “Data source is not working” for OpenSearch
Symptom: The OpenSearch datasource test fails in Grafana.

Possible causes:

- OpenSearch not deployed: Deploy OpenSearch first. See Log Storage - OpenSearch.
- Secret missing: Verify the secret exists (see the commands below).
- OpenSearch not ready: Check OpenSearch pod status (see the commands below).
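Diagnostic commands for the last two causes (the pod label selector is an assumption; adjust it to match your OpenSearch deployment):

```bash
# Verify the OpenSearch admin secret exists in the monitoring namespace
kubectl get secret opensearch-initial-admin-password -n monitoring

# Check OpenSearch pod status
kubectl get pods -n monitoring -l app.kubernetes.io/name=opensearch
```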
ServiceMonitor not scraping targets
Symptom: Custom ServiceMonitor targets don't appear in Prometheus.

Solution: Verify the ServiceMonitor is in the `monitoring` namespace and its labels match the Operator's selector:
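A verification sketch; by default the chart selects ServiceMonitors labeled with the Helm release name:

```bash
# Confirm the ServiceMonitor exists in the monitoring namespace
kubectl get servicemonitor -n monitoring

# Compare its labels against the selector on the Prometheus resource
kubectl get servicemonitor inference-router-sm -n monitoring -o yaml | grep -A3 labels
kubectl get prometheus -n monitoring -o yaml | grep -A3 serviceMonitorSelector
```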
Node Exporter pods not running on all nodes
Symptom: Fewer Node Exporter pods than cluster nodes.

Cause: Nodes may have taints that prevent scheduling.

Solution: Add tolerations to the Node Exporter configuration (see the sketch below) or remove the node taints.
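For example, a blanket toleration in the values file might look like the following. This assumes the Node Exporter is the chart's `prometheus-node-exporter` subchart; narrow the toleration to your actual taint keys in production:

```yaml
# prometheus-grafana-values.yaml (excerpt): illustrative blanket toleration
prometheus-node-exporter:
  tolerations:
    - operator: Exists   # tolerates all taints; scope to specific keys in production
```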
Next steps

After Prometheus and Grafana are running:

- Create custom dashboards — Build dashboards for SambaStack-specific metrics like inference latency, QPS, and accelerator utilization.
- Add more ServiceMonitors — Create ServiceMonitors for other SambaStack components that expose Prometheus metrics.
- Enable alerting — Configure Alertmanager for production monitoring. Update the values file to set `alertmanager.enabled: true`.
- Explore logs — If OpenSearch and Fluent Bit are deployed, use Grafana's Explore feature to query the `logs-7d` index.
Cleanup
To remove the monitoring stack from your cluster, uninstall the Helm release:
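A sketch, assuming the release name `kube-prometheus-stack` used in Step 5:

```bash
helm uninstall kube-prometheus-stack --namespace monitoring

# Note: PVCs created for Prometheus typically survive uninstall;
# list and delete them manually if you want the data gone
kubectl get pvc -n monitoring
```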
If you're removing the entire monitoring stack including OpenSearch and Fluent Bit, you can delete the entire namespace with `kubectl delete namespace monitoring`. This removes all resources in the namespace and is irreversible.