Reference Architecture Note: This setup uses third-party components (Prometheus, Grafana, etc.). Versions, defaults, and command syntax may change over time. Address any issues not specific to SambaStack to the vendor or project that owns that component.
Prerequisites
Before you begin, ensure the following requirements are met:

- kubectl — Configured with access to your target Kubernetes cluster
- Helm (latest version) — Verify with `helm version`
- jq — For parsing JSON output during verification
- Monitoring namespace — If it does not exist, it will be created in Step 1
- Storage class — A valid storage class for Prometheus persistent storage
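A quick sanity check for these prerequisites (a convenience sketch; output varies by version):

```bash
# Confirm the CLI tools are installed
kubectl version --client
helm version
jq --version

# Verify cluster access and that a usable storage class exists
kubectl get nodes
kubectl get storageclass
```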
Optional prerequisites
- OpenSearch — Required only if you want Grafana to visualize logs. If deploying with OpenSearch integration, complete the OpenSearch deployment first. The `opensearch-initial-admin-password` secret must exist.
Deployment Order: If using the full monitoring stack, deploy in this order: OpenSearch → Fluent Bit → Prometheus/Grafana. If you only need metrics (no log visualization), you can deploy Prometheus/Grafana independently.
Resource requirements
The following are recommended minimum resources for the monitoring stack:

| Component | CPU Request | Memory Request | Storage |
|---|---|---|---|
| Prometheus (per replica) | 500m | 2Gi | 30Gi |
| Grafana | 250m | 512Mi | — |
| Node Exporter (per node) | 100m | 128Mi | — |
| Prometheus Operator | 200m | 256Mi | — |
For larger deployments (100+ nodes or high-cardinality metrics), increase Prometheus memory to 4–8Gi and storage to 50–100Gi. Adjust `retentionSize` accordingly in the values file.

Architecture overview
| Component | Purpose |
|---|---|
| Prometheus | Collects and stores time-series metrics from SambaStack services and cluster nodes |
| Prometheus Operator | Manages Prometheus configuration via Kubernetes CRDs (ServiceMonitor, etc.) |
| Node Exporter | Exposes node/rack-level host metrics for Prometheus to scrape |
| Grafana | Visualizes Prometheus metrics and OpenSearch logs in dashboards |
Data flows
Metrics path:

SambaStack services / Node Exporter → /metrics → Prometheus → Grafana dashboards

Logs path (requires OpenSearch):
Pods / system logs → Fluent Bit → OpenSearch → Grafana (log panels)
Deployment steps
Step 1: Create the monitoring namespace
Skip this step if the namespace already exists from an OpenSearch deployment. Create the namespace configuration file at `~/.sambastack-observability/monitoring-namespace.yaml`:
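A minimal manifest for this file might look like the following:

```yaml
# ~/.sambastack-observability/monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```

Apply it with `kubectl apply -f ~/.sambastack-observability/monitoring-namespace.yaml`.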
Step 2: Add Helm repository
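The kube-prometheus-stack chart is published in the prometheus-community Helm repository:

```bash
# Add the chart repository and refresh the local index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```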
Step 3: Create Grafana admin credentials secret
Create a secret with base64-encoded username and password at `~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml`:
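A sketch of the secret manifest, assuming the Grafana chart's default `admin-user`/`admin-password` keys (replace the example values with your own base64-encoded credentials, generated with `echo -n '<value>' | base64`):

```yaml
# ~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-initial-admin-credentials
  namespace: monitoring
type: Opaque
data:
  admin-user: YWRtaW4=          # base64 of "admin"; replace with your username
  admin-password: Y2hhbmdlbWU=  # base64 of "changeme"; replace with a strong password
```

Apply it with `kubectl apply -f ~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml`.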
Step 4: Create values file
Create the following file at `~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml`. Choose the appropriate configuration based on whether you're integrating with OpenSearch:
- With OpenSearch Integration
- Without OpenSearch (Metrics Only)
Use this configuration if you have deployed OpenSearch and want log visualization in Grafana.
View complete values file (with OpenSearch)
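The complete values files are not reproduced here. As a hedged sketch, the key settings of the metrics-only variant, consistent with the configuration reference later in this page, might look like this; the OpenSearch variant would additionally define an OpenSearch datasource (for example via the Grafana chart's `additionalDataSources` list):

```yaml
# ~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml (excerpt)
prometheus:
  prometheusSpec:
    replicas: 2
    retentionSize: 25GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: <your-storage-class>
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 30Gi
grafana:
  admin:
    existingSecret: grafana-initial-admin-credentials
nodeExporter:
  enabled: true
alertmanager:
  enabled: false
```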
Replace `<your-storage-class>` with your cluster's storage class. To find available storage classes: `kubectl get storageclass`.

Component configuration summary
| Component | Status | Purpose |
|---|---|---|
| Prometheus | ✓ Enabled | Metrics storage and querying |
| Prometheus Operator | ✓ Enabled | Manages Prometheus config via CRDs |
| Grafana | ✓ Enabled | Visualization for metrics and logs |
| Node Exporter | ✓ Enabled | Node/rack-level host metrics |
| Alertmanager | ✗ Disabled | Not enabled in this minimal reference |
Step 5: Install kube-prometheus-stack
Run the Helm install command:
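A sketch of the command, assuming the release name `kube-prometheus-stack` (any release name works, but it determines the names of generated services):

```bash
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values ~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml
```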
Step 6: Create ServiceMonitor for inference router

Prometheus uses a ServiceMonitor to discover and scrape metrics from the SambaStack inference router. Create the following file at `~/.sambastack-observability/monitoring/inference-router-sm.yaml`:
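The exact selector depends on how your inference router Service is labeled. The manifest below is a hedged sketch: the target namespace, the `app: inference-router` label, and the `metrics` port name are assumptions to replace with your deployment's actual values.

```yaml
# ~/.sambastack-observability/monitoring/inference-router-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-router-sm
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Operator's ServiceMonitor selector
spec:
  namespaceSelector:
    matchNames:
      - sambastack                   # assumed namespace of the inference router
  selector:
    matchLabels:
      app: inference-router          # assumed label on the router Service
  endpoints:
    - port: metrics                  # assumed name of the port exposing /metrics
      path: /metrics
      interval: 30s
```

Apply it with `kubectl apply -f ~/.sambastack-observability/monitoring/inference-router-sm.yaml`.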
Verification
Check pod status
Check Prometheus pods:
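A typical check (exact pod names depend on the release name):

```bash
# Expect 2 Prometheus replicas, Grafana, the Operator,
# and one Node Exporter pod per node
kubectl get pods -n monitoring
```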
Access the UIs

- Prometheus — http://localhost:9090
- Grafana — http://localhost:8080
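A port-forwarding sketch, assuming the release name `kube-prometheus-stack` from Step 5 (service names derive from the release name; verify with `kubectl get svc -n monitoring`):

```bash
# Prometheus UI -> http://localhost:9090 (run in its own terminal)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090

# Grafana UI -> http://localhost:8080 (run in a second terminal)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 8080:80
```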
Retrieve Grafana credentials
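The credentials come from the secret created in Step 3. A decoding sketch using `jq` (key names assume the `admin-user`/`admin-password` convention from that step):

```bash
# Username
kubectl -n monitoring get secret grafana-initial-admin-credentials -o json \
  | jq -r '.data["admin-user"]' | base64 -d; echo

# Password
kubectl -n monitoring get secret grafana-initial-admin-credentials -o json \
  | jq -r '.data["admin-password"]' | base64 -d; echo
```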
Verify Prometheus targets
- Access Prometheus UI at http://localhost:9090
- Navigate to Status → Targets
- Verify `node-exporter` targets show as UP
- Verify `inference-router-sm` target shows as UP (if SambaStack is running)
Verify Grafana datasources
- Log in to Grafana at http://localhost:8080
- Navigate to Connections → Data sources
- Verify Prometheus datasource shows “Data source is working”
- If using OpenSearch integration, verify OpenSearch-Logs datasource shows “Data source is working”
Success criteria
The installation is complete when:

- Prometheus pods (2 replicas) are running in the `monitoring` namespace
- Grafana pod is running and accessible
- Node Exporter pods are running on all nodes
- Prometheus shows Node Exporter targets as UP
- Grafana accepts the configured admin credentials
- Prometheus datasource in Grafana shows “Data source is working”
- (If configured) OpenSearch datasource in Grafana shows “Data source is working”
Import Node Exporter dashboard
To visualize node/rack-level metrics from Node Exporter, import the dashboard in Grafana: navigate to Dashboards → Import, enter dashboard ID 1860, and select your Prometheus datasource (menu paths may vary slightly by Grafana version).
This dashboard includes: CPU usage, memory usage, disk I/O, network metrics, and node health.
Dashboard ID 1860 is the community “Node Exporter Full” dashboard. This is ideal for per-rack visibility in SambaStack deployments.
Configuration reference
| Parameter | Default | Description |
|---|---|---|
| `prometheus.prometheusSpec.replicas` | 2 | Number of Prometheus replicas for HA |
| `prometheus.prometheusSpec.retentionSize` | 25GB | Maximum storage before the oldest data is deleted |
| `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage` | 30Gi | PVC size for Prometheus data |
| `grafana.admin.existingSecret` | grafana-initial-admin-credentials | Secret containing admin credentials |
| `nodeExporter.enabled` | true | Deploy Node Exporter DaemonSet |
Troubleshooting
Prometheus pods stuck in Pending
Symptom: Prometheus pods remain in `Pending` status.
Cause: PersistentVolumeClaim cannot be fulfilled.
Solution:
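Check the PVCs and the storage class (a diagnostic sketch):

```bash
# Inspect the PVCs created for Prometheus and why they are unbound
kubectl get pvc -n monitoring
kubectl describe pvc -n monitoring | grep -A5 Events

# Confirm the storage class referenced in the values file actually exists
kubectl get storageclass
```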
Grafana shows “Data source is not working” for OpenSearch
Symptom: The OpenSearch datasource test fails in Grafana.

Possible causes:

- OpenSearch not deployed: Deploy OpenSearch first. See Log Storage - OpenSearch.
- Secret missing: Verify the secret exists (see the commands below).
- OpenSearch not ready: Check OpenSearch pod status (see the commands below).
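Diagnostic commands for the last two causes (the pod label selector is an assumption; adjust it to match your OpenSearch deployment):

```bash
# Verify the OpenSearch admin secret exists in the monitoring namespace
kubectl get secret opensearch-initial-admin-password -n monitoring

# Check OpenSearch pod status
kubectl get pods -n monitoring -l app.kubernetes.io/name=opensearch
```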
ServiceMonitor not scraping targets
Symptom: Custom ServiceMonitor targets don't appear in Prometheus.

Solution: Verify the ServiceMonitor is in the `monitoring` namespace and its labels match the Operator's selector:
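A verification sketch; by default the chart selects ServiceMonitors labeled with the Helm release name:

```bash
# Confirm the ServiceMonitor exists in the monitoring namespace
kubectl get servicemonitor -n monitoring

# Compare its labels against the selector on the Prometheus resource
kubectl get servicemonitor inference-router-sm -n monitoring -o yaml | grep -A3 labels
kubectl get prometheus -n monitoring -o yaml | grep -A3 serviceMonitorSelector
```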
Node Exporter pods not running on all nodes
Symptom: Fewer Node Exporter pods than cluster nodes.

Cause: Nodes may have taints that prevent scheduling.

Solution: Add tolerations to the Node Exporter configuration (see the sketch below) or remove the node taints.
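For example, a blanket toleration in the values file might look like the following. This assumes the Node Exporter is the chart's `prometheus-node-exporter` subchart; narrow the toleration to your actual taint keys in production:

```yaml
# prometheus-grafana-values.yaml (excerpt): illustrative blanket toleration
prometheus-node-exporter:
  tolerations:
    - operator: Exists   # tolerates all taints; scope to specific keys in production
```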
Next steps

After Prometheus and Grafana are running:

- Create custom dashboards — Build dashboards for SambaStack-specific metrics like inference latency, QPS, and accelerator utilization.
- Add more ServiceMonitors — Create ServiceMonitors for other SambaStack components that expose Prometheus metrics.
- Enable alerting — Configure Alertmanager for production monitoring. Update the values file to set `alertmanager.enabled: true`.
- Explore logs — If OpenSearch and Fluent Bit are deployed, use Grafana's Explore feature to query the `logs-7d` index.
Cleanup
To remove the monitoring stack from your cluster, uninstall the Helm release:
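A sketch, assuming the release name `kube-prometheus-stack` used in Step 5:

```bash
helm uninstall kube-prometheus-stack --namespace monitoring

# Note: PVCs created for Prometheus typically survive uninstall;
# list and delete them manually if you want the data gone
kubectl get pvc -n monitoring
```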
If you're removing the entire monitoring stack including OpenSearch and Fluent Bit, you can delete the entire namespace with `kubectl delete namespace monitoring`. This removes all resources in the namespace and is irreversible.