Endpoint monitoring

SambaStudio provides a monitoring dashboard (Grafana) that displays metric information for all endpoints in a selected Tenant. This allows you to monitor the performance of SambaStudio endpoints.

Open the Grafana dashboard

An organization administrator (OrgAdmin) or tenant administrator (TenantAdmin) can access the Grafana dashboard by clicking Monitoring in the left menu.

Figure 1. Monitoring menu
All user roles can access the Grafana dashboard from the Endpoints table or Endpoint window of endpoints that they have created or have been assigned access.

Figure 2. Endpoint monitoring from Endpoint table

Figure 3. Endpoint monitoring from the Endpoint window

Grafana dashboard

The Grafana dashboard will open in a new browser window. It displays information related to the selected tenant of SambaStudio.

You can use the drop-down Project and Endpoint filter options in the top bar to display corresponding information. From the right top drop-down, you can select a time range.

Please be aware of the following when interacting with the Grafana dashboard:

The dashboard will display artifact information for the past 60 days. This includes deleted projects and endpoints.
The Time to first token (TTFT), Execution time, and Queue time metrics will only be visible in the dashboard if there is more than one request (or one batch of requests) in the selected time range. This is because the platform is designed to highlight changes in metrics over time. With only a single data point, these metrics cannot effectively be displayed.
The first three values (one for each high availability instance) of the Response time and Total requests metrics will not be visible after cluster downtime.
Some metrics, such as Queue time, Execution time and Time to first token (TTFT), are only available for the following endpoints:
- Llama 2 7B (with dynamic batching)
- Llama 2 13B (with dynamic batching)
- Llama 2 70B (with dynamic batching)
- Mistral 7B
- Samba-1 Turbo
Metrics are only available for Generic API, not NLP API language models.

The table below provides an overview of several key metrics in the Grafana dashboard.

Metric	Definition
Response time (P50, 90, 95)	The time, in seconds, spent receiving a request, generating the response, and returning a response to the user. For streaming models this is the total time until the final response is sent. For single output models, this is the time to return the response. Percentile 50, 90, and 95 values are available in the dashboard.
Queue time (P50, 90, 95)	The time, in seconds, spent in the queue before the request goes to the model. Percentile 50, 90, and 95 values are available in the dashboard.
Execution time (P50, 90, 95)	The time, in seconds, spent by the model generating the response. For streaming models, this is the total time until the final response is sent. For single output models, this is the time to return the response. Percentile 50, 90, and 95 values are available in the dashboard.
Error rate	The percentage of requests that result in an error.
Time to first token (TTFT)	The latency to the first token (including the queue time). Only available for stream API calls.
Total successful predictions (per second, per minute, per day)	The total number of predictions generated by the model.

Metric

Definition

Response time (P50, 90, 95)

The time, in seconds, spent receiving a request, generating the response, and returning a response to the user. For streaming models this is the total time until the final response is sent. For single output models, this is the time to return the response. Percentile 50, 90, and 95 values are available in the dashboard.

Queue time (P50, 90, 95)

The time, in seconds, spent in the queue before the request goes to the model. Percentile 50, 90, and 95 values are available in the dashboard.

Execution time (P50, 90, 95)

The time, in seconds, spent by the model generating the response. For streaming models, this is the total time until the final response is sent. For single output models, this is the time to return the response. Percentile 50, 90, and 95 values are available in the dashboard.

Error rate

The percentage of requests that result in an error.

Time to first token (TTFT)

The latency to the first token (including the queue time). Only available for stream API calls.

Total successful predictions (per second, per minute, per day)

The total number of predictions generated by the model.

Figure 4. Example Grafana dashboard