RDU manifest events
RDU manifest events are structured logs emitted per request by the model runtime. They contain token counts, high-level latencies, and a set of detailed timing fields. These events are typically indexed into a log index (for example, an OpenSearch index) and can be filtered by fields such as model, tenant, pod, and time range.
RDU manifest fields
| Field | Category | Key | Description |
|---|---|---|---|
| Prompt tokens | Tokens | prompt_tokens_count | Number of input tokens in the prompt. |
| Completion tokens | Tokens | completion_tokens_count | Number of output tokens generated. |
| Total latency | Latency | total_latency | End-to-end time from request start to last token (includes queue + compute). |
| Time to first token (TTFT) | Latency | time_to_first_token | Time from request submission to first token. |
| Completion tokens per second | Throughput | completion_tokens_per_sec | Effective throughput over the entire completion. |
| Tokens/sec after first token | Throughput | completion_tokens_after_first_per_sec | Decode throughput after first token (steady-state). |
| Acceptance rate | Spec decoding | acceptance_rate | Acceptance rate for speculative decoding. |
| Decode queue time | Internal timing | decode_queue_time | Time spent in decode-related queues (e.g. continuous batching queues). |
| Tensor transfer time | Internal timing | tensor_transfer_time | Time spent transferring tensors between components. |
| Cache transfer time | Internal timing | cache_transfer_time | Time spent transferring cache (e.g. KV cache). |
Additional internal timing fields (~15) are available for fine-grained analysis of execution stages. Contact your SambaNova representative for details.
These fields belong to logging events and are subject to schema evolution.
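To illustrate how the latency and token fields relate, here is a minimal sketch that derives steady-state decode throughput from a manifest event. The field keys match the table above; the event values and the helper function are hypothetical, and the formula assumes decode time is total latency minus time to first token.

```python
import json

# Hypothetical RDU manifest event. Field keys follow the table above;
# the values are invented for illustration only.
event = json.loads("""{
  "model": "example-model",
  "prompt_tokens_count": 512,
  "completion_tokens_count": 128,
  "total_latency": 2.50,
  "time_to_first_token": 0.40
}""")

def decode_tokens_per_sec(evt):
    """Approximate tokens/sec after the first token.

    Assumes decode time = total_latency - time_to_first_token, and that
    the first token is excluded from the steady-state count.
    """
    decode_time = evt["total_latency"] - evt["time_to_first_token"]
    return (evt["completion_tokens_count"] - 1) / decode_time

print(round(decode_tokens_per_sec(event), 1))  # -> 60.5
```

In practice, compare this derived value against the reported completion_tokens_after_first_per_sec field to sanity-check an event.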
Example queries
Examples assume a log backend that supports a query language (e.g., OpenSearch or Loki) and timestamps on each event.

p95 total latency per model (last 15 minutes)
- Filter: model:"<model_name>" AND @timestamp:[now-15m TO now]
- Aggregate: percentile 95 on total_latency, grouped by model.

TTFT and total latency distributions per model
- Group by model, compute p50/p90 for both time_to_first_token and total_latency.

High decode queue time
- Filter: decode_queue_time > <threshold>
- Group by model or tenant to identify where queueing is highest.

Acceptance rate over time
- Filter: model:"<model_name>"
- Aggregate: average acceptance_rate over time.
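As a concrete example, the first query above can be expressed as an OpenSearch aggregation body. This is a sketch, assuming an OpenSearch-style backend with a keyword-mapped model field and an @timestamp field; adjust index and field names to your schema.

```python
import json

# Sketch of an OpenSearch query body: p95 total_latency per model,
# restricted to the last 15 minutes. Field names are assumptions
# based on the manifest fields table; verify against your index mapping.
query = {
    "size": 0,  # aggregations only, no raw hits
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-15m", "lte": "now"}}}
            ]
        }
    },
    "aggs": {
        "by_model": {
            "terms": {"field": "model"},
            "aggs": {
                "p95_latency": {
                    "percentiles": {"field": "total_latency", "percents": [95]}
                }
            }
        }
    },
}

print(json.dumps(query, indent=2))
```

Posting this body to the index's _search endpoint returns one p95 value per model bucket; swapping the percentiles aggregation for avg over acceptance_rate yields the last query in the list above.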
Related topics
- Monitoring and Observability – High-level telemetry breakdown and hierarchy.
- Metrics – Router-level metrics reference.
