OpenAI compatible API
This document contains reference information for the SambaStudio OpenAI compatible API. It describes the input and output formats of the API, which makes it easy to try out our open-source models with existing applications.
Generic API request header
| Header | Type | Value |
|---|---|---|
|  | Array | When dynamic batching is disabled, the batch of requests sent is processed directly, as opposed to grouping individual requests into batches. We recommend disabling dynamic batching only if you have implemented your own queuing or batching mechanisms. Otherwise, keeping dynamic batching enabled helps optimize performance by grouping smaller requests for more efficient processing. |
Create chat completions
Creates a model response for the given chat conversation.
POST https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions
Request body
The chat request body formats are described below.
Reference
| Parameter | Definition | Type | Values |
|---|---|---|---|
| `model` | The name of the model to query. | String | The expert name. |
| `messages` | A list of messages comprising the conversation so far. | Array of objects | Array of message objects, each containing a `role` (such as `system` or `user`) and `content` (the message text). |
| `max_tokens` | The maximum number of tokens to generate. | Integer | The total length of input tokens and generated tokens is limited by the model’s context length. Default value is the context length of the model. |
| `temperature` | Determines the degree of randomness in the response. | Float | The temperature value can be between |
| `top_p` | The top_p (nucleus) parameter is used to dynamically adjust the number of choices for each predicted token based on the cumulative probabilities. | Float | The top_p value can be between |
| `top_k` | The top_k parameter is used to limit the number of choices for the next predicted word or token. | Integer | The top_k value can be between |
| `stream` | If set, partial message deltas will be sent. | Boolean or null | Default is false. |
| `stream_options` | Options for streaming responses. Only set this when you set `stream: true`. | Object or null | Default is null. Value can be `{"include_usage": true}`. |
| `repetition_penalty` | A parameter that controls how repetitive text can be. A lower value means more repetitive, while a higher value means less repetitive. | Float or null | Default is The repetition penalty value can be between |
Example request
Below is an example request body for a streaming response.
{
  "messages": [
    {"role": "system", "content": "Answer the question in a couple sentences."},
    {"role": "user", "content": "Share a happy story with me"}
  ],
  "max_tokens": 800,
  "model": "Meta-Llama-3.1-8B-Instruct",
  "stream": true,
  "stream_options": {"include_usage": true}
}
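The same request can also be sent without the OpenAI client. Below is a minimal sketch using Python's requests library; it reuses the Content-Type and key headers shown in the Batch API example later in this document, and the temperature, top_p, top_k, and repetition_penalty values are illustrative placeholders rather than recommended settings.

import requests

# URL and API key placeholders follow the format used in the other examples in this document.
url = "https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions"
headers = {
    "Content-Type": "application/json",
    "key": "YOUR ENDPOINT API KEY",
}

# Request body built from the parameters in the reference table above.
# The sampling values here are illustrative placeholders.
payload = {
    "model": "Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"}
    ],
    "max_tokens": 400,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.0,
    "stream": False
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())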
Response
The API returns a chat completion object, or a streamed sequence of chat completion chunk objects if the request is streamed.
Chat completion object
Represents a chat completion response returned by the model, based on the provided input.
Reference
| Property | Type | Description |
|---|---|---|
| `id` | String | A unique identifier for the chat completion. |
| `choices` | Array | A list containing a single chat completion. |
| `created` | Integer | The Unix timestamp (in seconds) of when the chat completion was created. |
| `model` | String | The model used to generate the completion. |
| `object` | String | The object type, which is always `chat.completion`. |
| `usage` | Object | An optional field containing the token usage statistics for the entire request. |
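For orientation, a non-streaming response has roughly the shape sketched below. The choices entries and usage fields follow the standard OpenAI chat completion layout and are shown here as an assumption; the exact fields returned by your endpoint may differ.

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1714508400,
  "model": "Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Once upon a time..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 60, "total_tokens": 85}
}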
Chat completion chunk object
Represents a streamed chunk of a chat completion response returned by the model, based on the provided input.
Reference
| Property | Type | Description |
|---|---|---|
| `id` | String | A unique identifier for the chat completion. |
| `choices` | Array | A list containing a single chat completion chunk. |
| `created` | Integer | The Unix timestamp (in seconds) of when the chat completion was created. Each chunk has the same timestamp. |
| `model` | String | The model used to generate the completion. |
| `object` | String | The object type, which is always `chat.completion.chunk`. |
| `usage` | Object | An optional field, present only when `stream_options: {"include_usage": true}` is set in the request. When present, it contains a null value except for the last chunk, which contains the token usage statistics for the entire request. |
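A streamed chunk has roughly the shape sketched below, again assuming the standard OpenAI chunk layout; the delta contents and usage fields returned by your endpoint may differ. Note that usage stays null on intermediate chunks and is populated only on the final chunk when include_usage is requested.

{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1714508400,
  "model": "Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {"index": 0, "delta": {"role": "assistant", "content": "Once"}, "finish_reason": null}
  ],
  "usage": null
}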
Batch API
You can send a batch of queries in one request using the batch API.
curl --location 'https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions' \
--header 'Content-Type: application/json' \
--header 'key: API Key' \
--data '[
  {
    "model": "Meta-Llama-3-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant that helps with answering questions and providing information."
      },
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "process_prompt": true,
    "max_tokens": 50,
    "stream": true
  },
  {
    "model": "Meta-Llama-3-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant that helps with answering questions and providing information."
      },
      {
        "role": "user",
        "content": "What is the capital of India?"
      }
    ],
    "process_prompt": true,
    "max_tokens": 50,
    "stream": true
  }
]'
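If you prefer Python over curl, the sketch below sends the same batch as a JSON array using the requests library; the URL, headers, and request fields mirror the curl example above.

import requests

url = "https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions"
headers = {
    "Content-Type": "application/json",
    "key": "API Key",
}

# A batch is a JSON array of individual chat completion requests.
questions = ["What is the capital of France?", "What is the capital of India?"]
batch = [
    {
        "model": "Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are an AI assistant that helps with answering questions and providing information."},
            {"role": "user", "content": question}
        ],
        "process_prompt": True,
        "max_tokens": 50,
        "stream": True
    }
    for question in questions
]

response = requests.post(url, headers=headers, json=batch)
# Print the raw response body.
print(response.text)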
Example requests using OpenAI client
Example requests for streaming and non-streaming are shown below.
Streaming
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions",
    api_key="YOUR ENDPOINT API KEY"
)

completion = client.chat.completions.create(
    model="Meta-CodeLlama-70b-Instruct",
    messages=[
        {"role": "system", "content": "You are intelligent"},
        {"role": "user", "content": "Tell me a story in 3 lines"}
    ],
    stream=True
)

for chunk in completion:
    print(chunk.choices[0].delta)
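To also collect token counts while streaming, you can pass the stream_options parameter from the request body reference and read usage from the final chunk, as described in the chat completion chunk object table. The sketch below reuses the client configured above; the accumulation logic is illustrative.

completion = client.chat.completions.create(
    model="Meta-CodeLlama-70b-Instruct",
    messages=[
        {"role": "system", "content": "You are intelligent"},
        {"role": "user", "content": "Tell me a story in 3 lines"}
    ],
    stream=True,
    stream_options={"include_usage": True}
)

full_text = ""
for chunk in completion:
    # Intermediate chunks carry the generated text in choices[0].delta.
    if chunk.choices and chunk.choices[0].delta.content:
        full_text += chunk.choices[0].delta.content
    # Per the chunk object reference above, usage is null except on the final chunk.
    if chunk.usage:
        print(chunk.usage)

print(full_text)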
Non-streaming
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions",
    api_key="YOUR ENDPOINT API KEY"
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"}
    ]
)

print(response.choices[0].message)
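To pull out just the generated text and the usage statistics described in the chat completion object reference, read the corresponding attributes on the response:

# The assistant's reply text and the token usage object from the response above.
print(response.choices[0].message.content)
print(response.usage)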