API V2

This document contains SambaStudio API version 2 (V2) reference information. It describes input and output formats for the new predict V2 API.

Generic API language models

This section describes how to create a model response for a given chat conversation or text prompt using the Generic V2 API.

API Type   HTTP Method   Endpoint

Predict    POST          https://<your-sambastudio-domain>/api/v2/predict/generic/<project-id>/<endpoint-id>

Stream     POST          https://<your-sambastudio-domain>/api/v2/predict/generic/stream/<project-id>/<endpoint-id>

Generic API request header

Header: ss-sn-options

Type: Array

Value: disable-dynamic-batch. Use this header option to bypass dynamic batching, ensuring that the batch you provide is sent directly to the model for execution without any splitting or merging of requests.

When dynamic batching is disabled, the batch of requests sent is processed directly, as opposed to grouping individual requests into batches. We recommend disabling dynamic batching only if you have implemented your own queuing or batching mechanisms. Otherwise, keeping dynamic batching enabled helps optimize performance by grouping smaller requests for more efficient processing.
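As a sketch, a request that disables dynamic batching might be built as below. The header value format and the use of a key header for authentication are assumptions for illustration; they are not confirmed by this reference.

```python
# Sketch: building a predict request that bypasses dynamic batching.
# The "key" auth header and the plain string value of ss-sn-options
# are assumptions, not confirmed by this reference.
import json
import urllib.request


def build_predict_request(url: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request with dynamic batching disabled."""
    headers = {
        "Content-Type": "application/json",
        "key": api_key,                           # assumed auth header
        "ss-sn-options": "disable-dynamic-batch",  # bypass dynamic batching
    }
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )
```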

Generic API request body

Attribute: items

Type: Array

Description: A list of items to be passed for prediction. Each item contains:

  • id: A unique id for the item.

    • If ids are not passed, the index of the prompt in this array is used as the id.

    • The id must not contain the # character.

  • value: The actual prompt to run prediction on.

Attribute: params

Type: JSON object

Description: The tuning parameters to use, specified as key-value pairs. Available params depend on the model.

select_expert

Used only for CoE models.
type: This is always str.
value: The expert model to select for the request.

max_tokens_to_generate

type: This is always int.
value: The number of tokens to generate. The total length of tokens (prompt + tokens to generate) must be under 2048.

do_sample

type: This is always bool.
value: Whether or not to use sampling; greedy decoding is used otherwise. Sampling means randomly picking the next token according to its conditional probability distribution, so generation is no longer deterministic. If you need deterministic results, set this to false; the model is then less likely to generate unexpected or unusual words. Setting it to true gives the model a better chance of generating a high-quality, more varied response. The default value is false.

repetition_penalty

type: This is always float.
value: The repetition penalty value can be between 1.0 and 10.0. This parameter controls how repetitive the text can be: a lower value allows more repetition, while a higher value penalizes it. The default value is 1.0, which applies no penalty.

temperature

type: This is always float.
value: The temperature value can be between 0 and 1. Higher values make the predictions more random and creative, while lower values (closer to 0) make the model more deterministic. The default value is 1.0.

top_k

type: This is always float.
value: The top k value can be between 1 and 100. Top k sampling lets the model choose randomly among the k most probable tokens; for example, setting top k to 3 restricts sampling to the three most likely tokens. Changing the top k parameter sets the size of the shortlist the model samples from as it outputs each token. Setting top k to 1 is equivalent to greedy decoding. The default is 50.

top_logprobs

type: This is always int.
value: The value can be between 0 and 20. Returns the top N tokens by their probability of being generated next, which shows how likely each alternative was at each step. This helps debug a given generation and inspect alternatives to the generated token. The default is 0.

top_p

type: This is always float.
value: The value can be between 0 and 1. Controls diversity via nucleus sampling, which bounds the randomness and originality of the model. The top p parameter specifies a sampling threshold at inference time: the model shortlists the smallest set of most probable tokens whose cumulative probability reaches top_p, and samples only from that set. The default is 1.0.

Example request
{
  "items": [
    {
      "id": "item1",
      "value": "{\"messages\":[{\"message_id\":0,\"role\":\"user\",\"content\":\"Hi\"},{\"message_id\":1,\"role\":\"assistant\",\"content\":\"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?\"},{\"message_id\":2,\"role\":\"user\",\"content\":\"\"}]}"
    },
    {
      "id": "item2",
      "value": "{\"messages\":[{\"message_id\":0,\"role\":\"user\",\"content\":\"Hi\"},{\"message_id\":1,\"role\":\"assistant\",\"content\":\"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?\"},{\"message_id\":2,\"role\":\"user\",\"content\":\"\"}]}"
    }
  ],
  "params": {
    "do_sample": false,
    "select_expert": "Meta-Llama-3-70B-Instruct",
    "process_prompt": true,
    "top_k": 50
  }
}
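Note that in the example above, each value is itself a JSON document serialized into a string, not a nested object. A sketch of building such an item (the message fields follow the example above; the helper name is hypothetical):

```python
# Sketch: encode a chat conversation as a Generic V2 request item.
# The conversation is serialized to a JSON string, since `value`
# must be a string rather than a nested object.
import json


def chat_item(item_id, messages):
    payload = {
        "messages": [
            {"message_id": i, "role": m["role"], "content": m["content"]}
            for i, m in enumerate(messages)
        ]
    }
    return {"id": item_id, "value": json.dumps(payload)}
```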

Stream response format

The stream response format for the Generic V2 API is described below.

  • .result.items: Contains responses for the prompts passed into the input request.

  • Each object in this array has the following format.

    • id: Indicates which input item this response corresponds to.

    • value: Contains the response details.

      • is_last_response: false for every stream response for a batch item except the last one.

      • stream_token: The actual response token for the prompt.
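A sketch of consuming these events client-side, concatenating stream tokens per item id. The events are assumed to be already parsed from the stream into dictionaries; field names follow the format above.

```python
# Sketch: accumulate stream_token values per item id until the
# last response, which carries the full completion.
def accumulate_stream(events):
    completions = {}
    for event in events:
        for item in event["result"]["items"]:
            value = item["value"]
            if value.get("is_last_response"):
                # Prefer the server-provided full completion if present.
                completions[item["id"]] = value.get(
                    "completion", completions.get(item["id"], "")
                )
            else:
                completions[item["id"]] = (
                    completions.get(item["id"], "") + value["stream_token"]
                )
    return completions
```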

Example stream response
{
    "result": {
        "items": [
            {
                "id": "item1",
                "value": {
                    "is_last_response": false,
                    "stream_token": "severe cold symptoms, seek medical "
                }
            }
        ]
    }
}

Last stream response

The format for the last stream response is described below.

  • .result.items: Contains responses for the prompts passed into the input request.

  • If an object is the last response for a particular batch item, it will have value.is_last_response set to true.

    • id: Indicates which input item this response corresponds to.

    • completion: Contains the full completion generated for the prompt.

    • stream_token: Empty in the last response.

    • start_time: The start time of the request on the server.

    • end_time: The end time of the request on the server.

    • prompt_tokens_count: The total token count of the user prompt.

    • completion_tokens_count: The total token count generated by the model.

    • batch_size_used: The batch size that was used for prediction on the server.

    • stop_reason: The reason why the output was terminated.

      • Examples: max_len_reached, end_of_text, or stop_sequence_hit.

    • time_to_first_token: The time to the first token on the server side.

    • throughput_after_first_token: The number of tokens generated per second after the first token.

    • total_tokens_count: The total token count, including the input and output.

    • model_execution_time: The total time the model needed to execute the request.
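The token counts are related: total_tokens_count is the sum of the prompt and completion counts (consistent with the example response that follows, where 14 + 174 = 188). A small hypothetical sanity-check helper:

```python
# Sketch: verify that the reported token counts in a last stream
# response are internally consistent.
def check_token_counts(value):
    return value["total_tokens_count"] == (
        value["prompt_tokens_count"] + value["completion_tokens_count"]
    )
```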

Example last stream response
{
    "result": {
        "items": [
            {
                "id": "item1",
                "value": {
                    "batch_size_used": 1,
                    "completion": "I was created by a group of researchers at Meta AI. I'm a deep neural network, specifically a transformer, that's been trained on a large corpus of text data. My training data includes a massive amount of text from various sources, including books, articles, and websites, which allows me to understand and generate human-like language.\n\nMy development is the result of a collaboration between several researcher and engineers at Meta AI, who have expertise in natural language processing, machine learning, and software development. They've worked together to design and train me, as well as to develop the infrastructure that allows me to interact with users like you.\n\nI'm constantly learning and improving, so I can become more accurate and informative in my responses. This is possible thanks to the efforts of my developers, who continue to refine my training data and algorithms to make me a better conversational AI.",
                    "completion_tokens_count": 174,
                    "end_time": 1719850149.1811948,
                    "is_last_response": true,
                    "model_execution_time": 2.647777557373047,
                    "prompt": "\u003c|start_header_id|\u003euser\u003c|end_header_id|\u003e\n\nwho created you?\u003c|eot_id|\u003e\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\n\n",
                    "prompt_tokens_count": 14,
                    "start_time": 1719850146.5334172,
                    "stop_reason": "end_of_text",
                    "stream_token": "",
                    "throughput_after_first_token": 70.48973284663778,
                    "time_to_first_token": 0.2644517421722412,
                    "total_tokens_count": 188
                }
            }
        ]
    }
}

Non-stream response

The non-stream response format for the Generic V2 API is described below.

  • items: Contains responses to all the items that were sent in the input request.

  • Each item response contains the following fields.

    • id: Indicates which input request item this output corresponds to.

    • value: The actual prediction generated by the model for the given prompt.
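A sketch of matching non-stream responses back to their input items by id (a hypothetical helper, not part of the API):

```python
# Sketch: map each response item's id to its generated value.
def responses_by_id(response):
    return {item["id"]: item["value"] for item in response["items"]}
```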

Example non-streaming API response
{
    "items": [
        {
            "id": "item1",
            "value": "I'm just a language model, I don't have feelings or emotions like humans do, so I don't have good or bad days. I'm always \"on\" and ready to assist you with any questions or tasks you may have!\n\nThat being said, I'm functioning properly and ready to help you with anything you need. How can I assist you today?"
        },
        {
            "id": "item2",
            "value": "I'm an AI, and I can do a lot of things. Here are some examples:\n\n**Converse**: I can have a conversation with you, answering your questions, providing information, and even engaging in small talk.\n\n**Provide information**: I have been trained on a massive dataset of text and can provide information on a wide range of topics, including but not limited to:\n\n* History\n* Science\n* Technology\n* Health\n* Entertainment\n* Culture\n* Education\n* And many more!\n\n**Generate text**: I can generate text based on a prompt or topic. This can be useful for writing articles, creating content, or even composing emails.\n\n**Translate text**: I can translate text from one language to another. I currently support translations in dozens of languages.\n\n**Summarize content**: I can summarize long pieces of text, such as articles or documents, into shorter, more digestible versions.\n\n**Offer suggestions**: I can offer suggestions for things like gift ideas, travel destinations, books to read, and more.\n\n**Play games**: I can play simple text-based games with you, such as Hangman, 20 Questions, and Word Jumble.\n\n**Generate creative content**: I can generate creative content, such as poetry or short stories.\n\n**Assist with language learning**: I can help with language learning by providing grammar explanations, vocabulary practice, and conversation practice.\n\n**Provide definitions**: I can define words and phrases, explaining their meanings and usage.\n\n**And more!**: I'm constantly learning and improving, so there may be other things I can do that aren't listed here.\n\nWhat would you like to do or talk about?"
        },
        {
            "id": "item3",
            "value": "I was created by a group of researcher at Meta AI. I'm a deep neural network, specifically a transformer, that's been trained on a large corpus of text data. My training data includes a massive amount of text from various sources, including books, articles, and websites, which allows me to understand and generate human-like language.\n\nMy development is the result of a collaboration between several researcher and engineers at Meta AI, who have expertise in natural language processing, machine learning, and software development. They've worked together to design and train me, as well as to develop the infrastructure that allows me to interact with users like you.\n\nI'm constantly learning and improving, so I can become more accurate and informative in my responses. This is possible thanks to the efforts of my developers, who continue to refine my training data and algorithms to make me a better conversational AI."
        },
        {
            "id": "item4",
            "value": "Here are 10 things not to do when you're feeling cold:\n\n1. **Don't stay in wet clothes**: Wet clothes can make you lose heat quickly, making you feel even colder. Change into dry, warm clothes as soon as possible.\n2. **Avoid drinking cold beverages**: Drinking cold drinks can lower your body temperature further, making you feel colder. Opt for warm beverages like tea, coffee, or hot chocolate instead.\n3. **Don't take a cold shower**: Taking a cold shower can cause your body to lose heat rapidly, making you feel colder. Take a warm or hot shower instead to help raise your body temperature.\n4. **Don't go outside without dressing warmly**: If you need to go outside, make sure to dress warmly in layers, including a hat, gloves, and scarf. This will help prevent heat loss.\n5. **Don't ignore hypothermia symptoms**: If you're experiencing symptoms like shivering, confusion, or dizziness, seek medical attention immediately. Hypothermia can be life-threatening if left untreated.\n6. **Don't rely on caffeine or nicotine**: While caffeine and nicotine may provide a temporary energy boost, they can also cause blood vessels to constrict, making you feel colder.\n7. **Don't skip meals**: Eating regular meals can help keep your body warm by providing energy. Opt for warm, hearty meals like soup or stew.\n8. **Don't use electric blankets or heating pads on high**: While electric blankets and heating pads can provide warmth, using them on high settings can cause burns or fires. Use them on low settings and follow the manufacturer's instructions.\n9. **Don't neglect to maintain your heating system**: Make sure your heating system is in good working condition to prevent breakdowns and keep your home warm.\n10. **Don't underestimate the power of physical activity**: Engage in light physical activity like stretching, jumping jacks, or yoga to get your blood flowing and warm yourself up. 
However, avoid strenuous exercise that can cause you to sweat and lose heat.\n\nRemember to always prioritize your safety and health when feeling cold. If you're experiencing persistent or severe cold symptoms, seek medical attention if necessary."
        }
    ]
}