API reference and Swagger

This document contains API reference information and describes how to access and interact with the SambaStudio Swagger framework.

Online generative inference

Once you have deployed an endpoint for a generative model, you can run online inference against it to get completions for prompts.

Text models

Creates a model response for a given chat conversation or text prompt.

Predict (HTTP POST): The URL of the endpoint displayed in the Endpoint window.
Example: https://<your-sambastudio-domain>/api/predict/generic/<project-id>/<endpoint-id>

Stream (HTTP POST): The Stream URL of the endpoint displayed in the Endpoint window.
Example: https://<your-sambastudio-domain>/api/predict/generic/stream/<project-id>/<endpoint-id>

Request body

inputs (Array of strings): A list of prompts to provide to the model.

params (JSON object): The tuning parameters to use, specified as key-value pairs. Each parameter is an object with a type and a value, as described below.

max_tokens_to_generate

type: This is always int.
value: The number of tokens to generate. The total length of tokens (prompt + tokens to generate) must be under 2048.

do_sample

type: This is always bool.
value: Whether or not to use sampling; if false, greedy decoding is used. Sampling picks the next token at random according to its conditional probability distribution, so generation is no longer deterministic. Set this to false if you need deterministic results and fewer unexpected or unusual words; set it to true to give the model a better chance of generating a varied, higher quality response. The default value is false.

repetition_penalty

type: This is always float.
value: Controls how repetitive the generated text can be. The value can be between 1.0 and 10.0; a lower value allows more repetition, while a higher value penalizes it. The default value is 1.0, which applies no penalty.

temperature

type: This is always float.
value: The temperature value can be between 0 and 1. Higher values make the predictions more random and creative, while lower values (closer to 0) make the model more deterministic. The default value is 1.0.

top_k

type: This is always int.
value: The top k value can be between 1 and 100. Top k restricts sampling to the k most probable tokens: for example, setting top k to 3 lets the model choose among the three most likely tokens at each step. Changing top k sets the size of the shortlist the model samples from as it outputs each token; setting it to 1 gives greedy decoding. The default is 50.

top_logprobs

type: This is always int.
value: The value can be between 0 and 20. Returns the top N tokens by probability at each generation step, indicating how likely each token was to be generated next. This helps debug a generation and shows alternatives to the generated token. The default is 0.

top_p

type: This is always float.
value: The value can be between 0 and 1. Controls diversity via nucleus sampling: only the smallest set of most probable tokens whose cumulative probability reaches top_p is kept for generation, so values below 1 restrict sampling to the most likely tokens. The default is 1.0.

Request template
curl --location '<your-endpoint-url>' \
--header 'Content-Type: application/json' \
--header 'key: <your-endpoint-key>' \
--data '{
    "inputs": [
        "Whats the capital of Austria?"
    ],
    "params": {
        "do_sample": {
            "type": "bool",
            "value": "false"
        },
        "max_tokens_to_generate": {
            "type": "int",
            "value": "100"
        },
        "repetition_penalty": {
            "type": "float",
            "value": "1"
        },
        "temperature": {
            "type": "float",
            "value": "1"
        },
        "top_k": {
            "type": "int",
            "value": "50"
        },
        "top_logprobs": {
            "type": "int",
            "value": "0"
        },
        "top_p": {
            "type": "float",
            "value": "1"
        }
    }
}'
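
The same predict request can also be made from Python. The following is a minimal sketch, assuming the third-party requests library; the endpoint URL, key, and prompt are placeholders.

# Minimal sketch of a predict request in Python (assumes the "requests" library).
import requests

ENDPOINT_URL = "https://<your-sambastudio-domain>/api/predict/generic/<project-id>/<endpoint-id>"
API_KEY = "<your-endpoint-key>"

payload = {
    "inputs": ["Whats the capital of Austria?"],
    "params": {
        "do_sample": {"type": "bool", "value": "false"},
        "max_tokens_to_generate": {"type": "int", "value": "100"},
        "temperature": {"type": "float", "value": "1"},
        "top_k": {"type": "int", "value": "50"},
        "top_p": {"type": "float", "value": "1"},
    },
}

response = requests.post(ENDPOINT_URL, headers={"key": API_KEY}, json=payload)
response.raise_for_status()

# The predict response wraps one result per input prompt in the "data" array.
for result in response.json()["data"]:
    print(result["completion"])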

Predict response

data (Array): One response object per prompt in the input array. Each object contains the following fields.

stop_reason (String): The reason the model stopped generating tokens. Possible values are max_len_reached, end_of_text, and stop_sequence_hit.

completion (String): The model's prediction for the input prompt.

total_tokens_count (Float): Count of the total tokens generated by the model.

tokens (Array): The tokens generated for the given prompt.

logprobs (JSON): The top N tokens by probability at each position, indicating how likely each token was to be generated next. The value is null unless top_logprobs is set to a nonzero value in the request.

prompt (String): The prompt provided in the input.

Example predict response
{
    "data": [
        {
            "prompt": "Whats the capital of Austria?",
            "tokens": [
                "\n",
                "\n",
                "Answer",
                ":",
                "The",
                "capital",
                "of",
                "Austria",
                "is",
                "Vienna",
                "(",
                "G",
                "erman",
                ":",
                "Wien",
                ")."
            ],
            "total_tokens_count": 24.0,
            "completion": "\n\nAnswer: The capital of Austria is Vienna (German: Wien).",
            "logprobs": {
                "top_logprobs": null,
                "text_offset": null
            },
            "stop_reason": "end_of_text"
        }
    ]
}

Stream response

If the request is streamed, the response is returned as a sequence of completion objects, ending with a final object that indicates the stream is complete.

stop_reason (String): The reason the model stopped generating tokens. Possible values are max_len_reached, end_of_text, and stop_sequence_hit.

completion (String): The model's prediction for the input prompt.

total_tokens_count (Float): Count of the total tokens generated by the model.

tokens (Array): The tokens generated for the given prompt.

logprobs (JSON): The top N tokens by probability at each position, indicating how likely each token was to be generated next. The value is null unless top_logprobs is set to a nonzero value in the request.

prompt (String): The prompt provided in the input.

is_last_response (Boolean): Indicates whether this is the last response object for the request.

stream_token (String): The text chunk streamed in this response object.

Example chat completion stream object
{
    "prompt": "",
    "tokens": null,
    "stop_reason": "",
    "logprobs": {
        "top_logprobs": null,
        "text_offset": null
    },
    "is_last_response": false,
    "completion": "",
    "total_tokens_count": 0.0,
    "stream_token": "?\n\nAnswer: "
}
Example end event response
{
    "prompt": "Whats the capital of Austria",
    "tokens": [
        "?",
        "\n",
        "\n",
        "Answer",
        ":",
        "The",
        "capital",
        "of",
        "Austria",
        "is",
        "Vienna",
        "(",
        "G",
        "erman",
        ":",
        "Wien",
        ")."
    ],
    "stop_reason": "end_of_text",
    "logprobs": {
        "top_logprobs": null,
        "text_offset": null
    },
    "is_last_response": true,
    "completion": "?\n\nAnswer: The capital of Austria is Vienna (German: Wien).",
    "total_tokens_count": 24.0,
    "stream_token": ""
}
Example complete stream response
{"prompt": "", "tokens": null, "stop_reason": "", "logprobs": {"top_logprobs": null, "text_offset": null}, "is_last_response": false, "completion": "", "total_tokens_count": 0.0, "stream_token": "?\n\nAnswer: "}
{"prompt": "", "tokens": null, "stop_reason": "", "logprobs": {"top_logprobs": null, "text_offset": null}, "is_last_response": false, "completion": "", "total_tokens_count": 0.0, "stream_token": "The "}
....
....
...
{"stop_reason": "", "tokens": null, "prompt": "", "logprobs": {"top_logprobs": null, "text_offset": null}, "is_last_response": false, "completion": "", "total_tokens_count": 0.0, "stream_token": "Wien)."}
{"prompt": "Whats the capital of Austria", "tokens": ["?", "\n", "\n", "Answer", ":", "The", "capital", "of", "Austria", "is", "Vienna", "(", "G", "erman", ":", "Wien", ")."], "stop_reason": "end_of_text", "logprobs": {"top_logprobs": null, "text_offset": null}, "is_last_response": true, "completion": "?\n\nAnswer: The capital of Austria is Vienna (German: Wien).", "total_tokens_count": 24.0, "stream_token": ""}

Multimodal model

You can use the LLaVA multimodal API to run inference against a text prompt and an image together.

HTTP Method: POST
Endpoint: The URL of the endpoint displayed in the Endpoint window.

Request body

instances (Array of JSON objects): An array of prompt and image pairs to provide to the model (currently only one prompt and image pair is supported). Each object contains:

prompt : The prompt for the model, as a string.
image_content : The base64-encoded string of the image.

params (JSON object): The tuning parameters to use, specified as key-value pairs. Each parameter is an object with a type and a value, as described below.

max_tokens_to_generate

type: This is always int.
value: The number of tokens to generate. The total length of tokens (prompt + tokens to generate) must be under 2048.

do_sample

type: This is always bool.
value: Whether or not to use sampling; if false, greedy decoding is used. Sampling picks the next token at random according to its conditional probability distribution, so generation is no longer deterministic. Set this to false if you need deterministic results and fewer unexpected or unusual words; set it to true to give the model a better chance of generating a varied, higher quality response. The default value is false.

repetition_penalty

type: This is always float.
value: Controls how repetitive the generated text can be. The value can be between 1.0 and 10.0; a lower value allows more repetition, while a higher value penalizes it. The default value is 1.0, which applies no penalty.

temperature

type: This is always float.
value: The temperature value can be between 0 and 1. Higher values make the predictions more random and creative, while lower values (closer to 0) make the model more deterministic. The default value is 1.0.

top_k

type: This is always int.
value: The top k value can be between 1 and 100. Top k restricts sampling to the k most probable tokens: for example, setting top k to 3 lets the model choose among the three most likely tokens at each step. Changing top k sets the size of the shortlist the model samples from as it outputs each token; setting it to 1 gives greedy decoding. The default is 50.

top_logprobs

type: This is always int.
value: The value can be between 0 and 20. Returns the top N tokens by probability at each generation step, indicating how likely each token was to be generated next. This helps debug a generation and shows alternatives to the generated token. The default is 0.

top_p

type: This is always float.
value: The value can be between 0 and 1. Controls diversity via nucleus sampling: only the smallest set of most probable tokens whose cumulative probability reaches top_p is kept for generation, so values below 1 restrict sampling to the most likely tokens. The default is 1.0.

Request template
curl --location 'https://<host>/api/predict/generic/<project-id>/<endpoint-id>' \
--header 'Content-Type: application/json' \
--header 'key: <your-endpoint-key>' \
--data '{
    "instances": [
        {
            "prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the humans questions. USER: What are the things I should be cautious about when I visit here? ASSISTANT:",
            "image_content": "base64-encoded-string-of-image"
        }
    ],
    "params": {
        "do_sample": {
            "type": "bool",
            "value": "false"
        },
        "max_tokens_to_generate": {
            "type": "int",
            "value": "100"
        },
        "repetition_penalty": {
            "type": "float",
            "value": "1"
        },
        "stop_sequences": {
            "type": "str",
            "value": ""
        },
        "temperature": {
            "type": "float",
            "value": "1"
        },
        "top_k": {
            "type": "int",
            "value": "50"
        },
        "top_logprobs": {
            "type": "int",
            "value": "0"
        },
        "top_p": {
            "type": "float",
            "value": "1"
        }
    }
}'
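
The image_content field carries the image as a base64-encoded string. The following is a minimal sketch of building that payload in Python; it assumes the third-party requests library, and the endpoint URL, key, and image path are placeholders.

# Minimal sketch of a multimodal (LLaVA) request in Python (assumes "requests").
import base64
import requests

ENDPOINT_URL = "https://<host>/api/predict/generic/<project-id>/<endpoint-id>"
API_KEY = "<your-endpoint-key>"

# Base64-encode the image file, as expected by the "image_content" field.
with open("photo.jpg", "rb") as image_file:
    image_b64 = base64.b64encode(image_file.read()).decode("utf-8")

payload = {
    "instances": [
        {
            "prompt": (
                "A chat between a curious human and an artificial intelligence assistant. "
                "The assistant gives helpful, detailed, and polite answers to the humans questions. "
                "USER: What is shown in the image? ASSISTANT:"
            ),
            "image_content": image_b64,
        }
    ],
    "params": {
        "do_sample": {"type": "bool", "value": "false"},
        "max_tokens_to_generate": {"type": "int", "value": "100"},
    },
}

response = requests.post(ENDPOINT_URL, headers={"key": API_KEY}, json=payload)
response.raise_for_status()
print(response.json()["predictions"][0]["completion"])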

Response

predictions (Array): One response object per prompt in the input array. Each object contains the following fields.

stop_reason (String): The reason the model stopped generating tokens. Possible values are max_len_reached, end_of_text, and stop_sequence_hit.

completion (String): The model's prediction for the input prompt.

total_tokens_count (Float): Count of the total tokens generated by the model.

tokens (Array): The tokens generated for the given prompt.

logprobs (JSON): The top N tokens by probability at each position, indicating how likely each token was to be generated next. The value is null unless top_logprobs is set to a nonzero value in the request.

prompt (String): The prompt provided in the input.

status (JSON): Details of the request, such as completion status and elapsed time.

Example response
{
    "status": {
        "complete": true,
        "exitCode": 0,
        "elapsedTime": 2.8143582344055176,
    },
    "predictions": [
        {
            "completion": "The image shows a person standing in front of a large body of water, which could be an ocean or a lake. The person is wearing a wetsuit and appears to be preparing to go into the water.",
            "logprobs": {
                "top_logprobs": null,
                "text_offset": null
            },
            "prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the humans questions. USER: What is shown in the image? ASSISTANT:",
            "stop_reason": "end_of_text",
            "tokens": [
                "The",
                "image",
                "shows",
                "a",
                "person",
                "standing",
                "in",
                "front",
                "of",
                "a",
                "large",
                "body",
                "of",
                "water",
                ",",
                "which",
                "could",
                "be",
                "an",
                "ocean",
                "or",
                "a",
                "lake",
                ".",
                "The",
                "person",
                "is",
                "we",
                "aring",
                "a",
                "w",
                "ets",
                "uit",
                "and",
                "appears",
                "to",
                "be",
                "prepar",
                "ing",
                "to",
                "go",
                "into",
                "the",
                "water",
                "."
            ],
            "total_tokens_count": 89
        }
    ]
}

Online ASR inference

SambaStudio allows you to deploy an endpoint for automatic speech recognition (ASR) and run online inference against it, enabling live-transcription scenarios.

To run online inference for ASR, send a .flac or .wav file containing the audio in an HTTP POST request. The resulting transcription is returned in the response.

The audio file must have a sample rate of 16 kHz and contain no more than 15 seconds of audio.
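
The following is a minimal sketch for checking these constraints before uploading a file. It assumes the third-party soundfile library; any audio library that reports sample rate and duration would work.

# Minimal sketch: verify the 16 kHz / 15 second ASR constraints (assumes "soundfile").
import soundfile as sf

info = sf.info("1462-170138-0001.flac")
duration_seconds = info.frames / info.samplerate

assert info.samplerate == 16000, "Audio must be sampled at 16 kHz"
assert duration_seconds <= 15, "Audio must contain no more than 15 seconds"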

API reference

Request

HTTP Method: POST
Endpoint: The URL from the Endpoint window.

Headers

key: The API Key from the Endpoint window.

form-data

predict_file: Path to the .flac or .wav audio file to be transcribed.

Request template
curl -k -X POST '<your-endpoint-url>' \
-H 'key: <your-endpoint-key>' \
--form 'predict_file=<your-file-path>'

Response params (JSON)

status_code (Integer): The HTTP status code for the request. 200 indicates the transcription request succeeded.

data (Array of strings): The transcribed text.

If the request fails due to certificate verification, add the -k flag to your request.

Examples

The examples below demonstrate a request and a response.

Example curl request to transcribe a locally stored audio file using online ASR inference
curl -k -X POST "<your-endpoint-url>" \
-H "key:<your-endpoint-key>" \
--form 'predict_file=@"/Users/username/Downloads/1462-170138-0001.flac"'
Example response
{
  "status_code":200,
  "data":["He has written a delightful part for her and she's quite inexpressible."]
}
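
The same request can also be made from Python. The following is a minimal sketch, assuming the third-party requests library; the endpoint URL, key, and file path are placeholders.

# Minimal sketch of an ASR request in Python (assumes "requests").
import requests

ENDPOINT_URL = "<your-endpoint-url>"
API_KEY = "<your-endpoint-key>"

# The audio file is sent as multipart form data under the "predict_file" field.
with open("/Users/username/Downloads/1462-170138-0001.flac", "rb") as audio_file:
    response = requests.post(
        ENDPOINT_URL,
        headers={"key": API_KEY},
        files={"predict_file": audio_file},
        verify=False,  # equivalent of curl's -k flag; use only if certificate verification fails
    )

response.raise_for_status()
print(response.json()["data"])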

Online inference for other NLP tasks

For non-generative tasks, the Try It feature lets you generate predictions directly in the platform. To use Try It, your endpoint must be in Live status. Follow the steps below to generate predictions with Try It.

See the Create and use endpoints document for information on how to use endpoints in the platform.

  1. From an Endpoint window, click the Try Now button.

    Figure 1. Try Now button
    1. The Try It window will open.

  2. Input text into the Try It window to use the following options:

    1. Click the Run button to view a response relative to the endpoint’s task.

      Figure 2. Try It inputted text
    2. Click the Curl command, CLI Command, and Python SDK buttons to view how to make a request programmatically for each option.

      Figure 3. Try It Curl command

SambaStudio Swagger framework

SambaStudio implements the OpenAPI Specification (OAS) Swagger framework to describe and use its REST APIs.

Access the SambaStudio Swagger framework

To access SambaStudio’s OpenAPI Swagger framework, add /api/docs to your host server URL.

Example host URL for Swagger

http://<sambastudio-host-domain>/api/docs

Interact with the SambaStudio APIs

For the Predict and Predict File APIs, use the information described in the Online generative inference and Online ASR inference sections of this document.

You will need the following information when interacting with the SambaStudio Swagger framework.

Project ID

When you are viewing a Project window, the Project ID is displayed in the browser URL path after …​/projects/details/. In the example below, cd6c07ca-2fd4-452c-bf3e-f54c3c2ead83 is the Project ID.

Example Project ID path
http://<sambastudio-host-domain>/ui/projects/details/cd6c07ca-2fd4-452c-bf3e-f54c3c2ead83

See Projects for more information.

Job ID

When you are viewing a Job window, the Job ID is displayed in the browser URL path after …​/projects/details/<project-id>/jobs/. In the example below, cb1ca778-e25e-42b0-bf43-056ab34374b0 is the Job ID.

Example Job ID path
http://<sambastudio-host-domain>/ui/projects/details/cd6c07ca-2fd4-452c-bf3e-f54c3c2ead83/jobs/cb1ca778-e25e-42b0-bf43-056ab34374b0

See the Train jobs document for information on training jobs. See the Batch inference document for information on batch inference jobs.

Endpoint ID

The Endpoint ID is displayed in the URL path of the Endpoint information window; it is the last ID in the path.

Figure 4. Endpoint ID

See Create and use endpoints for more information.

Key

The SambaStudio Swagger framework requires the SambaStudio API authorization key. This Key is generated in the Resources section of the platform. See SambaStudio resources for information on how to generate your API authorization key.

Figure 5. Resources section