CLIP (Contrastive Language-Image Pre-training) is a multimodal model that understands both images and text. It is trained on diverse image-text pairs, which enables zero-shot classification: it can assign images to classes it was never explicitly trained on. Its image and text encoders each map an input to a fixed-size embedding vector, and a contrastive loss during training pulls the embeddings of matching image-text pairs together while pushing mismatched pairs apart. The resulting shared embedding space supports tasks such as text and image retrieval, image classification, and object detection.
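As a minimal sketch of how embeddings from a shared space can drive zero-shot classification, the example below scores an image embedding against text embeddings of candidate labels using cosine similarity. The function names and the three-dimensional toy vectors are hypothetical stand-ins for illustration; real CLIP embeddings are much longer, and this is not necessarily how any particular deployment performs the comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for real CLIP embeddings (hypothetical values).
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

In practice, the image embedding and each label embedding would come from the image and text inputs of a CLIP endpoint, respectively.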


The input is either an image in a standard format, such as JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics), passed to the endpoint as a file, or a text string.


The output is a dictionary containing a data key whose value is a list of embedding vectors (lists of floating-point values) produced by the image or text encoder.

Example raw output
{'data': [[-0.024150066077709198, -0.05926524102687836, -0.04773445427417755, -0.01874605007469654, -0.006788679398596287, -0.00362512469291687, 0.010333946906030178, 0.020949017256498337, -0.38842734694480896 ...0.000597264850512147]], 'status_code': 200}
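A response in this shape can be unpacked with a few lines of Python. The sketch below is illustrative only: `extract_embedding` is a hypothetical helper, and it assumes that a missing status_code means success, which the source does not confirm.

```python
def extract_embedding(response):
    """Return the first embedding vector from an endpoint response dictionary."""
    # Assumption: a missing 'status_code' is treated as success.
    status = response.get("status_code", 200)
    if status != 200:
        raise ValueError(f"endpoint returned status {status}")
    data = response["data"]
    if not data or not data[0]:
        raise ValueError("response contains no embedding")
    return [float(value) for value in data[0]]


# Shortened, hypothetical response in the same shape as the raw output above.
response = {"data": [[-0.0241, -0.0592, -0.0477]], "status_code": 200}
embedding = extract_embedding(response)
print(len(embedding))
```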

Hyperparameters and settings

The hyperparameters and settings for CLIP are described below.

Deploy a CLIP model

Follow the steps below to deploy an endpoint using CLIP.

  1. Create a new project or use an existing one.

  2. From a project window, click New endpoint. The Add an endpoint box will open.

  3. Select the following settings to create the endpoint:

    1. Select CLIP from the ML App drop-down.

    2. Select a CLIP model, such as OpenCLIP CLIP-ViT-B-32 Backbone, from the Select model drop-down.

  4. Click Add an endpoint to deploy the endpoint.

    Figure 1. Add an endpoint box
  5. The Endpoint created confirmation will display.

  6. Click View endpoint to open the endpoint details window.

    Figure 2. Endpoint confirmation
  7. After a few minutes, the status will change to Live, indicating that the endpoint is ready.

    Figure 3. CLIP endpoint window

Please refer to the Usage section for instructions on how to interact with the endpoint.


Once the endpoint has been created in SambaStudio, use the following formats to interact with it.

CLIP accepts either an image or a text string as input and returns the corresponding embedding as output. Depending on the modality of the input, you will need to use a slightly different URL path in the curl command.

Use an image as the input

The example below demonstrates how to pass an image as an input.

Only the multipart/form-data format is supported. Passing an image as a Base64-encoded string in a JSON payload is not currently supported.

For an image input, be sure the path in your command includes file, as shown in the example below.

Example command for image input
curl -X POST \
     -H 'key:<your_API_key>' \
     --form 'predict_file=@"<PATH_TO_JPG_FILE>"' \
     <domain-address>/api/predict/file/<project_key>/<API_key>
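For callers who prefer Python to curl, the following sketch builds an equivalent multipart/form-data request using only the standard library. The helper names, the `https://` scheme, the `example.invalid` host, and the `image/jpeg` part type are assumptions for illustration; only the `key` header and the `predict_file` form field are taken from the command above.

```python
import urllib.request
import uuid


def build_multipart(field_name, filename, file_bytes, content_type="image/jpeg"):
    """Encode one file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + file_bytes + tail, f"multipart/form-data; boundary={boundary}"


def build_image_request(url, api_key, filename, file_bytes):
    """Build (but do not send) a POST request carrying the image as predict_file."""
    body, content_type = build_multipart("predict_file", filename, file_bytes)
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"key": api_key, "Content-Type": content_type},
    )


# Hypothetical URL and key; substitute the values for your own endpoint.
request = build_image_request(
    "https://example.invalid/api/predict/file/PROJECT_KEY/API_KEY",
    "YOUR_API_KEY",
    "cat.jpg",
    b"\xff\xd8\xff",  # placeholder bytes, not a real JPEG
)
print(request.get_method(), request.get_header("Content-type"))
# To send it: urllib.request.urlopen(request).read()
```

The request is built separately from being sent so that the encoding can be inspected before any network call is made.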

Use text as the input

The example below demonstrates how to pass text as an input.

For a text input, be sure the path in your command includes nlp, as shown in the example below.

Example command for text input
curl -X POST \
     -H 'key:<your_API_key>' \
     -H 'Content-Type: application/json' \
     --data '{"inputs":["Today is a nice day."]}' \
     <domain-address>/api/predict/nlp/<project_key>/<API_key>
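The same text request can be sketched in Python with the standard library. The helper names, the `https://` scheme, and the `example.invalid` host are assumptions for illustration; the path layout, the `key` header, and the `inputs` payload follow the commands above.

```python
import json
import urllib.request


def predict_url(domain, modality, project_key, api_key):
    """Build the prediction URL; modality is 'file' for images or 'nlp' for text."""
    if modality not in ("file", "nlp"):
        raise ValueError("modality must be 'file' or 'nlp'")
    # Assumption: the endpoint is served over HTTPS.
    return f"https://{domain}/api/predict/{modality}/{project_key}/{api_key}"


def build_text_request(url, api_key, texts):
    """Build (but do not send) a POST request with a JSON list of input strings."""
    payload = json.dumps({"inputs": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"key": api_key, "Content-Type": "application/json"},
    )


# Hypothetical domain and keys; substitute the values for your own endpoint.
url = predict_url("example.invalid", "nlp", "PROJECT_KEY", "API_KEY")
text_request = build_text_request(url, "YOUR_API_KEY", ["Today is a nice day."])
print(text_request.full_url)
# To send it: json.loads(urllib.request.urlopen(text_request).read())
```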


Whether the input is an image or text, the output is the embedding of that input, as shown below.

Example output
{"data": [[0.007865619845688343, 0.1614878624677658, -0.057749394327402115, -0.023593805730342865, -0.027492167428135872,  ... 0.00047873161383904517, -0.038996435701847076, -0.002148156752809882, 0.03880875185132027, 0.04024270176887512]]}