CLIP
CLIP is a multimodal model that embeds both images and text in a shared vector space. It is trained on diverse image-text pairs, which enables zero-shot classification: assigning images to classes without prior knowledge of those specific classes. A visual encoder and a text encoder each map their input to a fixed-size vector, and a contrastive loss function applied during training pulls the vectors of matching image-text pairs together and pushes mismatched pairs apart. The resulting representation is useful for tasks such as text and image retrieval, image classification, and object detection.
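The comparison of image and text vectors described above can be sketched as a small zero-shot classifier. This is a minimal illustration, not the model's actual implementation: the 3-dimensional vectors are toy stand-ins for real CLIP embeddings, and the function names and the temperature value are assumptions chosen for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings, temperature=0.01):
    """Return {label: probability} via a softmax over scaled cosine similarities.

    The temperature is an illustrative assumption, not a documented setting.
    """
    scores = {
        label: cosine_similarity(image_embedding, emb) / temperature
        for label, emb in label_embeddings.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy vectors standing in for embeddings returned by the image and text encoders.
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

The candidate classes are supplied as text at query time, which is why no class-specific training is needed.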
Input
Either an image in a standard format, such as JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics), passed to the endpoint as a file, or a text string.
Output
The output is a dictionary containing a data key whose value is a list of tensor values: the embedding produced by the image or text encoder.
{'data': [[-0.024150066077709198, -0.05926524102687836, -0.04773445427417755, -0.01874605007469654, -0.006788679398596287, -0.00362512469291687, 0.010333946906030178, 0.020949017256498337, -0.38842734694480896 ...0.000597264850512147]], 'status_code': 200}
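As a rough sketch of how a client might read this dictionary, the helper below pulls the single embedding out of a response. The function name is hypothetical, the sample response is shortened to three values, and treating a missing status_code as success is an assumption made for the example.

```python
def extract_embedding(response):
    """Return the single embedding vector from a CLIP endpoint response dict."""
    # Assumption: a response without a status_code field is treated as successful.
    status = response.get("status_code", 200)
    if status != 200:
        raise RuntimeError(f"endpoint returned status_code {status}")
    data = response["data"]
    # One input per request is supported, so exactly one embedding is expected.
    if len(data) != 1:
        raise ValueError(f"expected exactly one embedding, got {len(data)}")
    return [float(v) for v in data[0]]

# Shortened stand-in for a real response; real embeddings are much longer.
response = {"data": [[-0.0241, -0.0592, -0.0477]], "status_code": 200}
embedding = extract_embedding(response)
print(len(embedding))
```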
Hyperparameters and settings
The hyperparameters and settings for CLIP are described below.
Parameter | Definition | Allowed values | User Adjustable |
---|---|---|---|
 | Number of training samples/data points processed together. Currently, only a single image or text string can be submitted per request. | 1 | No |
Deploy a CLIP model
Follow the steps below to deploy an endpoint using CLIP.
- Create a new project or use an existing one.
- From a project window, click New endpoint. The Add an endpoint box will open.
- Select the following settings to create the endpoint:
  - Select CLIP from the ML App drop-down.
  - Select a CLIP model, such as OpenCLIP CLIP-ViT-B-32 Backbone, from the Select model drop-down.
- Click Add an endpoint to deploy the endpoint.
  Figure 1. Add an endpoint box
- The Endpoint created confirmation will display.
- Click View endpoint to open the endpoint details window.
  Figure 2. Endpoint confirmation
- After a few minutes, the status will change to Live when the endpoint is ready.
  Figure 3. CLIP endpoint window
Please refer to the Usage section for instructions on how to interact with the endpoint.
Usage
Once the endpoint has been created in SambaStudio, the following format(s) can be used to interact with the endpoint.
CLIP takes either an image or a text string as input and generates the corresponding embedding as output. Depending on the modality of the input, you will need to use a slightly different URL path in the curl command.
Use an image as the input
The example below demonstrates how to pass an image as an input.
Only the multipart/form-data format is supported. Passing an image in Base64 format as a JSON payload is not currently supported.
For an image input, be sure the path in your command includes /api/predict/file/.
curl -X POST \
-H 'key:<your_API_key>' \
--form 'predict_file=@"<PATH_TO_JPG_FILE>"' \
<domain-address>/api/predict/file/<project_key>/<API_key>
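For readers calling the endpoint from Python instead of curl, the sketch below builds an equivalent multipart/form-data request using only the standard library. It constructs the request without sending it. The function name, the example URL values, and the https scheme are assumptions for illustration; only the predict_file field name, the key header, and the /api/predict/file/ path come from the command above.

```python
import urllib.request
import uuid

def build_image_request(url, api_key, filename, image_bytes, content_type="image/jpeg"):
    """Build (but do not send) a multipart/form-data POST carrying one image."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="predict_file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + image_bytes + tail
    headers = {
        "key": api_key,  # same header as in the curl command (names are case-insensitive)
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# Hypothetical placeholder values; substitute your own domain, project key, and API key.
url = "https://example.com/api/predict/file/my-project-key/my-api-key"
fake_image = b"\xff\xd8\xff placeholder bytes, not a real JPEG"
request = build_image_request(url, "my-api-key", "cat.jpg", fake_image)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; the response body should then be JSON in the shape described in the Output section.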
Use text as the input
The template below demonstrates how to pass text as an input.
For a text input, be sure the path in your command includes /api/predict/nlp/.
curl -X POST \
-H 'key:<your_API_key>' \
-H 'Content-Type: application/json' \
--data '{"inputs":["Today is a nice day."]}' \
<domain-address>/api/predict/nlp/<project_key>/<API_key>
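The same text request can be sketched in Python with the standard library. Because only a single text string is supported per request, this example builds one request per string and does not send them. The function name, the example URL values, the second sample string, and the https scheme are assumptions for illustration; the inputs payload key, the headers, and the /api/predict/nlp/ path follow the curl template above.

```python
import json
import urllib.request

def build_text_request(url, api_key, text):
    """Build (but do not send) a JSON POST carrying a single text string."""
    payload = json.dumps({"inputs": [text]}).encode("utf-8")
    headers = {"key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")

# Hypothetical placeholder values; substitute your own domain, project key, and API key.
url = "https://example.com/api/predict/nlp/my-project-key/my-api-key"
texts = ["Today is a nice day.", "A photo of a cat."]

# One request per string, since only single inputs are supported.
requests_to_send = [build_text_request(url, "my-api-key", t) for t in texts]
for req in requests_to_send:
    print(req.get_method(), req.data.decode("utf-8"))
```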
Output
Whether using an image or text for the input, the resulting output will be the embedding for the input, as shown below.
{"data": [[0.007865619845688343, 0.1614878624677658, -0.057749394327402115, -0.023593805730342865, -0.027492167428135872, ... 0.00047873161383904517, -0.038996435701847076, -0.002148156752809882, 0.03880875185132027, 0.04024270176887512]]}