LLaVA-v1.5-7b

LLaVA-v1.5-7b (Large Language and Vision Assistant) is a multimodal LLM for general-purpose image and language understanding. It processes an input image along with a task or question relevant to that image, and generates an appropriate response. With effective instruction tuning, LLaVA shows impressive multimodal chat capabilities and has established state-of-the-art accuracy on 11 multimodal benchmarks since its release in September 2023. This model is the 7B variant within the LLaVA family.

See the model’s website, original paper, and external model card for more information.

Input

The input consists of two parts.

  1. The prompt to be completed by the model. The example prompt below also includes the system prompt, user prompt, USER/ASSISTANT tags, and the <image>\\n tag.

    Example prompt
    A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\\nDescribe this image. ASSISTANT:
  2. An image encoded as a base64 string, in JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics) format. The Encode the image to base64 section describes how to encode the image file into the base64 format.

    Example basketball source image
    Figure 1. Example image
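The base64 encoding step in item 2 above can be sketched in Python. This is a minimal sketch; `encode_image` is a hypothetical helper name, not part of the platform API.

```python
import base64


def encode_image(path: str) -> str:
    # Read the image file as raw bytes and return a base64-encoded string,
    # suitable for the "image_content" field of the request body.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```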

Output

The output is a concatenation of the prompt (input) string and the generated completion.

Example output
{"status":{"complete":true,"exitCode":0,"elapsedTime":6.571182489395142,"message":"","progress":1,"progressMessage":"","reason":""},"predictions":[{"completion":"The image features two men playing basketball on a court, with one of them holding a basketball in his hand. They are both actively engaged in the game, with one player positioned closer to the left side of the court and the other player on the right side.\n\nThere are several palm trees in the background, adding a tropical touch to the scene. The basketball court is surrounded by a few benches, with one located near the left side of the court and another on the right side. The presence of these benches suggests that the court is likely a part of a larger sports facility or recreational area.","logprobs":{"text_offset":[],"top_logprobs":[]},"prompt":"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the humans questions. USER: <image>\\nDescribe this image. ASSISTANT:","stop_reason":"end_of_text","tokens":["The","image","features","two","men","playing","basketball","on","a","court",",","with","one","of","them","holding","a","basketball","in","his","hand",".","They","are","both","act","ively","engaged","in","the","game",",","with","one","player","position","ed","closer","to","the","left","side","of","the","court","and","the","other","player","on","the","right","side",".","\n","\n","There","are","several","pal","m","trees","in","the","background",",","adding","a","tropical","touch","to","the","scene",".","The","basketball","court","is","surrounded","by","a","few","ben","ches",",","with","one","located","near","the","left","side","of","the","court","and","another","on","the","right","side",".","The","presence","of","these","ben","ches","suggests","that","the","court","is","likely","a","part","of","a","larger","sports","facility","or","recre","ational","area","."],"total_tokens_count":747}]}
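A minimal Python sketch of pulling the generated text out of a return like the one above. The field names (`status`, `complete`, `predictions`, `completion`, `stop_reason`) are taken from the example; the trimmed JSON here is illustrative only.

```python
import json

# Trimmed version of the JSON return above, keeping only the fields we read.
raw = ('{"status": {"complete": true}, '
       '"predictions": [{"completion": "Two men playing basketball on a court.", '
       '"stop_reason": "end_of_text"}]}')

resp = json.loads(raw)
if resp["status"]["complete"]:
    # The generated text lives in predictions[0]["completion"].
    completion = resp["predictions"][0]["completion"]
    print(completion)
```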

Hyperparameters and settings

The hyperparameters and settings for LLaVA are described below.

Each parameter below is listed with its definition, its allowed values, and whether it is user adjustable.

do_sample

Toggles whether to use sampling; if disabled, greedy decoding is used. When enabled, the platform randomly picks the next token according to its conditional probability distribution, so generation is not deterministic. Set this to false if you need deterministic, repeatable results; the model is then less likely to generate unexpected or unusual words. Setting it to true gives the model a better chance of generating a varied, high-quality response, but it can lead to more hallucinations and non-determinism, which may be undesirable in a production pipeline.

true, false
Default: false

Yes

model_parallel_rdus

The number of RDUs to use during model-parallel inference.

1

Yes

This parameter is adjustable by the user when starting an endpoint, but is not adjustable after the endpoint is created.

Only one option is currently supported.

max_seq_length

Sequence length to pad or truncate the dataset.

4096

No

max_tokens_to_generate

The maximum number of tokens to generate, not counting the tokens in the prompt. When using max_tokens_to_generate, make sure the total of prompt tokens plus max_tokens_to_generate does not exceed the supported sequence length for this model. You can use this parameter to limit the response to a certain number of tokens. Generation will stop under the following conditions:

  • The model generates the <|endoftext|> token.

  • The generation encounters a stop sequence set up in the parameters.

  • The generation reaches the limit for max tokens to generate.

This should not exceed max_seq_length.

1 < int < 4096
Default: 100

Yes

model_parameter_count

The parameter count of the model. Larger models tend to have better accuracy, but run slower.

7b
Default: 7b

No

temperature

The value used to modulate the next token probabilities. As the value decreases, the model becomes more deterministic and repetitive. With a temperature between 0 and 1, the randomness and creativity of the model’s predictions can be controlled. A temperature parameter close to 1 means that the logits are passed through the softmax function without modification. If the temperature is close to 0, the highest probable tokens will become very likely compared to the other tokens: the model becomes more deterministic and will always output the same set of tokens after a given sequence of words.

0 < float < 1.0
Default: 1.0

Yes

top_k

The number of highest probability vocabulary tokens to keep for top k filtering. Top k means allowing the model to choose randomly among the top k tokens by their respective probabilities. For example, choosing the top three tokens means setting the top k parameter to a value of 3. Changing the top k parameter sets the size of the shortlist the model samples from as it outputs each token. Setting top k to 1 gives us greedy decoding.

1 < int < 1000
Default: 50

Yes

top_logprobs

Returns the top <number> (the numerical value entered) most probable tokens at each generation step, along with their probabilities. This indicates how likely each token was to be generated next, which helps debug a given generation and shows alternatives to the generated token. The generated token is highlighted, and the list is sorted by probability from high to low until the top <number> is reached. When tuning other parameters, you can use this feature to analyze how the model's predicted tokens change.

0 < int < 20
Default: 0

Yes

top_p

Top p sampling, sometimes called nucleus sampling, is a technique used to sample possible outcomes of the model. It controls diversity via nucleus sampling, as well as the randomness and originality of the model's output. The top p parameter specifies a sampling threshold at inference time. If set to less than 1, only the smallest set of most probable tokens whose probabilities add up to top p or higher is kept for generation.

0 < float < 1.0
Default: 1.0

Yes

run_mode

The mode to run the model. high_throughput runs with lower compute precision, which gives faster inference speed at a potential minor degradation in accuracy. high_precision runs with high compute precision, which gives the most accurate result at slower speed. balanced runs with a balanced compute mode, trading off between speed and precision.

high_throughput

Yes

This parameter is adjustable by the user when starting an endpoint, but is not adjustable after the endpoint is created.

Only high_throughput is currently supported.

vocab_size

Maximum size of the vocabulary.

32000

No
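How the sampling parameters above (do_sample, temperature, top_k, top_p) interact can be illustrated with a simplified next-token sampler. This is an assumed, illustrative implementation, not the platform's actual decoder; `sample_next_token` is a hypothetical function name.

```python
import math
import random


def sample_next_token(logits, do_sample=True, temperature=1.0, top_k=50, top_p=1.0):
    # logits: dict mapping token -> raw logit score.
    if not do_sample:
        # Greedy decoding: always take the single most probable token.
        return max(logits, key=logits.get)
    # temperature scales the logits before the softmax; values near 0 sharpen
    # the distribution, a value of 1 leaves it unchanged.
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exp.values())
    probs = {t: p / z for t, p in exp.items()}
    # top_k keeps only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top_p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]
```

Note that top_k=1, top_p close to 0, or do_sample=false all collapse to greedy behavior, which matches the parameter descriptions above.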

Deploy a LLaVA-v1.5-7b endpoint

Follow the steps below to deploy an endpoint using LLaVA-v1.5-7b.

  1. Create a new project or use an existing one.

  2. From a project window, click New endpoint. The Add an endpoint box will open.

  3. Select the following settings to create the endpoint:

    1. Select Llava v1.5 from the ML App drop-down.

    2. Select a LLaVA model, such as llava-v1.5-7b, from the Select model drop-down.

  4. Click Add an endpoint to deploy the endpoint.

    Create endpoint
    Figure 2. Add an endpoint box
  5. The Endpoint created confirmation will display.

  6. Click View endpoint to open the endpoint details window.

    View endpoint
    Figure 3. Endpoint confirmation
  7. After a few minutes, when the endpoint is ready, the status will change to Live.

    Endpoint window
    Figure 4. LLaVA endpoint window

The Usage section describes how to interact with the endpoint.

Usage

Once the endpoint has been created in SambaStudio, you can interact with it as described below. The example curl request template below demonstrates the basic curl format. The following attributes are specified in the example curl request template:

  • <endpoint key> is the endpoint API key. The endpoint API Key can be viewed in the Endpoint window.

  • <System Prompt> provides unique instruction messages used to steer the behavior of models and their resulting outputs.

    • The system prompt used in our example prompt: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human’s questions.

  • <User Prompt> is the input query that you want the model to respond to or answer.

  • <bytes> is the content of the image, serialized into bytes and encoded as a base64 string. You can use the following bash command to get the encoded string.

    • Bash command: IMG_DATA=$(cat "$IMAGE_PATH" | base64)

  • <endpoint URL> describes the path to the endpoint. The endpoint URL can be viewed in the Endpoint window.

To get the best accuracy from the model, please follow the example curl request template described below, as this matches how the model was trained. The prompt must contain USER:, <image>\\n, and ASSISTANT:.

Example curl request template
curl -X POST -H 'Content-Type: application/json' -H 'key: <endpoint key>' --data '{"instances":[{"prompt":"<System Prompt>. USER: <image>\\n<User Prompt>. ASSISTANT:","image_content": "<bytes>"}],"params":{"do_sample":{"type":"bool","value":"false"},"max_tokens_to_generate":{"type":"int","value":"100"},"repetition_penalty":{"type":"float","value":"1"},"stop_sequences":{"type":"str","value":""},"temperature":{"type":"float","value":"1"},"top_k":{"type":"int","value":"50"},"top_logprobs":{"type":"int","value":"0"},"top_p":{"type":"float","value":"1"}}}' '<endpoint URL>'
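The same request body can be assembled in Python before posting it with any HTTP client. This is a sketch covering a subset of the template's parameters; the prompt text, image string, and parameter values mirror the template and are placeholders to substitute with your own.

```python
import json

# Placeholder values; replace with your own system prompt, user prompt, and
# base64-encoded image string.
system_prompt = ("A chat between a curious human and an artificial intelligence "
                 "assistant. The assistant gives helpful, detailed, and polite "
                 "answers to the human's questions.")
user_prompt = "Describe this image."
image_b64 = "<bytes>"

payload = {
    "instances": [{
        # The literal backslash-n after <image> is required, as in the template.
        "prompt": f"{system_prompt} USER: <image>\\n{user_prompt} ASSISTANT:",
        "image_content": image_b64,
    }],
    "params": {
        "do_sample": {"type": "bool", "value": "false"},
        "max_tokens_to_generate": {"type": "int", "value": "100"},
        "temperature": {"type": "float", "value": "1"},
        "top_k": {"type": "int", "value": "50"},
        "top_logprobs": {"type": "int", "value": "0"},
        "top_p": {"type": "float", "value": "1"},
    },
}
body = json.dumps(payload)  # POST this body to <endpoint URL> with the key header set
```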

Encode the image to base64

The image input must first be encoded as a base64 string before it can be passed as input to the API. Since there are several ways to accomplish this, we have provided an example of how you can manage the construction of your entire input request via an optional bash script.

  1. Create a bash script to convert an image file into a base64 string, and to specify the input prompt and hyperparameter settings.

    On your local computer, create a file named create-llava-input-request.sh with the content below to write the entire input request to a text file.

    • IMG_DATA=$(cat "$1" | base64) encodes the input image into the base64 format.

    • '{"instances" …. "top_p":{"type":"float","value":"1"}}}' is the request body as described in the example request template. All values in this section can be modified based on preference.

      Bash script template
      IMG_DATA=$(cat "$1" | base64)
      echo '{"instances":[{"prompt":"<System-Prompt>. USER: <image>\\n<User-Prompt>. ASSISTANT:","image_content": "'"$IMG_DATA"'"}], "params":{"do_sample":{"type":"bool","value":"false"},"max_tokens_to_generate":{"type":"int","value":"1000"},"temperature":{"type":"float","value":"1"},"top_k":{"type":"int","value":"50"},"top_logprobs":{"type":"int","value":"0"},"top_p":{"type":"float","value":"1"}}}' > ./<Output-File-Name>
      Completed bash example script
      IMG_DATA=$(cat "$1" | base64)
      echo '{"instances":[{"prompt":"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human''s questions. USER: <image>\\nDescribe this image. ASSISTANT:","image_content": "'"$IMG_DATA"'"}], "params":{"do_sample":{"type":"bool","value":"false"},"max_tokens_to_generate":{"type":"int","value":"1000"},"temperature":{"type":"float","value":"1"},"top_k":{"type":"int","value":"50"},"top_logprobs":{"type":"int","value":"0"},"top_p":{"type":"float","value":"1"}}}' > ./input.txt
  2. Run the script to generate the input request text file capturing the entire input request as a JSON object.

    1. From your local terminal, run the following command, passing in the image file in either JPEG or PNG format.

      bash create-llava-input-request.sh <input-image-file-name>
    2. Upon completion, you will see the generated input request file for LLaVA at your specified location. When viewing it, you will see the image data ($IMG_DATA) has been encoded into a long string of characters as part of the input request.

  3. Submit the curl command with the input request to get LLaVA's response.

    • <endpoint key> is the endpoint API key. The endpoint API Key can be viewed in the Endpoint window.

    • <endpoint URL> describes the path to the endpoint. The endpoint URL can be viewed in the Endpoint window.

    • --data @input.txt passes the request body defined in the text file (.txt) as input data when making the API call.

      Example curl command
      curl -X POST -H 'Content-Type: application/json' -H 'key: <endpoint key>' --data @input.txt '<endpoint URL>'
      Example JSON API return
      {"status":{"complete":true,"exitCode":0,"elapsedTime":6.571182489395142,"message":"","progress":1,"progressMessage":"","reason":""},"predictions":[{"completion":"The image features two men playing basketball on a court, with one of them holding a basketball in his hand. They are both actively engaged in the game, with one player positioned closer to the left side of the court and the other player on the right side.\n\nThere are several palm trees in the background, adding a tropical touch to the scene. The basketball court is surrounded by a few benches, with one located near the left side of the court and another on the right side. The presence of these benches suggests that the court is likely a part of a larger sports facility or recreational area.","logprobs":{"text_offset":[],"top_logprobs":[]},"prompt":"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the humans questions. USER: <image>\\nDescribe this image. ASSISTANT:","stop_reason":"end_of_text","tokens":["The","image","features","two","men","playing","basketball","on","a","court",",","with","one","of","them","holding","a","basketball","in","his","hand",".","They","are","both","act","ively","engaged","in","the","game",",","with","one","player","position","ed","closer","to","the","left","side","of","the","court","and","the","other","player","on","the","right","side",".","\n","\n","There","are","several","pal","m","trees","in","the","background",",","adding","a","tropical","touch","to","the","scene",".","The","basketball","court","is","surrounded","by","a","few","ben","ches",",","with","one","located","near","the","left","side","of","the","court","and","another","on","the","right","side",".","The","presence","of","these","ben","ches","suggests","that","the","court","is","likely","a","part","of","a","larger","sports","facility","or","recre","ational","area","."],"total_tokens_count":747}]}