LLaVA-v1.5-7b

LLaVA-v1.5-7b (Large Language and Vision Assistant) is a multimodal LLM for general-purpose image and language understanding. It accepts an input image together with a task or question relevant to that image, and generates an appropriate response. With effective instruction tuning, LLaVA shows strong multimodal chat capabilities. This model is the 7B-parameter variant within the LLaVA family.

See the model’s website, original paper, and external model card for more information.

Input

The input consists of two parts.

  1. The prompt to be completed by the model, as shown in the example below.

    Describe this image.
  2. A base64-encoded image in JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics) format. See Encode the image to base64 for instructions on encoding the image file into base64 format.

    The REST endpoint only supports images up to approximately 1000 x 1000 pixels; images above this resolution may produce errors. On the server side, all images are downsized to 336 x 336 pixels, the standard input resolution for LLaVA 1.5 7b. A minimal client-side pre-check is sketched after the example image below.

    Figure 1. Example basketball source image
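
The following is a minimal sketch of this client-side preparation, assuming the Pillow package is available for the optional resolution check. The file name is hypothetical, and the server-side downsizing to 336 x 336 happens automatically and is not shown here.

Example image pre-check and base64 encoding (sketch)
import base64
from PIL import Image  # assumes the Pillow package is installed

def encode_image(path, max_side=1000):
    """Warn if the image exceeds the suggested resolution, then base64-encode it."""
    with Image.open(path) as img:
        width, height = img.size
        if max(width, height) > max_side:
            print(f"Warning: {width}x{height} exceeds ~{max_side}x{max_side}; the endpoint may reject it.")
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

encoded_image = encode_image("basketball.jpg")  # hypothetical local image file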

Behind the scenes, we substitute the provided prompt and the provided image into a template with a predefined system prompt. As shown in the examples below, the encoded image information is substituted for <image> and the textual prompt provided is substituted for <prompt>.

The example system prompt we use is:

`A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.`

You can choose between two templates when deploying an endpoint: llava-hf or llava-sn.

llava-hf

The llava-hf template is the template defined on Hugging Face. It is required to match the expected accuracy of the provided off-the-shelf checkpoint.

Example llava-hf template
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>
<prompt> ASSISTANT:
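
As a minimal sketch of the substitution described above (the function and variable names are illustrative, not the platform's internal code), the final llava-hf prompt string can be assembled as follows:

Example llava-hf prompt assembly (sketch)
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

def build_llava_hf_prompt(user_prompt):
    # <image> remains a placeholder; the base64-encoded image is sent separately in the request body.
    return f"{SYSTEM_PROMPT} USER: <image>\n{user_prompt} ASSISTANT:"

print(build_llava_hf_prompt("Describe this image."))
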
llava-sn

The llava-sn template is the prompt format we use for the fine-tuning workflow. It is required to match inference accuracy for any checkpoints trained through SambaStudio. The template's markdown-like structure lends itself more naturally to complex VQA tasks such as document or chart understanding.

Example llava-sn template
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: <prompt>
<image>

###

In the llava-sn template, <image> actually consists of three special tokens:

  • <im_start>

  • <im_patch>

  • <im_end>

When using SambaStudio, you will not need to worry about these special tokens. However, if you are running a comparison outside of SambaStudio, these special tokens will need to be included for consistency.

See the LLaVA-Med GitHub repository for more information: https://github.com/microsoft/LLaVA-Med
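
As an illustration of this expansion (the patch count is an assumption based on a 336 x 336 input with 14-pixel patches, i.e. a 24 x 24 grid of 576 patches, and is not specified in this document), <image> can be expanded roughly as follows:

Example <image> expansion (sketch)
IM_START, IM_PATCH, IM_END = "<im_start>", "<im_patch>", "<im_end>"
NUM_IMAGE_PATCHES = 576  # assumption: 336 x 336 image with 14-pixel patches gives a 24 x 24 grid

def expand_image_token(prompt):
    # Replace the single <image> placeholder with the three special tokens,
    # repeating <im_patch> once per image patch.
    return prompt.replace("<image>", IM_START + IM_PATCH * NUM_IMAGE_PATCHES + IM_END)

expanded = expand_image_token("### Human: Describe this image.\n<image>\n\n###")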

Output

The output contains the fully templated prompt (input) string and the generated completion, as shown in the example below.

Example output
{
  "status":{
    "complete":true,
    "exitCode":0,
    "elapsedTime":6.571182489395142,
    "message":"",
    "progress":1,
    "progressMessage":"",
    "reason":""
  },
  "predictions":[
    {
      "completion":"The image features two men playing basketball on a court, with one of them holding a basketball in his hand. They are both actively engaged in the game, with one player positioned closer to the left side of the court and the other player on the right side.\n\nThere are several palm trees in the background, adding a tropical touch to the scene. The basketball court is surrounded by a few benches, with one located near the left side of the court and another on the right side. The presence of these benches suggests that the court is likely a part of a larger sports facility or recreational area.",
      "logprobs":{
        "text_offset":[

        ],
        "top_logprobs":[

        ]
      },
      "prompt":"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the humans questions. USER: <image>\\nDescribe this image. ASSISTANT:",
      "stop_reason":"end_of_text",
      "tokens":[ "The", "image", "features", "two", "men", "playing", "basketball", "on", "a", "court", ",", "with", "one", "of", "them", "holding", "a", "basketball", "in", "his", "hand", ".", "They", "are", "both", "act", "ively", "engaged", "in", "the", "game", ",", "with", "one", "player", "position", "ed", "closer", "to", "the", "left", "side", "of", "the", "court", "and", "the", "other", "player", "on", "the", "right", "side", ".", "\n", "\n", "There", "are", "several", "pal", "m", "trees", "in", "the", "background", ",", "adding", "a", "tropical", "touch", "to", "the", "scene", ".", "The", "basketball", "court", "is", "surrounded", "by", "a", "few", "ben", "ches", ",", "with", "one", "located", "near", "the", "left", "side", "of", "the", "court", "and", "another", "on", "the", "right", "side", ".", "The", "presence", "of", "these", "ben", "ches", "suggests", "that", "the", "court", "is", "likely", "a", "part", "of", "a", "larger", "sports", "facility", "or", "recre", "ational", "area", "."
      ],
      "total_tokens_count":747
    }
  ]
}
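
For reference, the sketch below shows how the completion and related fields can be pulled out of this response in Python; the response string is a truncated stand-in for the full JSON shown above.

Example response parsing (sketch)
import json

# Truncated stand-in for the JSON string returned by the endpoint.
raw_response = '{"predictions": [{"completion": "The image features two men playing basketball...", "stop_reason": "end_of_text", "total_tokens_count": 747}]}'

prediction = json.loads(raw_response)["predictions"][0]
print(prediction["completion"])           # generated text
print(prediction["stop_reason"])          # e.g. "end_of_text"
print(prediction["total_tokens_count"])   # total token count reported by the endpoint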

Hyperparameters and settings

The hyperparameters and settings for LLaVA are described below.

Each parameter is listed below with its definition, allowed values, and whether it is user adjustable.

do_sample

Definition: Toggles whether sampling is used. When disabled, greedy decoding is used and generation is deterministic. When enabled, the platform randomly picks the next token according to its conditional probability distribution, so generation is no longer deterministic. Set this to false if you need deterministic results and want to reduce the chance of unexpected or unusual words; set it to true to give the model a better chance of producing a varied, higher-quality response, at the cost of non-determinism and a higher risk of hallucination, which may be undesirable in a production pipeline.

Allowed values: true, false. Default: false.

User adjustable: Yes.

model_parallel_rdus

Definition: The number of RDUs to run on during model-parallel inference.

Allowed values: 1. Only one option is currently supported.

User adjustable: Yes. This parameter can be set when creating an endpoint, but cannot be changed after the endpoint is created.

max_seq_length

Definition: Sequence length to which the dataset is padded or truncated.

Allowed values: 4096.

User adjustable: No.

max_tokens_to_generate

Definition: The maximum number of tokens to generate, not counting the tokens in the prompt. Make sure that the total number of prompt tokens plus max_tokens_to_generate does not exceed the supported sequence length for this model. You can use this parameter to limit the response to a certain number of tokens. Generation stops under any of the following conditions:

  • The model generates the <|endoftext|> token.

  • The generation encounters a stop sequence set in the parameters.

  • The generation reaches the limit set by max_tokens_to_generate.

This value should not exceed max_seq_length.

Allowed values: 1 < int < 4096. Default: 100.

User adjustable: Yes.

model_parameter_count

Definition: The parameter count of the model. Larger models tend to be more accurate, but run slower.

Allowed values: 7b. Default: 7b.

User adjustable: No.

temperature

Definition: The value used to modulate the next-token probabilities. Lower values make the model more deterministic and repetitive; a temperature between 0 and 1 controls the randomness and creativity of the model's predictions. A temperature close to 1 passes the logits through the softmax function essentially unmodified, while a temperature close to 0 makes the most probable tokens far more likely than the others, so the model becomes more deterministic and tends to output the same tokens after a given sequence of words.

Allowed values: 0 < float < 1.0. Default: 1.0.

User adjustable: Yes.

top_k

Definition: The number of highest-probability vocabulary tokens kept for top-k filtering. Top-k sampling lets the model choose randomly among the k most probable tokens; for example, setting top_k to 3 restricts sampling to the top three tokens. Changing top_k therefore sets the size of the shortlist the model samples from as it outputs each token. Setting top_k to 1 is equivalent to greedy decoding.

Allowed values: 1 < int < 1000. Default: 50.

User adjustable: Yes.

top_logprobs

Definition: Returns the top <number> tokens (where <number> is the value entered) ranked by their probability of being generated, indicating how likely each alternative was to be generated next. This is useful for debugging a generation and inspecting alternatives to the generated token: the predicted token is shown alongside a list sorted from highest to lowest probability, up to the requested number. You can also use this feature to analyze how the model's predicted tokens change as you tune other parameters.

Allowed values: 0 < int < 20. Default: 0.

User adjustable: Yes.

top_p

Definition: Top-p sampling, sometimes called nucleus sampling, controls diversity and randomness by restricting sampling to a nucleus of tokens. The top_p parameter sets a probability threshold at inference time: when top_p is less than 1, only the smallest set of most probable tokens whose cumulative probability reaches top_p is kept for generation.

Allowed values: 0 < float < 1.0. Default: 1.0.

User adjustable: Yes.

run_mode

Definition: The mode used to run the model. high_throughput runs with lower compute precision, which gives faster inference speed at a potential minor degradation in accuracy. high_precision runs with high compute precision, which gives the most accurate results at slower speed. balanced runs with a balanced compute mode that balances the trade-off between speed and precision.

Allowed values: high_throughput. Only high_throughput is currently supported.

User adjustable: Yes. This parameter can be set when creating an endpoint, but cannot be changed after the endpoint is created.

vocab_size

Definition: Maximum size of the vocabulary.

Allowed values: 32000.

User adjustable: No.

preprocessing

Definition: The preprocessing scheme (prompt template) to use. Options are llava-hf and llava-sn.

llava-hf is the preprocessing scheme used by Hugging Face and is structured as shown below.

<SYSTEM PROMPT>

USER:<image>
<USER PROMPT>

ASSISTANT:

llava-sn is the preprocessing scheme used by SambaStudio LLaVA training (as well as the llava-med project) and is structured as shown below.

<SYSTEM PROMPT>

### Human: <USER PROMPT>
<image>

###

Allowed values: llava-hf, llava-sn.

User adjustable: Yes.
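
To make the sampling parameters above more concrete, the sketch below illustrates how do_sample, temperature, top_k, and top_p interact when choosing the next token from a vector of logits. This is an illustrative NumPy implementation, not the platform's actual decoding code.

Example sampling parameter interaction (sketch)
import numpy as np

def next_token(logits, do_sample=False, temperature=1.0, top_k=50, top_p=1.0, rng=None):
    """Illustrative next-token selection; not the platform's actual implementation."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float)
    if not do_sample:
        return int(np.argmax(z))                      # greedy decoding
    z = z / temperature
    z -= z.max()                                      # numerical stability
    probs = np.exp(z)
    probs /= probs.sum()                              # softmax with temperature
    order = np.argsort(probs)[::-1][:top_k]           # keep the top_k most probable tokens
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]                          # smallest set whose cumulative probability reaches top_p
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(next_token(logits, do_sample=True, temperature=0.7, top_k=3, top_p=0.9))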

Deploy a LLaVA-v1.5-7b endpoint

Follow the steps below to deploy an endpoint using LLaVA-v1.5-7b.

See Create a non-CoE endpoint using the GUI for detailed information about creating and deploying endpoints.

  1. Create a new project or use an existing one.

  2. From a project window, click New endpoint. The Add an endpoint box will open.

  3. Select the following settings to create the endpoint:

    1. Select Llava v1.5 from the ML App drop-down.

    2. Select a LLaVA model, such as llava-v1.5-7b, from the Select model drop-down.

  4. Click Add an endpoint to deploy the endpoint.

  5. The Endpoint created confirmation will display.

  6. Click View endpoint to open the endpoint details window.

  7. After a few minutes, the status will change to Live, indicating that the endpoint is ready.

Usage

Once the endpoint has been created in SambaStudio, you can interact with it as described below.

Python example

import argparse
import requests
import json
import base64

data = {
  "instances": [
    {
      "prompt": "",
      "image_content": ""
    }
  ],
  "params": {
    "do_sample": {
      "type": "bool",
      "value": "false"
    },
    "max_tokens_to_generate": {
      "type": "int",
      "value": "1000"
    },
    "temperature": {
      "type": "float",
      "value": "1"
    },
    "top_k": {
      "type": "int",
      "value": "50"
    },
    "top_logprobs": {
      "type": "int",
      "value": "0"
    },
    "preprocessing":{
       "type":"str",
       "value": "llava-hf" # will need to be "llava-sn" for a pretraining checkpont
    },
    "top_p": {
      "type": "float",
      "value": "1"
    }
  }
}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--image_path', type=str, required=True, help="Path to the image file to be encoded in base64")
    parser.add_argument('--prompt', type=str, required=True, help="Textual prompt")

    args = parser.parse_args()

    url = "<URL Endpoint>"
    headers = {
        'Content-Type': 'application/json',
        'key': '<endpoint key>'
    }

    # Read the image and encode it in base64
    with open(args.image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

    # Populate the 'image_content' field with the base64 encoded image
    data['instances'][0]['image_content'] = encoded_image
    data['instances'][0]['prompt'] = args.prompt

    payload = json.dumps(data, indent=4)

    response = requests.post(url, headers=headers, data=payload)
    if response.status_code == 200:
        content = json.loads(response.content)
        print(content['predictions'][0]['completion'])
    else:
        print(response.content)


if __name__ == "__main__":
    main()

Example command
python llava.py --image_path /path/to/image.jpg --prompt "Describe this image."

Curl example

The example curl request template below demonstrates the basic curl format. The following attributes are specified in the example curl request template:

  • <endpoint URL> describes the path to the endpoint. The endpoint URL can be viewed in the Endpoint window.

  • <endpoint key> is the endpoint API key. The endpoint API Key can be viewed in the Endpoint window.

  • <prompt> is your input query that you want the model to respond to or answer.

  • <image_content> is the content of the image, serialized into bytes, and encoded in base64 format into a string. You can use the following bash command to get the encoded string.

    • Bash command: IMG_DATA=$(cat "$IMAGE_PATH" | base64)

      On Linux, base64 wraps its output at 76 characters by default; use base64 -w 0 (or otherwise remove the line breaks) so the encoded string is passed as a single line.

Example curl request template
curl -X POST "<endpoint URL>" \
-H "Content-Type: application/json" \
-H "key: <endpoint key>" \
-d '{
  "instances": [
    {
      "prompt": "'"$PROMPT"'",
      "image_content": "'"$(base64 < $IMAGE_PATH)"'"
    }
  ],
  "params": {
    "do_sample": {
      "type": "bool",
      "value": "false"
    },
    "max_tokens_to_generate": {
      "type": "int",
      "value": "1000"
    },
    "temperature": {
      "type": "float",
      "value": "1"
    },
    "top_k": {
      "type": "int",
      "value": "50"
    },
    "top_logprobs": {
      "type": "int",
      "value": "0"
    },
    "top_p": {
      "type": "float",
      "value": "1"
    }
  }
}'

Encode the image to base64

The image input must first be encoded into a base64 string before it can be passed as input to the API. As there are several ways to accomplish this, we have provided an optional bash script that constructs the entire input request for you.

  1. Create a bash script that converts an image file into a base64 string and specifies the input prompt and hyperparameter settings.

    On your local computer, create a file named create-llava-input-request.sh with the content below to write the complete input request to a text file.

    • IMG_DATA=$(cat "$1" | base64) encodes the input image into the base64 format.

    • '{"instances" …. "top_p":{"type":"float","value":"1"}}}' is the request body as described in the example request template. All values in this section can be modified based on preference.

      Bash script template
      IMG_DATA=$(cat "$1" | base64)
      echo '{"instances":[{"prompt":"<User-Prompt>.","image_content": "'"$IMG_DATA"'"}], "params":{"do_sample":{"type":"bool","value":"false"},"max_tokens_to_generate":{"type":"int","value":"1000"},"temperature":{"type":"float","value":"1"},"top_k":{"type":"int","value":"50"},"top_logprobs":{"type":"int","value":"0"},"top_p":{"type":"float","value":"1"}}}' > ./input.txt
  2. Run the script to generate the text file capturing the entire input request as a JSON object.

    1. From your local terminal, run the following command while passing in the image file in either JPG or PNG format.

      bash create-llava-input-request.sh <input-image-file-name>
    2. Upon completion, you will see the generated input request file for LLaVA at your specified location. When viewing it, you will see the image data ($IMG_DATA) has been encoded into a long string of characters as part of the input request.

  3. Submit the curl command with the input request to get LLaVA's response.

    • <endpoint key> is the endpoint API key. The endpoint API Key can be viewed in the Endpoint window.

    • <endpoint URL> describes the path to the endpoint. The endpoint URL can be viewed in the Endpoint window.

    • --data @./input.txt passes the request body defined in the text file (input.txt) as input data when making the API call.

      Example curl command
      curl -X POST -H 'Content-Type: application/json' -H 'key: <endpoint key>' --data @./input.txt <endpoint URL>
      Example JSON API return
      {"status":{"complete":true,"exitCode":0,"elapsedTime":6.571182489395142,"message":"","progress":1,"progressMessage":"","reason":""},"predictions":[{"completion":"The image features two men playing basketball on a court, with one of them holding a basketball in his hand. They are both actively engaged in the game, with one player positioned closer to the left side of the court and the other player on the right side.\n\nThere are several palm trees in the background, adding a tropical touch to the scene. The basketball court is surrounded by a few benches, with one located near the left side of the court and another on the right side. The presence of these benches suggests that the court is likely a part of a larger sports facility or recreational area.","logprobs":{"text_offset":[],"top_logprobs":[]},"prompt":"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the humans questions. USER: <image>\\nDescribe this image. ASSISTANT:","stop_reason":"end_of_text","tokens":["The","image","features","two","men","playing","basketball","on","a","court",",","with","one","of","them","holding","a","basketball","in","his","hand",".","They","are","both","act","ively","engaged","in","the","game",",","with","one","player","position","ed","closer","to","the","left","side","of","the","court","and","the","other","player","on","the","right","side",".","\n","\n","There","are","several","pal","m","trees","in","the","background",",","adding","a","tropical","touch","to","the","scene",".","The","basketball","court","is","surrounded","by","a","few","ben","ches",",","with","one","located","near","the","left","side","of","the","court","and","another","on","the","right","side",".","The","presence","of","these","ben","ches","suggests","that","the","court","is","likely","a","part","of","a","larger","sports","facility","or","recre","ational","area","."],"total_tokens_count":747}]}

Training

Training an LLM is the process of improving an existing LLM for a specific task or domain. You can improve LLaVA 1.5 by giving it a set of labeled examples for that task, which it can then learn from. The examples can come from public datasets on the internet or from private datasets unique to your organization.

Data prep

The directory structure of your dataset should match the example below.

Example directory structure
llava-dataset
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── image3.jpg
├── annotations_train.json
└── annotations_val.json

The annotation files themselves should be structured as a JSON list of records, as shown below (a single-record list is shown for brevity).

Example JSON annotation file
[
  {
    "id": "bdfd624a-c348-40f7-b986-b0569d295e97",
    "image": "monalisa.png",
    "conversations": [
      {"from": "human", "value": "Who painted this image?\n<image>"},
      {"from": "gpt", "value": "Leonardo Da Vinci"}
    ]
  }
]
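
As a minimal sketch (the image file name and conversation content are hypothetical), an annotation list in this format can be generated programmatically as shown below.

Example annotation file generation (sketch)
import json
import uuid

# Hypothetical single-record annotation list; image1.jpg would live under llava-dataset/images/.
records = [
    {
        "id": str(uuid.uuid4()),
        "image": "image1.jpg",
        "conversations": [
            {"from": "human", "value": "Describe this image.\n<image>"},
            {"from": "gpt", "value": "Two men playing basketball on an outdoor court."},
        ],
    }
]

with open("annotations_train.json", "w") as f:
    json.dump(records, f, indent=2)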

The dataset will be loaded into a template with the format shown below.

Example dataset template
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: <prompt>
<image>

### Assistant: {value}
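
For illustration only (this is not the SambaStudio training code), the sketch below shows roughly how a record from the annotation file maps onto this template.

Example template rendering (sketch)
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

def render_llava_sn(record):
    human_turn, assistant_turn = record["conversations"][0], record["conversations"][1]
    prompt = human_turn["value"].replace("<image>", "").strip()  # text portion of the human turn
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"### Human: {prompt}\n<image>\n\n"
        f"### Assistant: {assistant_turn['value']}"
    )

record = {
    "image": "monalisa.png",
    "conversations": [
        {"from": "human", "value": "Who painted this image?\n<image>"},
        {"from": "gpt", "value": "Leonardo Da Vinci"},
    ],
}
print(render_llava_sn(record))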

See the following document for additional examples: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K