BLOOMChat

BLOOMChat is a 176 billion parameter multilingual chat model. It is instruction-tuned from BLOOM (176B) on assistant-style conversation datasets and supports conversation, question answering, and generative answers in multiple languages.

Data preparation

The BLOOMChat training repo describes how to prepare data for training SambaNova’s BLOOMChat model. The repo includes the data preparation, tokenization, and training and inference code for BLOOMChat.

To access the data preparation package, including its associated documentation, visit the SambaNova public GitHub repository: https://github.com/sambanova/bloomchat

Figure 1. BLOOMChat training repo on SambaNova public GitHub

Training and validation requirements

The expected requirements for training and validation datasets are described below.

  • Training and validation data must be in HDF5 format, as produced by the generative data preparation step.

    Example data directory path
    DATA_DIR=../dataset_prep/oasst1_dolly_out/hdf5
  • Use tags to indicate which parts of the conversation are from the user and which are from the model, as demonstrated in the repo.

    • Use <human> to indicate user parts of the conversation.

    • Use <bot> to indicate model parts of the conversation.

      Example tags
      text = f"<human>: {item['instruction']}\n<bot>: {item['response']}"
  • The code to load and retrieve datasets is demonstrated in the repo; a minimal inspection sketch follows this list.
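As a quick sanity check, you can inspect a prepared HDF5 file with h5py. The file name and the input_ids dataset key below are assumptions; list the keys in your own generated files to confirm what they contain.

HDF5 inspection sketch
import h5py

# Hypothetical file name under the data preparation output directory;
# match it to the contents of your own DATA_DIR.
path = "../dataset_prep/oasst1_dolly_out/hdf5/train_0.h5"

with h5py.File(path, "r") as f:
    # List the datasets the file actually contains.
    print(list(f.keys()))
    # Assumption: tokenized examples are stored under "input_ids".
    if "input_ids" in f:
        print(f["input_ids"].shape)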

Batch inference requirements

The expected format for batch inference is a JSON file, or a directory of JSON files, as demonstrated in the repo.

Example
--data_dir /import/mlds-sc-speech/daas/datasets/toy_generative
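As a sketch, a directory of JSON prompt files might be written like this. The directory name and the "prompt" field name are assumptions for illustration; follow the schema demonstrated in the repo.

Batch inference data sketch
import json
from pathlib import Path

# Hypothetical output directory for the batch inference job.
data_dir = Path("toy_generative")
data_dir.mkdir(exist_ok=True)

prompts = [
    "<human>: how are you\n<bot>:",
    "<human>: List the top ten places to visit in Delhi\n<bot>:",
]
for i, prompt in enumerate(prompts):
    record = {"prompt": prompt}  # assumption: one prompt per record
    (data_dir / f"example_{i}.json").write_text(json.dumps(record))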

Prompt guidelines

Use <human> as the stop sequence for generation. Because BLOOMChat is trained to continue conversations, this helps avoid the model hallucinating new conversation turns.

For example, given the prompt <human>: how are you, the model will generate a whole continuation that includes both parties.

Continuation example
<bot>: I am good! How are you?
<human>: doing great! You like ice cream?
<bot>: You bet!

This is because the model does not stop generating until it either hits a stop token or reaches the maximum number of tokens it is allowed to generate. To ensure a meaningful response, truncate the generated text at the first occurrence of the <human> tag so that only the first <bot> response is returned. The example below demonstrates a simple string truncation to accomplish this.

Simple string truncation example
generated_output = "<bot>: I am good! How are you?\n<human>: doing great! You like ice cream?\n<bot>: You bet!"

# Keep only the text before the first <human> tag
true_response = generated_output.split("<human>")[0]
print(true_response)

#Output of true_response: <bot>: I am good! How are you?\n
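
The same idea extends to multiple stop sequences. The helper below is a minimal sketch; the function name truncate_at_stop is our own, not part of the BLOOMChat repo.

Multiple stop sequences sketch
def truncate_at_stop(text, stop_sequences=("<human>",)):
    """Return the text before the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

generated_output = "<bot>: I am good! How are you?\n<human>: doing great! You like ice cream?"
print(truncate_at_stop(generated_output))
# Output: <bot>: I am good! How are you?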

Prompts for chat

  • Prepend the <human>: tag to anything the user has said or wants to say.

  • Append the <bot>: tag, with no text following it, to indicate where the model should generate.

To have a conversation history, prepend the necessary tags to the conversation as shown in the example below.

Conversation history example
<human>: First question that was asked
<bot>: Model answer to the first question
<human>: Question that you want answered
<bot>:
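
A minimal Python sketch of this prompt construction follows; the helper name build_prompt is illustrative, not part of the BLOOMChat repo.

Prompt construction sketch
def build_prompt(history, new_question):
    """Format (question, answer) pairs plus a new question into a BLOOMChat prompt."""
    lines = []
    for question, answer in history:
        lines.append(f"<human>: {question}")
        lines.append(f"<bot>: {answer}")
    lines.append(f"<human>: {new_question}")
    lines.append("<bot>:")  # the empty tag marks where the model should generate
    return "\n".join(lines)

prompt = build_prompt(
    [("First question that was asked", "Model answer to the first question")],
    "Question that you want answered",
)
print(prompt)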

Example prompts

The examples below demonstrate prompts for BLOOMChat. Each example is identified by a task type.

Closed-book QA

<human>: What does it mean for the Fed to raise or lower interest rates?
<bot>: The Federal Reserve (also known as the Fed) is the central banking system of the United States. It was created in 1913, and is an independent agency of the federal government. The Fed has a dual mandate of maximizing employment and stabilizing prices. The Fed sets interest rates, and can influence the money supply and the credit markets through its actions.

Raising interest rates means that the Fed is trying to reduce the money supply in the economy, and is trying to discourage borrowing and encourage saving. Lowering interest rates means that the Fed is trying to stimulate economic growth by encouraging borrowing and reducing the cost of credit.

The Fed raises or lowers interest rates in order to influence the economy and achieve its goals for employment and price stability. Interest rates are one of the primary tools that the Fed uses to achieve its goals.

Brainstorming

<human>: Write a script in which Bob accidentally breaks his dad’s guitar
<bot>: Bob was in his room, practicing for his upcoming guitar solo, when he accidentally stepped on the guitar, causing it to fall and break. His dad, who was in the other room, heard the commotion and came running into Bob’s room, visibly angry. Bob tried to explain what had happened, but his dad wasn’t having any of it. He yelled at Bob, calling him careless and irresponsible, and told him he was grounded for a week. Bob was devastated, as this would mean missing his solo and disappointing his friends and family. He knew he was in trouble, but he couldn’t help but feel a little bit guilty too.

Tourism

<human>: List the top ten places to visit in Delhi
<bot>: 1. Qutub Minar
2. Red Fort
3. India Gate
4. Jama Masjid
5. Lotus Temple
6. Akshardham Temple
7. Humayun’s Tomb
8. Purana Quila
9. National Museum
10. Raj Ghat

Sentiment analysis

<human>: Classify the sentiment of the following sentence into Positive, Neutral, or Negative. Do it on a scale of 1/10: How about the following sentence: It is raining outside and I feel so blue
<bot>: The sentiment of the sentence "It is raining outside and I feel so blue" is negative, with a score of 9/10.

Information extraction

<human>: Answer the following question using the context below:
Q: Which regulatory body is involved? Context: U.S. authorities launched emergency measures on Sunday to shore up confidence in the banking system after the failure of Silicon Valley Bank (SIVB.O) threatened to trigger a broader financial crisis. After a dramatic weekend, regulators said the failed bank’s customers will have access to all their deposits starting Monday and set up a new facility to give banks access to emergency funds. The Federal Reserve also made it easier for banks to borrow from it in emergencies. While the measures provided some relief for Silicon Valley firms and global markets on Monday, worries about broader banking risks remain and have cast doubts over whether the Fed will stick with its plan for aggressive interest rate hikes.
<bot>: The relevant regulatory body in this context is the Federal Reserve.

QA (concept explanation)

<human>: Tell me something about organic photovoltaic
<bot>: Organic photovoltaics (OPVs) are a type of photovoltaic (PV) technology that use organic materials as the active layer in the device. Organic materials have the advantage of being lightweight, flexible, and low-cost compared to traditional inorganic semiconductors used in photovoltaics. The active layer in an OPV device absorbs photons from sunlight, creating electron-hole pairs that are separated by the device’s internal electric field, generating a flow of electrons that can be collected and used to power a device. The active layer in an OPV device is typically a blend of organic semiconducting materials and can be printed or sprayed onto a substrate using low-cost manufacturing techniques. This makes OPVs an attractive technology for applications where traditional photovoltaics are not suitable, such as on building or vehicle surfaces, or on flexible substrates.

Usage

BLOOMChat was trained with the <human>/<bot> templating in mind. Follow the prompt guidelines to get the most out of the model.

  • Use this checkpoint for chatbot-related tasks.

  • Use this checkpoint with few-shot prompting to dictate how you want the model to respond.

  • Use this checkpoint with context retrieval to reduce hallucinations.

Few-shot prompting

Few-shot describes a prompting technique that provides a model with examples to process before attempting a task. A few-shot prompt typically includes several examples of problem/solution pairs, known as shots, which provide demonstrations to the model. The demonstrations prime the model for subsequent generative responses.
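
For example, an illustrative few-shot prompt (our own example, not taken from the BLOOMChat repo) might prime the model for sentiment classification like this:

Few-shot prompt example
<human>: Classify the sentiment of this sentence as Positive or Negative: I love this restaurant
<bot>: Positive
<human>: Classify the sentiment of this sentence as Positive or Negative: The service was terrible
<bot>: Negative
<human>: Classify the sentiment of this sentence as Positive or Negative: The movie exceeded my expectations
<bot>: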

Playground tuning parameters

The Playground tuning parameters provide additional flexibility and options for generative tuning.

We recommend setting Do sampling to On when using BLOOMChat in the Playground to ensure varied results when chatting with the model.

Inference settings

The inference settings for human aligned (HA) models when creating a batch inference job are described below.

For each parameter, the definition and allowed values are given below.

do_sample

Toggles whether sampling is used. If disabled, greedy decoding is used. When enabled, the platform randomly picks the next word according to its conditional probability distribution, so generation is not deterministic. If you need deterministic results, set this to false; the model is then less likely to generate unexpected or unusual words. Setting it to true gives the model a better chance of generating a high-quality response, but can lead to more hallucinations and non-determinism, which may be undesirable in an industrial pipeline.

Allowed values: true, false. Setting to true is recommended; set to false to evaluate the model for benchmark tests.

max_tokens_to_generate

The maximum number of tokens to generate, not counting the tokens in the prompt. Make sure that the prompt tokens plus the requested max tokens to generate do not exceed the supported sequence length of the model. You can use this parameter to limit the response to a certain number of tokens. Generation stops under any of the following conditions:

  1. The model generates the <|endoftext|> token.

  2. The generation encounters a stop sequence set in the parameters.

  3. The generation reaches the limit for max tokens to generate.

Allowed values: 1 to the max_sequence_length of the model.
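
As a sketch, you can check this token budget client-side with a Hugging Face tokenizer. The use of the bigscience/bloom tokenizer and the sequence-length value below are assumptions; substitute the values for your checkpoint.

Token budget sketch
from transformers import AutoTokenizer

# Assumption: BLOOMChat shares the BLOOM tokenizer; substitute your checkpoint's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

MAX_SEQUENCE_LENGTH = 2048   # illustrative value; use your model's actual limit
max_tokens_to_generate = 256

prompt = "<human>: how are you\n<bot>:"
prompt_tokens = len(tokenizer(prompt)["input_ids"])
assert prompt_tokens + max_tokens_to_generate <= MAX_SEQUENCE_LENGTH, (
    "prompt + max_tokens_to_generate exceeds the model's sequence length"
)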

repetition_penalty

The repetition penalty, also known as frequency penalty, controls the model’s tendency to repeat predictions. It reduces the probability of words that have already been generated: the penalty depends on how many times a word has occurred in the prediction so far. This parameter can be used to penalize words that were previously generated or belong to the context, and it decreases the model’s likelihood of repeating the same line verbatim.

Allowed values: between 1 and 2. A value of 1 means no penalty.

stop_sequences

Stop sequences make the model stop generating text at a desired point, such as the end of a sentence or a list. This optional setting tells the API when to stop generating tokens. The completion will not contain the stop sequence. If nothing is passed, the setting defaults to the <|endoftext|> token, which represents a probable stopping point in the text.

Allowed values: any comma-separated strings; each stop word must be enclosed in double quotes. Using "<human>:" as a stop sequence is preferred.

temperature

The value used to modulate the next-token probabilities. As the value decreases, the model becomes more deterministic and repetitive. With a temperature between 0 and 1, you can control the randomness and creativity of the model’s predictions. A temperature close to 1 passes the logits through the softmax function essentially unmodified. A temperature close to 0 makes the most probable tokens far more likely than the others: the model becomes more deterministic and will always output the same set of tokens after a given sequence of words.

Allowed values: 0 < x ≤ 1.

top_k

The number of highest-probability vocabulary tokens to keep for top-k filtering. Top-k sampling lets the model choose randomly among the k most probable tokens, weighted by their respective probabilities. For example, to choose among the top three tokens, set top_k to 3. This parameter sets the size of the shortlist the model samples from as it outputs each token. Setting top_k to 1 gives greedy decoding.

Allowed values: 1 ≤ x ≤ vocab_size.

top_logprobs

Shows the top <number> (the numerical value entered) tokens by their probability of being generated next. This indicates how likely each candidate token was to be generated, which helps you debug a generation and see alternatives to the generated token. The highlighted token is the one the model predicted, with the list sorted by probability from high to low until the top <number> is reached. You can use this feature to analyze how the model’s predicted tokens change as you tune other parameters.

Allowed values: 0 ≤ x ≤ 20.

top_p

Top-p sampling, sometimes called nucleus sampling, is a technique used to sample possible outcomes of the model. It controls diversity via nucleus sampling, as well as the randomness and originality of the model. The top_p parameter specifies a sampling threshold at inference time: if set to less than 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher is kept for generation.

Allowed values: 0 < x ≤ 1.
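
To illustrate how these parameters interact, the following is a minimal sketch of temperature, top-k, and top-p sampling over a toy logits vector. It demonstrates the standard technique, not the platform’s actual implementation, and all names and values are our own.

Sampling parameters sketch
import numpy as np

def sample_next_token(logits, temperature=0.9, top_k=5, top_p=0.95, rng=None):
    """Toy illustration of temperature + top-k + top-p sampling."""
    if rng is None:
        rng = np.random.default_rng()

    # Temperature: lower values sharpen the distribution (more deterministic).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-k: keep only the k most probable tokens.
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]

    # Top-p: within the shortlist, keep the smallest prefix whose
    # cumulative probability reaches top_p.
    cum = np.cumsum(probs[keep])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = keep[:cutoff]

    # Renormalize over the shortlist and sample one token index.
    shortlist_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=shortlist_probs)

logits = np.array([2.0, 1.5, 0.3, 0.1, -1.0])
print(sample_next_token(logits))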