Playground

The Playground provides an in-platform experience for generating predictions using deployed generative tuning endpoints. You can choose between a chat mode and completion mode experience. A user preset option is available to populate the editor and quickly experience generative tuning. Alternatively, you can input text directly into the editor, without selecting a user preset.

Playground window
Figure 1. Playground interface

Requirements

A live endpoint is required to use the Playground. If no endpoint is available, the message below will display.

Try Now
Figure 2. No live endpoint exists

Access the Playground from the left menu

To access the Playground experience directly, click Playground from the left menu. The Playground window will open.

Playground menu
Figure 3. Playground menu icon

Access the Playground from an endpoint window

From an Endpoint window, click Try now. The Playground window will open.

Try Now
Figure 4. Generative tuning endpoint Try Now

Use the Playground editor

  1. Select your live endpoint from the Endpoint drop-down.

    Endpoint select
    Figure 5. Playground endpoint select
  2. Choose the type of interaction you wish to have from the Mode drop-down.

    1. Chat mode provides a word-by-word response to your prompt. Additionally, follow-on prompts are kept within the context of your conversation. This allows the Playground to understand your prompts without the need to restate preceding information.

    2. In Completion mode, the Playground will provide complete statement responses to your prompt. Providing specific instructions in your prompt will help produce the best generations.

      Mode select
      Figure 6. Playground mode select
  3. Select one of the presets from the Add from presets drop-down to populate the editor and quickly experience generative tuning.

    Add from presets chat mode
    Figure 7. Add from presets Chat mode
    Add from presets completion mode
    Figure 8. Add from presets Completion mode
    1. Alternatively, you can input text directly into the editor, without selecting a preset.

  4. Click Submit to initiate a response by the platform.

    1. A Chat mode response is displayed in the editor with row striping; your prompt is designated as a highlighted row.

      1. Click Stop generating to halt a response that is in progress.

      2. Click the copy icon Copy icon to copy the corresponding prompt or response to your clipboard.

    2. A Completion mode response is displayed in the editor with highlighted blue text.

      1. Click Download results to download the last response provided. The file will be downloaded to the location configured by your browser.

        Add from presets chat mode
        Figure 9. Chat mode response
        Add from presets completion mode
        Figure 10. Completion mode response
  5. Click Tokens available to expand the menu and access the System prompt box to toggle it On or Off. The System prompt sends an initial instruction input prompt to the model, which guides responses and sets context.

    System prompt box
    Figure 11. System prompt box
  6. Click the clear editor icon Trash icon to clear the Playground editor. This will remove all inputs and responses, as well as reset the tokens available.

    If your prompt returns an unexpected response or error, clear the editor before submitting additional prompts.

Prompt guidelines

We recommend using the following guidelines when submitting prompts to the Playground.

Prompt structure

End your prompts with a colon (:), a question mark (?), or another indicator that lets the model know it is time to start generating. For example, Please summarize the previous article: (with a colon) is a better prompt than Please summarize the previous article (without a colon). Adding these annotations tends to lead to better generations because it indicates to the model that you have finished your question and are expecting an answer.

Resubmitting prompts

Please ensure that you do not submit an <|endoftext|> token in your prompt. This might happen if you click Submit twice after the model returns its generations.

Tokens

Tokens are the basic units of text that are used when processing a prompt to generate a language output, or prediction. They can be thought of as pieces of words. Tokens do not map exactly to where a word begins or ends; they can include trailing spaces (spaces after a word) and subwords. Before the processing of a prompt begins, the input is broken into tokens.
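
As an illustration only, the sketch below uses the open-source Hugging Face transformers tokenizer for GPT-2 to show how a sentence breaks into subword tokens. The actual tokenizer, and therefore the exact splits and counts, depend on the model behind your endpoint.

    # Illustration only: the GPT-2 tokenizer stands in for the model-specific
    # tokenizer used by your endpoint, so the exact splits and counts will differ.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokens = tokenizer.tokenize("Please summarize the previous article:")
    print(tokens)       # subword pieces; a leading 'Ġ' marks a preceding space
    print(len(tokens))  # the number of tokens this prompt would consume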

The SambaStudio Playground displays the available tokens, updated for each submission. Click Tokens available to expand and show Max seq length, System prompt, and Used tokens.

Max seq length

Represents the maximum number of tokens supported by the model used for the endpoint.

System prompt

Displays the number of tokens used by the System prompt input statement.

Used tokens

The number of tokens displayed in the editor, including inputs and responses.

Tokens available

Displays the number of tokens that remain available to be used. The number is updated for each submission by subtracting the System prompt and Used tokens values from Max seq length (see the sketch after Figure 12). Click the clear editor icon Trash icon to reset the tokens available and clear the editor.

Token available
Figure 12. Tokens available expanded
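
The arithmetic behind the counter is simple subtraction. A minimal sketch with hypothetical values (read the real numbers from the expanded Tokens available panel):

    # Hypothetical values for illustration only.
    max_seq_length = 2048  # maximum tokens supported by the endpoint's model
    system_prompt = 32     # tokens used by the System prompt input statement
    used_tokens = 415      # tokens currently in the editor (inputs and responses)

    tokens_available = max_seq_length - system_prompt - used_tokens
    print(tokens_available)  # 1601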

Tuning parameters

Tuning parameters provide additional flexibility and options for generative tuning. Adjusting these parameters allows you to search for the optimal values to maximize the performance and output of the response.

Tuning parameters
Figure 13. Tuning parameters panel

The following parameters are available in the Playground.

Do sampling

Toggles whether sampling is used. If not enabled, greedy decoding is used. When enabled, the platform randomly picks the next word according to its conditional probability distribution, so language generation with sampling is not deterministic. If you need deterministic results, set this to Off; the model is then less likely to generate unexpected or unusual words. Setting it to On gives the model a better chance of generating a high quality response, even with inherent deficiencies, but the added hallucinations and non-determinism can be undesirable in an industrial pipeline. A minimal sketch of the difference follows the list below.

  • Setting Do sampling to On is recommended for endpoints that use BLOOMChat and Human Aligned (HA) models.

  • Setting Do sampling to Off is recommended for endpoints that use Instruction Tuned (IT) models.

  • If Do sampling is set to Off, Temperature, Top k, and Top p are ignored and have no effect.
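
A minimal sketch of the difference, using a made-up three-token distribution (real models choose over the full vocabulary):

    import random

    # Made-up next-token probabilities for illustration only.
    probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}

    # Do sampling = Off: greedy decoding always picks the most probable token.
    greedy_token = max(probs, key=probs.get)  # always "cat"

    # Do sampling = On: the next token is drawn at random, weighted by
    # probability, so repeated runs can return different tokens.
    sampled_token = random.choices(list(probs), weights=list(probs.values()))[0]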

Max tokens to generate

The maximum number of tokens to generate, not counting the tokens in the prompt. When using max tokens to generate, make sure that your prompt tokens plus the requested max tokens to generate do not exceed the supported sequence length of the model (a sketch of this check follows the list below). You can use this parameter to limit the response to a certain number of tokens. The generation will stop under the following conditions:

  1. The model generates an <|endoftext|> token.

  2. The generation encounters a stop sequence set up in the parameters.

  3. The generation reaches the limit for max tokens to generate.
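
A minimal sketch of the length check described above, with hypothetical token counts:

    # Hypothetical numbers; substitute your model's sequence length and the
    # actual token count of your prompt.
    max_seq_length = 2048
    prompt_tokens = 1500
    max_tokens_to_generate = 600

    if prompt_tokens + max_tokens_to_generate > max_seq_length:
        # 1500 + 600 = 2100 exceeds 2048, so shorten the prompt or lower the limit.
        max_tokens_to_generate = max_seq_length - prompt_tokens  # 548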

Repetition penalty

The repetition penalty, also known as frequency penalty, controls the model’s tendency to repeat predictions. It reduces the probability of words that have previously been generated or that belong to the context, and the penalty depends on how many times a word has already occurred in the prediction. This decreases the model’s likelihood of repeating the same line verbatim.

A value of 1 means no penalty.
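
The exact formula the platform applies is not documented here, but a common open-source formulation illustrates the idea: the logits of tokens that have already appeared are scaled so those tokens become less likely. The function below is a hypothetical sketch, not the platform’s implementation.

    def apply_repetition_penalty(logits, generated_token_ids, penalty):
        # One common formulation: divide positive logits and multiply negative
        # logits of already-generated tokens by the penalty. A penalty of 1.0
        # leaves the logits unchanged.
        adjusted = list(logits)
        for token_id in set(generated_token_ids):
            if adjusted[token_id] > 0:
                adjusted[token_id] /= penalty
            else:
                adjusted[token_id] *= penalty
        return adjusted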

Return token count only

When set to On, the Playground will not generate a response in Completion mode, but will still update the tokens available. When set to Off, the Playground will generate responses and update the tokens available.

This parameter has no effect in Chat mode.

Stop sequences

Stop sequences are used to make the model stop generating text at a desired point, such as the end of a sentence or a list. This optional setting tells the API when to stop generating tokens; the completion will not contain the stop sequence. If nothing is passed, the setting defaults to the <|endoftext|> token, which represents a probable stopping point in the text.
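
Conceptually, the returned completion is cut just before the first stop sequence encountered. A minimal sketch of that behavior, using a made-up stop sequence:

    def truncate_at_stop(text, stop_sequences=("<|endoftext|>",)):
        # Cut the completion just before the earliest stop sequence found;
        # the stop sequence itself is not included in the result.
        for stop in stop_sequences:
            index = text.find(stop)
            if index != -1:
                text = text[:index]
        return text

    truncate_at_stop("1. apples\n2. pears\n###", stop_sequences=("###",))
    # returns "1. apples\n2. pears\n"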

Temperature

The value used to modulate the next token probabilities. As the value decreases, the model becomes more deterministic and repetitive. A temperature between 0 and 1 controls the randomness and creativity of the model’s predictions. A temperature close to 1 means that the logits are passed through the softmax function without modification. A temperature close to 0 makes the most probable tokens very likely compared to the other tokens: the model becomes more deterministic and will always output the same set of tokens after a given sequence of words. A minimal sketch follows the list below.

  • A Temperature of 0.7 or higher is recommended for endpoints that use Human Aligned (HA) models.

  • Temperature will have no effect when Do sampling is set to Off.
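
A minimal sketch of temperature scaling, assuming the usual approach of dividing the logits by the temperature before the softmax; the logit values are made up for illustration:

    import math

    def softmax_with_temperature(logits, temperature):
        # Scale the logits by 1/temperature, then apply the softmax.
        scaled = [x / temperature for x in logits]
        top = max(scaled)
        exps = [math.exp(x - top) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.0, 0.5]
    softmax_with_temperature(logits, 1.0)  # ~[0.63, 0.23, 0.14] - varied choices
    softmax_with_temperature(logits, 0.2)  # ~[0.99, 0.01, 0.00] - near-deterministic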

Top k

The number of highest probability vocabulary tokens to keep for top k filtering. Top k means allowing the model to choose randomly among the top k tokens by their respective probabilities. For example, choosing the top three tokens means setting the top k parameter to a value of 3. Changing the top k parameter sets the size of the shortlist the model samples from as it outputs each token. Setting top k to 1 results in greedy decoding.

Top k will have no effect when Do sampling is set to Off.
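
A minimal sketch of top k filtering over a made-up four-token distribution:

    def top_k_filter(probs, k):
        # Keep the k most probable tokens, drop the rest, and renormalize.
        # With k = 1 this reduces to greedy decoding.
        top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
        total = sum(p for _, p in top)
        return {token: p / total for token, p in top}

    probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}
    top_k_filter(probs, 3)  # {"cat": ~0.53, "dog": ~0.32, "fish": ~0.16}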

Top logprobs

Shows the top <number> (the numerical value entered) of tokens ranked by their probability of being generated next. This indicates how likely each token was to be generated, which helps you debug a given generation and see alternative options to the generated token. The highlighted token is the one that the model predicted; the list is sorted by probability from high to low, until the top <number> is reached. As you tune other parameters, you can use this feature to analyze how the tokens predicted by the model change. A minimal sketch of the underlying ranking follows Figure 14.

Hover over a highlighted token to display the list.

Return logits
Figure 14. Top logprobs set to 3
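
A minimal sketch of how such a list can be derived from a next-token distribution; the token names and probabilities are made up for illustration:

    import math

    def top_logprobs(probs, n):
        # Rank tokens from most to least probable and report the log
        # probability of the first n - the list shown on hover.
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        return [(token, math.log(p)) for token, p in ranked[:n]]

    probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}
    top_logprobs(probs, 3)
    # [("cat", -0.69), ("dog", -1.20), ("fish", -1.90)]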

Top p

Top p sampling, sometimes called nucleus sampling, is a technique used to sample possible outcomes of the model. It controls diversity via nucleus sampling, as well as the randomness and originality of the model. The top p parameter specifies a sampling threshold during inference time: top p shortlists the top tokens whose sum of likelihoods does not exceed a certain value. If set to less than 1, only the smallest set of most probable tokens with probabilities that add up to top p or higher is kept for generation.

Top p will have no effect when Do sampling is set to Off.
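
A minimal sketch of nucleus (top p) filtering over a made-up four-token distribution:

    def top_p_filter(probs, top_p):
        # Keep the smallest set of most probable tokens whose cumulative
        # probability reaches top_p, then renormalize over that set.
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        total = sum(p for _, p in kept)
        return {token: p / total for token, p in kept}

    probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}
    top_p_filter(probs, 0.8)  # keeps "cat" and "dog" (0.5 + 0.3 = 0.8)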

View tuning parameter information in the GUI

A tuning parameter’s definition and values are viewable within the SambaStudio GUI. Follow the steps below to view information for a tuning parameter.

  1. In the Tuning parameters panel, hover over the parameter name you wish to view. An overview parameter card will display.

  2. Click the > (right arrow) to display the complete parameter card that includes its definition and values.

  3. Click the X to close the complete parameter card.

Tuning parameter card
Figure 15. Tuning parameter information

View code

Follow the steps below to view and copy code generated from the current input.

  1. At the bottom of the editor, click View code to open the View code window.

  2. Click the CURL, CLI, or Python SDK tab to view the corresponding code block and make a request programmatically.

  3. Click Copy code to copy the selected code block to your clipboard.

  4. Click Close, or the upper right X, to close the window and return to the Playground.

View code window
Figure 16. View code