This document describes different aspects of text generation, including types of generation, model selection, creating prompts, and managing multi-turn conversations.

Types of generation

You can generate text synchronously or asynchronously, each with or without streaming.

Simple generation (non-streaming)

Use the following code to perform non-streaming text generation with the OpenAI Python client.

Simple text generation Python code
from openai import OpenAI

# Create a client pointed at the SambaNova OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key="<your-api-key>"
)

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"}
    ]
)

print(completion.choices[0].message.content)

Asynchronous generation (non-streaming)

For asynchronous completions, use the AsyncOpenAI Python client, as shown below.

Asynchronous text generation Python code
from openai import AsyncOpenAI
import asyncio

async def main():
    client = AsyncOpenAI(
        base_url="https://api.sambanova.ai/v1",
        api_key="<your-api-key>"
    )
    completion = await client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer the question in a couple sentences."},
            {"role": "user", "content": "Share a happy story with me"}
        ]
    )
    print(completion.choices[0].message.content)

asyncio.run(main())

Streaming response

For real-time streaming completions, use the following approach with the OpenAI Python client.

Streaming response Python code
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key="<your-api-key>"
)

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"}
    ],
    stream=True
)

# Print tokens as they arrive; guard against chunks that carry no content
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Asynchronous streaming

Use the AsyncOpenAI Python client for asynchronous streaming, as shown below.

Asynchronous streaming Python code
from openai import AsyncOpenAI
import asyncio

async def main():
    client = AsyncOpenAI(
        base_url="https://api.sambanova.ai/v1",
        api_key="<your-api-key>"
    )
    completion = await client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer the question in a couple sentences."},
            {"role": "user", "content": "Share a happy story with me"}
        ],
        stream=True
    )
    # Print tokens as they arrive; guard against chunks that carry no content
    async for chunk in completion:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(main())

Model selection

Models differ in architecture and size, which affect their speed and response quality. Selecting a model depends on the factors shown below.

Factor | Consideration
Task complexity | Larger models are better suited for complex tasks.
Accuracy requirements | Larger models generally offer higher accuracy.
Cost and resources | Larger models come with increased costs and resource demands.

Experiment with various models to find the one that best fits your specific use case.
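
Since the API is OpenAI-compatible, you can usually enumerate the models an endpoint serves. The sketch below assumes the SambaNova endpoint supports the standard model-listing route; check the provider's documentation for the authoritative list.

Listing available models Python code
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key="<your-api-key>"
)

# Iterate over the models the endpoint reports (assumes the
# OpenAI-compatible /v1/models route is supported)
for model in client.models.list():
    print(model.id)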

Creating effective prompts

Prompt engineering is the practice of designing and refining prompts to optimize responses from large language models (LLMs). This process is iterative and requires experimentation to achieve the best possible outcomes.

Building a prompt

A basic prompt can be as simple as a few words to elicit a response from the LLM. However, for more complex use cases, additional elements may be needed, as shown below.

Element | Description
Defining a persona | Assigning a specific role to the model (e.g., “You are a financial advisor”).
Providing context | Supplying background information to guide the model’s response.
Specifying output format | Instructing the model to respond in a particular style (e.g., JSON, bullet points, structured text).
Describing a use case | Clarifying the goal of the interaction.
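
These elements can be combined in a single request. The sketch below reuses the client from the earlier examples; the prompt wording is illustrative, not prescriptive.

Prompt combining persona, context, and output format Python code
completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a financial advisor. "                       # persona
                "The user is saving for retirement over 30 years. "   # context
                "Answer as a bulleted list of at most three points."  # output format
            )
        },
        # The use case itself: a concrete question for the advisor
        {"role": "user", "content": "How should I balance stocks and bonds?"}
    ]
)
print(completion.choices[0].message.content)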

Advanced prompting techniques

To improve response quality and reasoning, more advanced techniques can be used.

Technique | Description
In-context learning | Providing examples of desired outputs to guide the model.
Chain-of-Thought (CoT) prompting | Encouraging the model to articulate its reasoning before delivering a response.
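
Both techniques are applied purely through the messages you send. The sketch below reuses the client from the earlier examples and combines them: prior user/assistant turns serve as few-shot examples, and the final instruction nudges the model to reason step by step. The task and wording are illustrative.

In-context learning with a CoT instruction Python code
completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
        # Few-shot examples establish the expected output format
        {"role": "user", "content": "Review: The food was amazing."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Review: The service was painfully slow."},
        {"role": "assistant", "content": "negative"},
        # The real query, with a simple Chain-of-Thought nudge
        {"role": "user", "content": "Review: Great decor, but the food was cold. Think step by step, then give the label."}
    ]
)
print(completion.choices[0].message.content)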

For more details about prompt engineering, see A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications.

Messages and roles

In chat-based interactions, messages are represented as dictionaries with specific roles and content.

Element | Description
role | Specifies who is sending the message.
content | Contains the message text.

Common roles

Roles are typically categorized as system, user, or assistant.

Role | Description
system | Provides general instructions to the model.
user | Represents user input.
assistant | Contains the model’s response.

Multi-turn conversation

To maintain context across multiple exchanges, messages in a conversational AI system are typically stored as a list of dictionaries, each specifying the sender’s role and the message content. Re-sending this list with every request lets the model track the conversation across turns.

Below is an example of how a multi-turn conversation is structured using the Meta-Llama-3.1-8B-Instruct model:

Multi-turn conversation with Meta-Llama-3.1-8B-Instruct Python code
completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hi! My name is Peter and I am 31 years old. What is 1+1?"},
        {"role": "assistant", "content": "Nice to meet you, Peter. 1 + 1 is equal to 2."},
        {"role": "user", "content": "What is my age?"}
    ],
    stream=True
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

After running the program, you should see an output similar to the following.

Example output
You told me earlier, Peter. You're 31 years old.

By structuring conversations this way, the model can maintain context, recall prior user inputs, and provide more coherent responses.

Considerations for long conversations

When engaging in long conversations with LLMs, certain factors such as token limits and memory constraints must be considered to ensure accuracy and coherence.

  • Token limits - LLMs have a fixed context window, limiting the number of tokens they can process in a single request. If the input exceeds this limit, the system might truncate it, leading to incomplete or incoherent responses.

  • Memory constraints - The model does not retain context beyond its input window. To preserve context, past messages must be re-included in each request, trimming the oldest turns when the history grows too large (see the sketch after this list).
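
A common approach is to always keep the system message and drop the oldest turns until the history fits a size budget. The sketch below uses a rough characters-per-token estimate rather than the model's real tokenizer, so the budget is approximate; the constant and function names are illustrative.

Trimming conversation history Python code
MAX_TOKENS = 4000  # hypothetical budget, kept below the model's context window

def estimate_tokens(message):
    # Crude heuristic: roughly 4 characters per token, plus a small
    # per-message overhead. Real counts depend on the model's tokenizer.
    return len(message["content"]) // 4 + 4

def trim_history(messages, budget=MAX_TOKENS):
    # Assumes the first message is the system message, which is always kept;
    # the oldest user/assistant turns are dropped first.
    system, turns = messages[0], list(messages[1:])
    while turns and sum(estimate_tokens(m) for m in [system] + turns) > budget:
        turns.pop(0)
    return [system] + turns

# Pass the trimmed history instead of the full conversation, e.g.:
# completion = client.chat.completions.create(
#     model="Meta-Llama-3.1-8B-Instruct",
#     messages=trim_history(conversation_history)
# )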

By structuring prompts effectively and managing conversation history, you can optimize interactions with LLMs for better accuracy and coherence.