In this guide, you’ll learn how to set up and use Llama Stack—a standardized framework that simplifies AI application development. We’ll walk you through building the SambaNova distribution server, installing the client, and running your first model inference. Whether you’re prototyping or scaling up, this guide will help you get started quickly with best practices from the Llama ecosystem integrated into a modular, efficient architecture.

Components of Llama Stack

Llama Stack includes two main components:

  • Server – A running distribution of Llama Stack that hosts various adaptors.
  • Client – A consumer of the server’s API, interacting with the hosted adaptors.

Get your SambaNova Cloud API key

  1. Create a SambaNova Cloud account.
  2. Navigate to the API key section.
  3. Generate a new key (if you don’t already have one).
  4. Copy and store the key securely.
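
To keep the key out of your source code, read it from an environment variable at runtime. Below is a minimal Python sketch; the variable name matches the SAMBANOVA_API_KEY exported later in this guide.

import os

# Read the API key from the environment rather than hardcoding it in source files
api_key = os.environ.get("SAMBANOVA_API_KEY")
if not api_key:
    raise RuntimeError("SAMBANOVA_API_KEY is not set; export it before running this script")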

Build the SambaNova Llama Stack server

  • Set up a Python virtual environment
python -m venv .venv
source .venv/bin/activate
  • Install required dependencies
pip install uv
pip install llama-stack
  • Build the SambaNova distribution image
mkdir -p ~/.llama
llama stack build --template sambanova --image-type container
  • Verify Docker image creation
docker image list
  • Example output
REPOSITORY                        TAG       IMAGE ID       CREATED          SIZE
distribution-sambanova            0.2.6     4f70c8f71a21   5 minutes ago   2.4GB

Run the SambaNova distribution server

  • Export required environment variables
export LLAMA_STACK_PORT=8321
export SAMBANOVA_API_KEY="your-api-key-here"
  • Run the server with Docker (the image tag must match the one shown by docker image list)
docker run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  distribution-sambanova:0.2.6 \
  --port $LLAMA_STACK_PORT \
  --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
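
Once the container is running, you can confirm that the server is accepting connections. The following sketch uses only the Python standard library and assumes the default port 8321 exported above.

import socket

# Confirm the Llama Stack server is listening before pointing the client at it
with socket.create_connection(("localhost", 8321), timeout=5):
    print("Llama Stack server is reachable on port 8321")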

Install the Llama Stack client

In the same or another environment, run:

  pip install llama-stack-client

Use the client to interact with the server

The following Python code demonstrates basic usage:

from llama_stack_client import LlamaStackClient

LLAMA_STACK_PORT = 8321
client = LlamaStackClient(base_url=f"http://localhost:{LLAMA_STACK_PORT}")

# List all available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
    print(f"- {m.identifier}")
print()

# Choose a model from the list
model = "sambanova/Meta-Llama-3.3-70B-Instruct"

# Run chat completion
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."},
    ],
    model_id=model,
)

print(response.completion_message.content)

This demonstrates the full client-server loop: connecting, listing models, and running inference.
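
If you prefer not to hardcode the model name, you can also pick one from the list returned by the server. The snippet below is a small variation on the example above; it uses only the identifier field already shown and falls back to the first registered model if no Llama 3.3 model is found.

# Select a model dynamically from the server's model list instead of hardcoding it
models = list(client.models.list())
model = next(
    (m.identifier for m in models if "Llama-3.3" in m.identifier),
    models[0].identifier,  # fall back to the first registered model
)
print(f"Using model: {model}")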

Explore the SambaNova Llama Stack integration repo for several use cases built on the SambaNova distribution's LLM, embedding, tool, and agent adaptors.

Llama Stack documentation

Refer to the Llama Stack docs to:

  • Understand core concepts
  • Dive into sample apps
  • Learn how to extend and customize the framework