Samba-1 Turbo v0.2

Release version: v0.2 | Release date: 08/07/2024


Samba-1 Turbo is a collection of high-performance inference models. The platform supports dynamic batching on each of its models, automatically selecting the optimal batch size based on the specific combination of requests at inference time. Additionally, when used with the SambaStudio 24.6.1 release, a global queue is provided for improved scheduling efficiency.

Before deploying any of these models as an endpoint, they must first be created as part of a composition using the Create your own CoE model workflow.

Release features

  • Inference speed improvements:

    • Up to 50% throughput improvements across existing configs.

    • Up to 70% reduction in time to first token (TTFT) across existing configs.

    • Global queue support.

      • About 50% improvement on TTFT.

      • About 25% improvement on overall latency.

  • Model selection improvements:

    • Llama 3.1 8B and 70B up to 8k seq length.

    • Llama 2 70B.

    • Mistral 7B up to 32k seq length.

    • Embedding model e5-mistral up to 32k seq length.

    • Deepseek coder 6.7B and 33B up to 16k seq length.

    • Korean models Solar and EEVE.

  • Flexibility Improvements:

    • Support for multiple experts with dynamic batching in the same endpoint.

  • Accuracy improvements:

    • Improved generation quality through better handling of long sequence inputs, chat history, and long sequence generation.

Samba-1 Turbo v0.2 new models

The entries below describe the new models available in the Samba-1 Turbo v0.2 release.

In this release (Samba-1 Turbo release v0.2), models with multiple sequence length configurations will have the maximum supported context length included in the model name. For example, Meta-Llama-3-8B-Instruct-4096 supports up to 4k context length, while Meta-Llama-3-8B-Instruct-8192 supports up to 8k context length. Shorter context length variants offer higher performance.

Depending on the specific use case, the most appropriate context length variant of a model can be selected to optimize performance. In a subsequent release of Samba-1 Turbo, the platform will automatically handle this input condition-based selection. This will result in the consolidation of multiple variants into a single model that has equal performance across all input conditions.
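Until that automatic selection is available, the choice of variant can be handled on the client side. The sketch below is a minimal illustration of that pattern using the variant names described above; the count_tokens helper is a hypothetical stand-in and should be replaced with the model's actual tokenizer.

# Minimal sketch: pick the smallest context length variant that fits the request.
# Variant names follow the convention described above (suffix = max context length).
# count_tokens() is a hypothetical stand-in; replace it with the model's tokenizer.

VARIANTS = {
    4096: "Meta-Llama-3-8B-Instruct-4096",
    8192: "Meta-Llama-3-8B-Instruct-8192",
}

def count_tokens(prompt: str) -> int:
    # Rough approximation: whitespace word count.
    return len(prompt.split())

def pick_variant(prompt: str, max_new_tokens: int) -> str:
    needed = count_tokens(prompt) + max_new_tokens
    for ctx_len in sorted(VARIANTS):
        if needed <= ctx_len:
            return VARIANTS[ctx_len]
    raise ValueError(f"Request needs roughly {needed} tokens, which exceeds all variants")

print(pick_variant("Summarize the attached report in three bullet points.", max_new_tokens=512))
# Prints: Meta-Llama-3-8B-Instruct-4096

Preferring the shortest variant that fits takes advantage of the higher performance of the shorter context length configurations noted above.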

Each entry below lists the model name, a description of the model, its attributes, and any usage notes.

Meta-Llama-3.1-8B-Instruct

Meta-Llama-3.1-8B-Instruct is an instruction following model offering a larger context window than its predecessor, Llama 3. We support up to 8K context in this release. It has multilingual capability, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

  • Multilingual

  • Large outputs

  • General purpose

  • Document analysis

In this release, the platform supports a maximum context length of 8k for this model. Support for longer context lengths up to 128k is targeted for a subsequent release.

Meta-Llama-3.1-70B-Instruct

Developed by Meta, Meta-Llama-3.1-70B-Instruct is an instruction following model that offers a larger context window than its predecessor, Llama 3. We support up to 8K context in this release. It has multilingual capability, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The 70B parameter model performs better on benchmarks such as MATH, GSM8K (grade school math), and MMLU (knowledge acquisition) than its 8B parameter variant.

  • Multilingual

  • Large outputs

  • General purpose

  • Document analysis

In this release, the platform supports a maximum context length of 8k for this model. Support for longer context lengths up to 128k is targeted for a subsequent release.

Llama-2-70b-Chat-hf

Optimized for dialogue or conversational use-cases, Llama-2-70b-Chat-hf is part of the Llama 2 model family developed by Meta. It can be used for general-purpose assistant-like tasks. Relative to its smaller parameter variants, this 70B parameter variant uses Grouped-Query Attention (GQA) for better inference scalability, which reduces the computational cost of Multi-Head Attention (MHA) while maintaining better accuracy than Multi-Query Attention (MQA).

  • Conversational

  • Chat

  • Dialogue

  • Assistant-like chat

e5-Mistral-7B-Instruct

The e5-Mistral-7B-Instruct is a text embedding model derived from Mistral-7B-v0.1. This model can be used to generate text embeddings and a similarity score based on the inputs passed. It additionally supports other tasks through task instructions in the chat template. These tasks include web search query (assuming the web data is passed to the model), semantic text similarity, summarization, and retrieval of parallel text. Although this model has multilingual capabilities, it is recommended to use this model with English text.

  • Embedding

  • Text similarity

  • Retrieval

The e5-Mistral-7B-Instruct embedding models only support the Predict API, not the Stream API. (A minimal embedding similarity sketch is shown after the model entries below.)

Deepseek-coder-33B-Instruct

Deepseek-coder-33B-Instruct is an instruction following code model. It supports use-cases such as code generation, code interpretation, debugging, and code refactoring. The model supports the English and Chinese natural languages as well as low-level languages such as Assembly, C, C++, and Rust. Additionally, Deepseek-coder-33B-Instruct supports a multitude of languages, formats, and tools, including general-purpose languages (C#, Go, Java, Python, Ruby, and TypeScript), web development languages (CSS, HTML, and JavaScript), markup and data formats (JSON and Markdown), scripting languages (PowerShell and Shell), data and statistical tools (R and SQL), domain-specific languages (SQL and Verilog), and other tools (CMake, Makefile, Dockerfile, and Jupyter Notebook).

  • Coding model

  • Code generation

Solar-10.7B-Instruct-v1.0

Solar-10.7B-Instruct-v1.0 is a general-purpose, fine-tuned variant of its predecessor, SOLAR-10.7B. This model family uses a methodology called depth up-scaling (DUS), which makes architectural changes to a Llama 2 based model by integrating Mistral 7B weights into the upscaled layers and then continuing pretraining on the result. With only 10.7 billion parameters, it offers state-of-the-art performance in NLP tasks and outperforms models with up to 30 billion parameters.

  • General-purpose

  • Compact

EEVE-Korean-Instruct-10.8B-v1.0

EEVE-Korean-Instruct-10.8B-v1.0 is a Korean and English instruction following model adapted from SOLAR-10.7B and Phi-2. It uses vocabulary expansion (EEVE) techniques, among others, to transfer the base model's knowledge and understanding into Korean. It can perform traditional NLP tasks in Korean.

  • General purpose

  • Korean

  • Multilingual
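As a usage illustration for the e5-Mistral-7B-Instruct embedding models described above, the following is a minimal sketch of turning two predict-style responses into a cosine similarity score. The endpoint URL, headers, and request/response shapes shown here are hypothetical placeholders rather than the documented Predict API format; consult the SambaStudio API reference for the exact fields.

# Minimal sketch: cosine similarity between two e5-mistral-7b-instruct embeddings.
# NOTE: the endpoint URL, headers, and payload/response shapes below are
# hypothetical placeholders; substitute the actual Predict API request format.
import math
import requests

ENDPOINT_URL = "https://<your-sambastudio-host>/api/predict/<endpoint-id>"  # placeholder
API_KEY = "<your-endpoint-key>"  # placeholder

def get_embedding(text: str) -> list[float]:
    response = requests.post(
        ENDPOINT_URL,
        headers={"key": API_KEY},     # placeholder auth header
        json={"inputs": [text]},      # placeholder payload shape
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"][0]  # placeholder response shape

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = get_embedding("How do I restart an endpoint?")
doc_vec = get_embedding("Restart the endpoint from the SambaStudio GUI if it hangs.")
print(f"similarity: {cosine_similarity(query_vec, doc_vec):.3f}")

As noted above and in the known limitations, these embedding models only support the Predict API and batch sizes up to 8.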

Samba-1 Turbo v0.2 Composition of Experts (CoE)

Samba-1 Turbo v0.2 includes several new CoEs and their corresponding models, described below.

  • The Samba-1 Turbo v0.2 release only runs on SambaNova’s SN40L hardware generation.

  • The Samba-1 Turbo v0.2 release requires SambaStudio release 24.6.1 or later to run.

Each entry below lists the CoE name, its expert/model names with the supported dynamic batch sizes, and a description of the CoE.

Samba-1 Turbo [Beta]

  • Meta-Llama-3-8B-Instruct-4096 (dynamic batch sizes: 1, 4, 8, 16)

  • Mistral-7B-Instruct-V0.2-4096 (dynamic batch sizes: 1, 4, 8, 16)

  • llama-2-7B-chat-hf (dynamic batch sizes: 1, 4, 8, 16)

  • deepseek-coder-6.7B-instruct-4096 (dynamic batch sizes: 1, 4, 8, 16)

  • Llama-2-13B-chat-hf (dynamic batch sizes: 1, 4, 8, 16)

  • Meta-Llama-3-70B-Instruct-4096 (dynamic batch sizes: 1, 4, 8, 16)

Samba-1 Turbo [Beta] contains a breadth of general purpose LLMs in addition to a coding model. The parameter counts in this CoE vary to provide both highly accurate models as well as those that are more light-weight. This CoE can be used for general-purpose applications as well as those requiring coding tasks.

Samba-1 Turbo with embedding - small [Beta]

  • Meta-Llama-3-8B-Instruct-4096 (dynamic batch sizes: 1, 4, 8, 16)

  • Meta-Llama-3-8B-Instruct-8192 (dynamic batch sizes: 1, 4, 8, 16)

  • Mistral-7B-Instruct-V0.2-4096 (dynamic batch sizes: 1, 4, 8, 16)

  • Mistral-7B-Instruct-V0.2-32768 (dynamic batch sizes: 1, 4, 8, 16)

  • SOLAR-10.7B-Instruct-v1.0 (dynamic batch sizes: 1, 4, 8, 16)

  • EEVE-Korean-Instruct-10.8B-v1.0 (dynamic batch sizes: 1, 4, 8, 16)

  • e5-mistral-7b-instruct-8192 (dynamic batch sizes: 1, 4, 8)

  • e5-mistral-7b-instruct-32768 (dynamic batch sizes: 1, 4, 8)

Samba-1 Turbo with embedding - small [Beta] comprises the small, performant versions of the new LLMs in this Samba-1 Turbo release. In addition to the LLMs, it contains the e5-mistral-7b-instruct text embedding models for tasks needing an embedding output. The two sequence length variants of the embedding models, 8192 and 32768, are intended for shorter-form and longer-form content respectively. This CoE can be used for use-cases requiring embedding models in a light-weight and performant context.

Samba-1 Turbo Llama 3.1 70B 4096 dynamic batching

Meta-Llama-3.1-70B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3.1 70B 4096 dynamic batching CoE contains the 4096 sequence length variant of Llama 3.1 70B, making it a CoE that can be used for general purpose tasks with shorter-form inputs. This generally results in a lower time to first token.

Samba-1 Turbo Llama 3.1 70B 8192 dynamic batching

Meta-Llama-3.1-70B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3.1 70B 8192 dynamic batching CoE contains the 8192 sequence length variant of Llama 3.1 70B, making it a CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Llama 3.1 8B 4096 dynamic batching

Meta-Llama-3.1-8B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3.1 8B 4096 dynamic batching CoE contains the 4096 sequence length variant of Llama 3.1 8B, making it a CoE that can be used for general purpose tasks with shorter-form inputs. This generally results in a lower time to first token.

Samba-1 Turbo Llama 3.1 8B 8192 dynamic batching

Meta-Llama-3.1-8B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3.1 8B 8192 dynamic batching CoE contains the 8192 sequence length variant of Llama 3.1 8B, making it a CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Deepseek Coder 6.7B 4096 dynamic batching

deepseek-coder-6.7B-instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Deepseek Coder 6.7B 4096 dynamic batching CoE contains the 4096 sequence length variant of deepseek-coder-6.7B-instruct. This CoE can be used for instruction-based coding tasks.

Samba-1 Turbo Llama 2 13B 4096 dynamic batching

Llama-2-13B-chat-hf (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 2 13B 4096 dynamic batching CoE contains the 13B parameter variant of Llama 2. This CoE can be used for general-purpose tasks in a conversational or dialogue-based setting.

Samba-1 Turbo Llama 2 7B 4096 dynamic batching

llama-2-7B-chat-hf (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 2 7B 4096 dynamic batching CoE contains the 7B parameter variant of Llama 2. This CoE can be used for general-purpose tasks in a conversational or dialogue-based setting.

Samba-1 Turbo Llama 3 70B 4096 dynamic batching

Meta-Llama-3-70B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3 70B 4096 dynamic batching CoE contains the 4096 sequence length variant of Llama 3 70B, making it a CoE that can be used for general purpose tasks with shorter-form inputs. This generally results in a lower time to first token.

Samba-1 Turbo Llama 3 70B 8192 dynamic batching

Meta-Llama-3-70B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3 70B 8192 dynamic batching CoE contains the 8192 sequence length variant of Llama 3 70B, making it a CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Llama 3 8B 4096 dynamic batching

Meta-Llama-3-8B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3 8B 4096 dynamic batching CoE contains the 4096 sequence length variant of Llama 3 8B, making it a relatively compact CoE that can be used for general purpose tasks with shorter-form inputs.

Samba-1 Turbo Llama 3 8B 8192 dynamic batching

Meta-Llama-3-8B-Instruct (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 3 8B 8192 dynamic batching CoE contains the 8192 sequence length variant of Llama 3 8B, making it a relatively compact CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Mistral 7B 4096 dynamic batching

Mistral-7B-Instruct-V0.2 (dynamic batch sizes: 1, 4, 8)

The Samba-1 Turbo Mistral 7B 4096 dynamic batching CoE contains the compact 7B parameter Mistral instruction following model. This CoE can be used for general purpose instruction following or assistant-like tasks with smaller-form inputs.

Samba-1 Turbo Llama 2 70B 4096 dynamic batching

llama-2-70b-chat-hf (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Llama 2 70B 4096 dynamic batching CoE contains the large 70B parameter variant of Llama 2. The CoE can be used for general purpose dialogue or chat-based applications with shorter-form inputs.

Samba-1 Turbo Deepseek Coder 33B 4096 dynamic batching

deepseek-coder-33B-instruct-4096 (dynamic batch sizes: 1, 4, 8, 16)

The Samba-1 Turbo Deepseek Coder 33B 4096 dynamic batching CoE contains the 33B variant of deepseek-coder-instruct at a 4096 sequence length. This CoE can be used for coding tasks with relatively shorter-form inputs.

Samba-1 Turbo Deepseek Coder 33B 16384 dynamic batching

deepseek-coder-33B-instruct-16384 (dynamic batch sizes: 1, 4)

This CoE contains the 33B variant of deepseek-coder-instruct at a 16384 sequence length. Because of the increased sequence length, this CoE can be used for coding-related tasks on longer inputs, such as reading in larger code blocks for interpretation.

Known limitations

  • In rare cases, if an endpoint is kept live for an extended period (such as several days), it may hang. When this occurs, the endpoint must be restarted before it can resume normal operation.

  • When making requests to an inference endpoint as a user, you may occasionally encounter errors that are not directly related to your specific input. This is due to a known limitation where all requests within the same batch are processed as a group. As such, if one request in the batch fails, the entire batch will fail and all requests will receive the same error message.

    • For example, consider a batch of four requests. If three of the requests are valid, but one request exceeds the maximum supported context length of the selected model, all four requests in the batch will fail. In this case, even though only one request exceeded the context length limit, the entire batch will return the maximum context length exceeded error. (A client-side mitigation sketch is shown after this list.)

  • Cancelling requests for Samba-1 Turbo App models has a known limitation where the cancellation does not prevent the model from processing the requests.

    • For example, if a user sends 30 concurrent requests and immediately cancels all of them, the model will still process those requests. This can lead to a slower time to first token (TTFT) for subsequent requests, as the model is still busy handling the cancelled requests.

  • In this release, Meta-Llama-3.1-8B-Instruct and Meta-Llama-3.1-70B-Instruct support a maximum context length of 8k. Support for longer context lengths up to 128k is targeted for a subsequent release.

  • The e5-mistral-7b-Instruct embedding models only support a batch size (BS) up to 8.

  • When using the e5-mistral-7b-Instruct embedding models in the Playground, an error may be displayed due to a mismatch between the Predict and Stream APIs. This is not a crash, but rather expected behavior when using an embedding model in the Playground, which is not its typical usage.

  • The deepseek-coder-33B-Instruct model at 16k sequence length supports a batch size (BS) up to 4.
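As referenced in the batch-failure limitation above, two practical client-side mitigations are to validate the prompt length locally before sending (so a request does not exceed the model's maximum context length) and to retry a failed request, since the error may have been caused by another request that shared the same batch. The sketch below illustrates this pattern only: send_request and estimate_tokens are hypothetical stand-ins, not a documented SambaStudio client.

# Minimal sketch of client-side mitigations for the all-or-nothing batch behavior:
#   1. validate prompt length locally so the request does not exceed the limit, and
#   2. retry on failure, since the error may have come from another request that
#      happened to share the same batch.
# send_request() is a hypothetical stand-in for a real endpoint client call, and
# estimate_tokens() is a rough approximation of the model's tokenizer.
import time

MAX_CONTEXT = 4096  # context limit of the chosen model variant

def estimate_tokens(prompt: str) -> int:
    return len(prompt.split())  # replace with the model's real tokenizer

def send_request(prompt: str) -> str:
    # Hypothetical placeholder: call the inference endpoint here and return the
    # generated text.
    return "<completion>"

def generate(prompt: str, max_new_tokens: int = 512, retries: int = 2) -> str:
    if estimate_tokens(prompt) + max_new_tokens > MAX_CONTEXT:
        raise ValueError("Prompt is too long for the selected model variant")
    for attempt in range(retries + 1):
        try:
            return send_request(prompt)
        except Exception:
            # Possibly a batch-level failure caused by another request in the
            # same batch; back off briefly and retry.
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)

print(generate("Explain dynamic batching in one sentence."))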