Samba-1 Turbo v0.3

Release version: v0.3 | Release date: 09/16/2024


Samba-1 Turbo is a collection of high-performance inference models. The platform supports dynamic batching on each of its models: it automatically selects the optimal batch size based on the specific combination of requests at inference time. Additionally, when used with SambaStudio 24.8.1 or a later release, a global queue is provided for improved scheduling efficiency.

Any model that is not part of the released CoE in Samba-1 Turbo v0.3 must be added to a composition using the Create your own CoE model workflow before you can deploy it as an endpoint.

Release features

The Samba-1 Turbo v0.3 release includes the following improvements.

  • Inference speed improvements:

    • Throughput: Achieved improvements of up to 189% on existing configurations, with an average improvement of 126% across all configurations.

    • Time to first token (TTFT): Reduced by up to 90% on existing configurations, with an average reduction of 20% across all configurations.

    • Improved performance scaling efficiency across batch sizes larger than 1.

    • Addition of batch size 32 option for Llama 3.1 8B, 70B, and Mistral-7B.

  • Added the following model selections:

    • Qwen 2 7B.

    • Qwen 2 72B.

    • Sarashina2 7B.

    • Sarashina2 70B.

  • Model capability improvements:

    • The Samba-1 Turbo v0.3 release expands on the collective multilingual capabilities of our offered model set by introducing specialized models for particular languages that outperform general purpose models on respective language benchmarks.

    • Consolidated the previously separate CoE models from v0.2 into a unified, converged form, addressing limitations from earlier versions.

  • Flexibility improvements:

    • The new dynamic sequence size feature removes the need to select between different sequence length variants of the same model. The v0.3 release delivers a streamlined, single-model user experience akin to that of other platforms, such as GPUs.

    • The new DDR usage bar feature displays the amount of Double Data Rate (DDR) memory used by the selected experts during the creation of CoE models. This enables precise creation of larger and more sophisticated CoEs, surpassing the limitations of the previous release.

    • Support for fast inference, at the Turbo speed, when using imported or fine-tuned custom checkpoints.

  • Accuracy improvements:

    • Qwen2-72B-Instruct and Qwen2-7B-Instruct excel in complex tasks such as coding, math, and logical reasoning, further enhancing areas where the model set was already strong.

    • Sarashina2-7b and Sarashina2-70b add support for Japanese language tasks, outperforming previous iterations and general purpose models across various relevant benchmarks.

    • Together, these models enhance the overall accuracy, versatility, and language coverage of the previous model options, leading to a comprehensive improvement in accuracy across a wider array of tasks and languages.
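To make the throughput and TTFT percentages in the inference speed bullets concrete, here is a minimal sketch of the arithmetic. It assumes the figures are relative changes against a pre-v0.3 baseline, which these notes do not state explicitly. The baseline values (100 tokens per second and 500 ms) are made up purely for illustration and are not measurements.

```python
def apply_improvement(baseline: float, percent: float) -> float:
    """Return the value after a relative increase of `percent` percent."""
    return baseline * (1 + percent / 100)


def apply_reduction(baseline: float, percent: float) -> float:
    """Return the value after a relative reduction of `percent` percent."""
    return baseline * (1 - percent / 100)


# Hypothetical baseline numbers, chosen only to illustrate the arithmetic.
baseline_throughput = 100.0  # tokens per second
baseline_ttft_ms = 500.0     # milliseconds

# Throughput: average improvement of 126%, best case 189%.
avg_throughput = apply_improvement(baseline_throughput, 126)  # about 226 tokens/s
max_throughput = apply_improvement(baseline_throughput, 189)  # about 289 tokens/s

# TTFT: average reduction of 20%, best case 90%.
avg_ttft_ms = apply_reduction(baseline_ttft_ms, 20)   # about 400 ms
best_ttft_ms = apply_reduction(baseline_ttft_ms, 90)  # about 50 ms
```

Under this reading, a 126% throughput improvement corresponds to roughly 2.26 times the baseline, and a 20% TTFT reduction to 0.8 times the baseline.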

Samba-1 Turbo v0.3 model options

The Samba-1 Turbo v0.3 release expands model support and enhances performance on all models. It now supports 17 model architectures out of the box, including four new additions.

A key feature of this release is the ability to perform fast inference at the Turbo speed when using imported or fine-tuned custom checkpoints. This means that any new checkpoints, which are compatible with these 17 model architectures, can run at the same turbo performance levels as the pre-existing models. This flexibility enables high-speed inference, allowing users to benefit from top-tier performance regardless of whether they’re using standard models or their own custom-trained versions. This combination of expanded architecture support and uniform high-speed inference capabilities across custom and standard models represents a significant leap forward in both flexibility and performance for model deployment.

The Samba-1 Turbo v0.3 release includes the new dynamic sequence length feature. This feature eliminates the need to manually select between different sequence length variants of the same model. As a result, model names no longer include specific sequence length descriptors. For example, models previously differentiated by context length (e.g., Meta-Llama-3-8B-Instruct-4096 and Meta-Llama-3-8B-Instruct-8192) are now consolidated under a single name (e.g., Meta-Llama-3-8B-Instruct). Users can use any sequence length up to the maximum supported by the model without switching between variants. The platform automatically optimizes performance based on input length and use case, ensuring consistent efficiency across all input conditions. This advancement simplifies model selection and usage, effectively consolidating what were previously separate variants into a single, dynamically adaptive model.
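When updating client code or configuration that still refers to v0.2 names, the renaming described above can be sketched as a small helper. This is a hypothetical migration aid, not part of the platform: the function name `consolidated_name` and the list of suffixes (taken from the v0.2 model names that appear in these notes) are assumptions, and the helper does not adjust capitalization, so the result should still be checked against the v0.3 model list.

```python
import re

# Sequence-length suffixes that appear in v0.2 model names in these notes.
# Treating exactly these values as suffixes is an assumption of this sketch.
_SEQ_LEN_SUFFIX = re.compile(r"-(4096|8192|16384|32768)$")


def consolidated_name(v02_name: str) -> str:
    """Strip a trailing sequence-length descriptor from a v0.2 model name.

    Names without such a suffix are returned unchanged.
    """
    return _SEQ_LEN_SUFFIX.sub("", v02_name)


# Both v0.2 variants map to the same consolidated name.
short_variant = consolidated_name("Meta-Llama-3-8B-Instruct-4096")
long_variant = consolidated_name("Meta-Llama-3-8B-Instruct-8192")
```

Both calls return `Meta-Llama-3-8B-Instruct`, matching the example in the paragraph above.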

The table below describes the model options in the Samba-1 Turbo v0.3 release.

Each entry below lists, in order: model name, release status, description, attributes, and usage notes (where applicable).

Qwen2-72B-Instruct

New

Qwen2-72B-Instruct is a large-scale instruction-following language model designed to handle complex tasks in natural language understanding and generation. It outperforms competitors like LLaMA 3 and Mixel 8, particularly excelling in coding, math, and logical reasoning. Practical tests show that the 72B model handles intricate coding tasks with precision and provides well-structured explanations for complex logic problems. It excels in multilingual capabilities, handling both English and Chinese.

  • Instruction following

  • Coding

  • Math

  • Logical reasoning

  • General purpose language tasks

Qwen2-7B-Instruct

New

Qwen2-7B-Instruct is a smaller, efficient model with 7 billion parameters that provides reliable performance across various natural language tasks. It balances computational efficiency with robust capabilities, making it suitable for chatbots, content creation, and language translation. While not as powerful as the 72B model, it still outperforms many competitors in head-to-head evaluations, offering a versatile tool for users needing solid performance without the computational demands of larger models.

  • Chatbots

  • Content creation

  • Language translation

Sarashina2-7b

New

Sarashina2-7B is a Japanese language model developed by SB Intuitions. The model excels in natural language processing tasks and outperforms its predecessor (Sarashina1) in various benchmarks such as AI王, JCommonsenseQA, and JSQuAD. The model benefits from improved pretraining methods, leading to enhanced capabilities in answering complex questions and understanding Japanese text. Sarashina2 aims to provide advanced performance in Japanese language tasks, showcasing significant improvements over earlier iterations. This 7B version of Sarashina2 is compact and efficient compared to its other variants.

  • Multilingual

  • Japanese translation

  • General purpose language tasks

Sarashina2-70b

New

Sarashina2-70B is a Japanese language model developed by SB Intuitions. The model excels in natural language processing tasks and outperforms its predecessor (Sarashina1) in various benchmarks such as AI王, JCommonsenseQA, and JSQuAD. The model benefits from improved pretraining methods, leading to enhanced capabilities in answering complex questions and understanding Japanese text. Sarashina2 aims to provide advanced performance in Japanese language tasks, showcasing significant improvements over earlier iterations. This 70B version of Sarashina2 can be used for tasks needing high accuracy.

  • Multilingual

  • Japanese translation

Meta-Llama-3.1-8B-Instruct

Existing

Meta-Llama-3.1-8B-Instruct is an instruction following model offering a larger context window than its predecessor, Llama 3. We support up to 8K context in this release. It has multilingual capability, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

  • Multilingual

  • Large outputs

  • General purpose

  • Document analysis

In this release, the platform supports a maximum context length of 8k for this model. Support for longer context lengths up to 128k is targeted for a subsequent release.

Meta-Llama-3.1-70B-Instruct

Existing

Meta-Llama-3.1-70B-Instruct is an instruction following model, developed by Meta, that offers a larger context window than its predecessor, Llama 3. We support up to 8K context in this release. It has multilingual capability, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The 70B parameter model performs better in benchmarks such as MATH, GSM8K (grade school math), and MMLU (knowledge acquisition) than its 8B parameter variant.

  • Multilingual

  • Large outputs

  • General purpose

  • Document analysis

In this release, the platform supports a maximum context length of 8k for this model. Support for longer context lengths up to 128k is targeted for a subsequent release.

Meta-Llama-3-8B-Instruct

Existing

Meta-Llama-3-8B-Instruct is an instruction following model belonging to the Llama 3 family of large language models. It introduced improvements on Llama 2 in areas such as false refusal rates, alignment, and diversity in model responses. This family of models also sees improvements in capabilities like reasoning, code generation, instruction following, dialogue use cases, helpfulness, and safety. The Meta-Llama-3-8B-Instruct model, at 8B parameters, can be used for tasks and use-cases that prioritize efficiency and need lower computational workloads.

  • Instruction following

  • General purpose language tasks

Meta-Llama-3-70B-Instruct

Existing

Meta-Llama-3-70B-Instruct is an instruction following model belonging to the Llama 3 family of large language models. It introduced improvements on Llama 2 in areas such as false refusal rates, alignment, and diversity in model responses. This family of models also sees improvements in capabilities like reasoning, code generation, instruction following, dialogue use cases, helpfulness, and safety. The Meta-Llama-3-70B-Instruct model, with its 70B parameters, is a balance of performance and resource efficiency.

  • Instruction following

  • General purpose language tasks

Llama-2-7b-Chat-hf

Existing

Llama-2-7b-Chat-hf is a large language model, created by Meta, that expanded on the capabilities of the Llama 1 family. The Llama 2 family of models was trained on significantly more data than those of its predecessor. Llama-2-7b-Chat-hf can be used for use-cases valuing performance and efficiency. It is also more compact than its 13B and 70B variants, while still maintaining accuracy.

  • Conversational

  • Chat

  • Dialogue

  • Assistant-like chat

Llama-2-13b-Chat-hf

Existing

Llama-2-13b-Chat-hf is a large language model, created by Meta, that expanded on the capabilities of the Llama 1 family. The Llama 2 family of models was trained on significantly more data than those of its predecessor. Llama-2-13b-Chat-hf strikes a balance between performance and accuracy, sitting at 13B parameters.

  • Conversational

  • Chat

  • Dialogue

  • Assistant-like chat

Llama-2-70b-Chat-hf

Existing

Llama-2-70b-Chat-hf is a large language model, created by Meta, that expanded on the capabilities of the Llama 1 family. The Llama 2 family of models was trained on significantly more data than those of its predecessor. This chat model is optimized for dialogue use cases. Llama-2-70b-Chat-hf, compared to its 13B and 7B parameter variants, uses Grouped-Query Attention (GQA) for improved inference scalability.

  • Conversational

  • Chat

  • Dialogue

  • Assistant-like chat

Mistral-7B-Instruct-v0.2

Existing

Mistral-7B-Instruct-v0.2 is an instruction fine-tuned version of the Mistral-7B-v0.2 language model, tailored for tasks requiring precise instruction-following capabilities. This model is particularly well-suited for a variety of applications, including content generation, text analysis, and problem-solving. It excels in creating coherent and contextually relevant text, making it ideal for tasks like report writing, code generation, and answering questions. The enhancements in this version enable it to handle more sophisticated tasks with higher accuracy and efficiency.

  • Instruction following

  • General purpose tasks

e5-Mistral-7B-Instruct

Existing

e5-Mistral-7B-Instruct is a text embedding model derived from Mistral-7B-v0.1. This model can be used to generate text embeddings and a similarity score based on the inputs passed in. It additionally supports other tasks through task instructions in the chat template (see the modelcard for detailed information). These tasks include web search query (assuming the web data is passed to the model), semantic text similarity, summarization, or retrieval of parallel text. Although this model has multilingual capabilities, it is recommended that this model is used with English text.

  • Embedding

  • Text similarity

  • Retrieval

The e5-mistral-7b-Instruct embedding models only support the Predict API, not the Stream API.

Deepseek-coder-6.7B-Instruct

Existing

Deepseek-coder-6.7B-Instruct is a compact, instruction following code model. This model can support use-cases such as code generation, code interpretation, debugging, and code refactoring. The model supports English and Chinese natural languages as well as low-level languages like Assembly, C, C++, and Rust. Additionally, Deepseek-coder-6.7B-Instruct supports a multitude of languages and implementations, including general-purpose languages (C#, Go, Java, Python, Ruby, and TypeScript), functional programming (web development with CSS, HTML, and JavaScript), markup languages (JSON and Markdown), scripting languages (PowerShell and Shell), data and statistical tools (R and SQL), domain-specific languages (SQL and Verilog), and other tools (CMake, Makefile, Dockerfile, and Jupyter Notebook).

  • Coding model

  • Code generation

Deepseek-coder-33B-Instruct

Existing

Deepseek-coder-33B-Instruct is an instruction following code model. This model can support use-cases such as code generation, code interpretation, debugging, and code refactoring. The model supports English and Chinese natural languages as well as low-level languages like Assembly, C, C++, and Rust. Additionally, Deepseek-coder-33B-Instruct supports a multitude of languages and implementations, including general-purpose languages (C#, Go, Java, Python, Ruby, and TypeScript), functional programming (web development with CSS, HTML, and JavaScript), markup languages (JSON and Markdown), scripting languages (PowerShell and Shell), data and statistical tools (R and SQL), domain-specific languages (SQL and Verilog), and other tools (CMake, Makefile, Dockerfile, and Jupyter Notebook).

  • Coding model

  • Code generation

Solar-10.7B-Instruct-v1.0

Existing

Solar-10.7B-Instruct-v1.0 is a general-purpose fine-tuned variant of its predecessor, SOLAR-10.7B. This model family uses a methodology called depth up-scaling (DUS), which makes architectural changes to a Llama 2 based model by integrating Mistral 7B weights into upscaled layers and continuing pretraining on the result. With only 10.7 billion parameters, it offers state-of-the-art performance in NLP tasks, even outperforming models with up to 30 billion parameters.

  • General purpose

  • Compact

EEVE-Korean-Instruct-10.8B-v1.0

Existing

The EEVE-Korean-Instruct-10.8B-v1.0 is a Korean and English instruction following model adapted from SOLAR-10.7B and Phi-2 that uses vocabulary expansion (EEVE) techniques, amongst others, to create a model that can transfer its knowledge and understanding into Korean. It can perform traditional NLP tasks in Korean.

  • General purpose model

  • Korean

  • Multilingual

Summary of released Composition of Experts (CoE)

This update consolidates the previously divided CoEs from the v0.2 release into a single, converged bundle. This consolidation not only streamlines the model organization, but also improves the overall ease of use of the platform.

  • The Samba-1 Turbo v0.3 release runs only on SambaNova’s SN40L hardware generation.

  • The Samba-1 Turbo v0.3 release requires SambaStudio release 24.8.1 or later.

Each entry below lists, in order: CoE name, expert/model names, supported dynamic batch sizes, and description.

Samba-1 Turbo

  • Meta-Llama-3-8B-Instruct: 1, 4, 8, 16, 32

  • Meta-Llama-3-70B-Instruct: 1, 4, 8, 16, 32

  • Mistral-7B-Instruct-V0.2: 1, 4, 8, 16, 32

  • e5-mistral-7b-Instruct: 1, 4, 8

  • Qwen2-7B-Instruct: 1, 4, 8, 16

  • Qwen2-72B-Instruct: 1, 4, 8, 16

Samba-1 Turbo is an example composition of high-performance inference models built by SambaNova. This composition integrates multiple LLMs and one text embedding model, enabling RAG applications through a single endpoint. Deployable on 8 RDUs, each model can be accessed directly via an API call with low inter-model switching time. Each model offers multiple batch sizes, and the platform automatically selects the optimal batch size based on the specific combination of requests at inference time for the best concurrency.
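The idea of choosing among supported batch sizes can be illustrated with a short sketch. The batch sizes below are copied from the Samba-1 Turbo table above. The selection policy itself is an assumption made for illustration (pick the smallest supported size that holds all pending requests, otherwise the largest), since these notes do not describe how the platform actually selects a batch size. The names `SUPPORTED_BATCH_SIZES` and `pick_batch_size` are hypothetical.

```python
from typing import Dict, List

# Supported dynamic batch sizes per expert, as listed for Samba-1 Turbo v0.3.
SUPPORTED_BATCH_SIZES: Dict[str, List[int]] = {
    "Meta-Llama-3-8B-Instruct": [1, 4, 8, 16, 32],
    "Meta-Llama-3-70B-Instruct": [1, 4, 8, 16, 32],
    "Mistral-7B-Instruct-V0.2": [1, 4, 8, 16, 32],
    "e5-mistral-7b-Instruct": [1, 4, 8],
    "Qwen2-7B-Instruct": [1, 4, 8, 16],
    "Qwen2-72B-Instruct": [1, 4, 8, 16],
}


def pick_batch_size(model: str, pending_requests: int) -> int:
    """Pick a batch size for `model` given the number of waiting requests.

    Hypothetical policy: the smallest supported size that fits every pending
    request, falling back to the largest supported size when the queue is
    longer than any single batch.
    """
    if pending_requests < 1:
        raise ValueError("pending_requests must be at least 1")
    sizes = sorted(SUPPORTED_BATCH_SIZES[model])
    for size in sizes:
        if size >= pending_requests:
            return size
    return sizes[-1]


# Five queued embedding requests fit in a batch of 8.
embedding_batch = pick_batch_size("e5-mistral-7b-Instruct", 5)
# Twenty queued requests exceed the largest Qwen2-7B batch, so use 16.
qwen_batch = pick_batch_size("Qwen2-7B-Instruct", 20)
```

A real scheduler would also weigh latency targets and sequence lengths; this sketch only shows how the per-model size lists constrain the choice.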

Summary of deprecated Composition of Experts (CoE)

The v0.3 release consolidates the finely divided out-of-the-box CoE bundles from the v0.2 release into a more unified, converged form. Users can now access v0.2 models in two ways: through the new, consolidated out-of-the-box CoE bundle: Samba-1 Turbo, in the v0.3 release, or as expert options that users can select while creating their own custom CoE bundles during the CoE creation workflow. This consolidation simplifies model management while maintaining flexibility for users who require customized model groupings.

Each entry below lists, in order: CoE name, expert/model names, supported dynamic batch sizes, and description.

Samba-1 Turbo [Beta]

  • Meta-Llama-3-8B-Instruct-4096: 1, 4, 8, 16

  • Mistral-7B-Instruct-V0.2-4096: 1, 4, 8, 16

  • llama-2-7B-chat-hf: 1, 4, 8, 16

  • deepseek-coder-6.7B-instruct-4096: 1, 4, 8, 16

  • Llama-2-13B-chat-hf: 1, 4, 8, 16

  • Meta-Llama-3-70B-Instruct-4096: 1, 4, 8, 16

Samba-1 Turbo [Beta] contains a breadth of general purpose LLMs in addition to a coding model. The parameter counts in this CoE vary to provide both highly accurate models as well as those that are more light-weight. This CoE can be used for general-purpose applications as well as those requiring coding tasks.

Samba-1 Turbo with embedding - small [Beta]

  • Meta-Llama-3-8B-Instruct-4096: 1, 4, 8, 16

  • Meta-Llama-3-8B-Instruct-8192: 1, 4, 8, 16

  • Mistral-7B-Instruct-V0.2-4096: 1, 4, 8, 16

  • Mistral-7B-Instruct-V0.2-32768: 1, 4, 8, 16

  • SOLAR-10.7B-Instruct-v1.0: 1, 4, 8, 16

  • EEVE-Korean-Instruct-10.8B-v1.0: 1, 4, 8, 16

  • e5-mistral-7b-instruct-8192: 1, 4, 8

  • e5-mistral-7b-instruct-32768: 1, 4, 8

Samba-1 Turbo with embedding - small [Beta] comprises the small, performant versions of the new LLMs in this Samba-1 Turbo release. In addition to the LLMs, it contains the e5-mistral-7b-instruct text embedding models for tasks needing an embedding output. The two sequence length variants of the embedding models are 8192 and 32768, used for shorter-form and longer-form content respectively. This CoE can be used for use-cases requiring embedding models in a light-weight and performant context.

Samba-1 Turbo Llama 3.1 70B 4096 dynamic batching

Meta-Llama-3.1-70B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3.1 70B 4096 dynamic CoE contains the 4096 sequence length variant of Llama 3.1 70B, making it a CoE that can be used for general purpose tasks with shorter-form inputs. Generally this results in a quicker first token latency.

Samba-1 Turbo Llama 3.1 70B 8192 dynamic batching

Meta-Llama-3.1-70B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3.1 70B 8192 dynamic CoE contains the 8192 sequence length variant of Llama 3.1 70B, making it a CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Llama 3.1 8B 4096 dynamic batching

Meta-Llama-3.1-8B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3.1 8B 4096 dynamic CoE contains the 4096 sequence length variant of Llama 3.1 8B, making it a CoE that can be used for general purpose tasks with shorter-form inputs. Generally this results in a quicker first token latency.

Samba-1 Turbo Llama 3.1 8B 8192 dynamic batching

Meta-Llama-3.1-8B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3.1 8B 8192 dynamic CoE contains the 8192 sequence length variant of Llama 3.1 8B, making it a CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Deepseek Coder 6.7B 4096 dynamic batching

deepseek-coder-6.7B-instruct

1,4,8,16

The Samba-1 Turbo Deepseek Coder 6.7B 4096 dynamic batching CoE contains the 4096 sequence length variant of deepseek-coder-6.7B-instruct. This CoE can be used for instruction-based coding tasks.

Samba-1 Turbo Llama 2 13B 4096 dynamic batching

Llama-2-13B-chat-hf

1,4,8,16

The Samba-1 Turbo Llama 2 13B 4096 dynamic batching CoE contains the 13B parameter variant of Llama 2. This CoE can be used for general-purpose tasks in a conversational or dialogue-based setting.

Samba-1 Turbo Llama 2 7B 4096 dynamic batching

llama-2-7B-chat-hf

1,4,8,16

The Samba-1 Turbo Llama 2 7B 4096 dynamic batching CoE contains the 7B parameter variant of Llama 2. This CoE can be used for general-purpose tasks in a conversational or dialogue-based setting.

Samba-1 Turbo Llama 3 70B 4096 dynamic batching

Meta-Llama-3-70B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3 70B 4096 dynamic batching CoE contains the 4096 sequence length variant of Llama 3 70B, making it a CoE that can be used for general purpose tasks with shorter-form inputs. Generally this results in a quicker first token latency.

Samba-1 Turbo Llama 3 70B 8192 dynamic batching

Meta-Llama-3-70B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3 70B 8192 dynamic batching CoE contains the 8192 sequence length variant of Llama 3 70B, making it a CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Llama 3 8B 4096 dynamic batching

Meta-Llama-3-8B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3 8B 4096 dynamic batching CoE contains the 4096 sequence length variant of Llama 3 8B, making it a relatively compact CoE that can be used for general purpose tasks with shorter-form inputs.

Samba-1 Turbo Llama 3 8B 8192 dynamic batching

Meta-Llama-3-8B-Instruct

1,4,8,16

The Samba-1 Turbo Llama 3 8B 8192 dynamic batching CoE contains the 8192 sequence length variant of Llama 3 8B, making it a relatively compact CoE that can be used for general purpose NLP tasks.

Samba-1 Turbo Mistral 7B 4096 dynamic batching

Mistral-7B-Instruct-V0.2

1,4,8

The Samba-1 Turbo Mistral 7B 4096 dynamic batching CoE contains the compact 7B parameter Mistral instruction following model. This CoE can be used for general purpose instruction following or assistant-like tasks with smaller-form inputs.

Samba-1 Turbo Llama 2 70B 4096 dynamic batching

llama-2-70b-chat-hf

1,4,8,16

The Samba-1 Turbo Llama 2 70B 4096 dynamic batching CoE contains the large Llama 2 70B parameter model variant. The CoE can be used for general purpose dialogue or chat-based applications along with smaller-form inputs.

Samba-1 Turbo Deepseek Coder 33B 4096 dynamic batching

deepseek-coder-33B-instruct-4096

1,4,8,16

The Samba-1 Turbo Deepseek Coder 33B 4096 dynamic batching CoE contains the 33B variant of deepseek-coder-instruct at a 4096 sequence length. This CoE can be used for coding tasks with relatively short inputs.

Samba-1 Turbo Deepseek Coder 33B 16384 dynamic batching

deepseek-coder-33B-instruct-16384

1,4

This CoE contains the 33B variant of deepseek-coder-instruct at a 16384 sequence length. Because of the increased sequence length, this CoE can be used for coding-related tasks with longer inputs, such as reading in larger code blocks for interpretation.

Samba-1.1

94 expert models

1

Samba-1.1 was an iteration of Samba-1 and has been superseded by newer options. It contains 94 expert models.

Samba-1.0

56 expert models

1

Samba-1.0 is a Composition of Experts made up of strategically curated expert models from the open source community, offering state-of-the-art accuracy across a diverse set of enterprise tasks and processes. This composition comprises 56 models.

Known limitations

  • When making requests to an inference endpoint, you may occasionally encounter errors that are not directly related to your specific input. This is due to a known limitation where all requests within the same batch are processed as a group. As such, if one request in the batch fails, the entire batch fails and all requests receive the same error message.

    • For example, consider a batch of four requests. If three of the requests are valid, but one request exceeds the maximum supported context length of the selected model, all four requests in the batch will fail. Even though only one request exceeded the context length limit, every request in the batch returns the error for exceeding the maximum context length.

  • Cancelling requests for Samba-1 Turbo App models has a known limitation where the cancellation does not prevent the model from processing the requests.

    • For example, if a user sends 30 concurrent requests and immediately cancels all of them, the model will still process those requests. This can lead to a slower time to first token (TTFT) for any subsequent requests, as the model is still busy handling the cancelled requests.

  • In this release, Meta-Llama-3.1-8B-Instruct and Meta-Llama-3.1-70B-Instruct support a maximum context length of 8k. Support for longer context lengths up to 128k is targeted for a subsequent release.

  • The e5-mistral-7b-Instruct embedding models only support a batch size (BS) up to 8.

  • When using the e5-mistral-7b-Instruct embedding models in the Playground, an error may be displayed due to a mismatch between the Predict and Stream APIs. This is not a crash, but expected behavior when using an embedding model in the Playground, which is not its typical usage.
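Because a single over-length request can fail a whole batch, a client can reduce the impact of the batch limitation described above in two ways: check its own request length before sending, and retry when an error may have been caused by another request in the same batch. The following is a minimal client-side sketch, not SambaStudio functionality. It assumes that "8k" means 8192 tokens, that the limit covers prompt plus generated tokens, and that a retried request is likely to land in a different batch. The names `fits_context` and `call_with_retry` and the `send` callable are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: the "8k" limit in these notes corresponds to 8192 tokens.
MAX_CONTEXT_TOKENS = 8192


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """Return True if the prompt plus requested output fits within `limit`."""
    return prompt_tokens + max_new_tokens <= limit


def call_with_retry(send: Callable[[], T], attempts: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `send`, retrying with exponential backoff when it raises.

    `send` is a hypothetical zero-argument callable that performs the request.
    A real client should catch only the error types it knows are retryable.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("unreachable")
```

A caller would first skip or truncate any request for which `fits_context` returns False, then wrap the actual request in `call_with_retry`. Since cancelled requests are still processed, retries should stay bounded so they do not add avoidable load.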