Model Zoo release notes

These release notes summarize Model Zoo features and known issues.

The initial Model Zoo release is concurrent with SambaFlow release 1.21. Release notes for SambaFlow are in the SambaFlow software release notes.

Model Zoo 0.1.0

This is the first official release of SambaNova Model Zoo, which is currently in Beta. Here is a glimpse of Model Zoo features and limitations.

Overview

  • Model Zoo is available in a new public GitHub repository, SambaNova Model Zoo. The repository contains RDU-compatible model source code for popular open source models, along with libraries and example apps for efficiently compiling and running the models on SambaNova hardware.

  • SambaNova customers can download a container image (Devbox) that includes the SambaFlow compiler, other SambaNova libraries and all prerequisite software dependencies.

    • Existing SambaNova customers can contact their Customer Support representative for access to the Devbox.

    • If you’re new to SambaNova and interested in trying out Model Zoo, contact us at help@sambanovasystems.com to get started.

Details

  • The example apps support and demonstrate training and fine-tuning, evaluation, and text generation (inference) workflows on RDU.

  • Model Zoo supports NLP models that are based on a transformer architecture. Models available in the GitHub repo include:

    • Llama-2 7B, 13B, 70B

    • Llama-3 8B

    • Gemma 7B

    • Mistral 7B

  • You can use the Model Zoo source code and example apps with Hugging Face checkpoints that use bf16 or fp32 precision. For an example, see the walkthrough in our public GitHub repo.

  • The repo includes example apps for running training and inference on CPU so you can compare the RDU workflow with the CPU workflow. The CPU apps are primarily meant to illustrate differences and similarities; they have been tested only with Llama-2 7B.

  • You can customize model parameters and make other changes to the model, including changes to the RDU-compatible model source code, within the constraints of the PyTorch operators supported on RDUs. See Making changes to Model Zoo models.

  • When you run training or inference, the output includes a summary report at the end of each file that logs key metrics.

  • Model Zoo includes a validator that reports an error if you use a configuration that SambaNova has not tested. You can set validate_config=False and proceed at your own discretion. Model Zoo Best Practices explains what SambaNova tested.

  • Model Zoo supports advanced performance-enhancing capabilities, such as data parallel and tensor parallel execution, in preview mode with limited functionality.
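
The validator behavior described in the list above can be pictured with a minimal sketch. Almost everything in it is hypothetical: `TESTED_CONFIGS`, `check_config`, the model name, the parameter names `max_seq_length` and `batch_size`, and the placeholder values are invented for illustration. Only the `validate_config` flag name comes from these notes. The actual validator is in the Model Zoo repo and may work differently.

```python
# Hypothetical sketch; NOT the Model Zoo validator.
from typing import Dict, FrozenSet, Tuple

# Placeholder (max_seq_length, batch_size) combinations, invented for
# illustration. See Model Zoo Best Practices for what was really tested.
TESTED_CONFIGS: Dict[str, FrozenSet[Tuple[int, int]]] = {
    "example-model": frozenset({(4096, 1), (4096, 8)}),
}


def check_config(model: str, max_seq_length: int, batch_size: int,
                 validate_config: bool = True) -> bool:
    """Return True if the combination is in the tested set.

    Raises ValueError for an untested combination unless validate_config
    is False, in which case the caller proceeds at their own discretion.
    """
    tested_set = TESTED_CONFIGS.get(model, frozenset())
    tested = (max_seq_length, batch_size) in tested_set
    if not tested and validate_config:
        raise ValueError(
            f"Untested configuration for {model}: "
            f"max_seq_length={max_seq_length}, batch_size={batch_size}. "
            "Set validate_config=False to proceed at your own discretion."
        )
    return tested
```

In this sketch, turning validation off does not make an untested configuration safe. It only converts the error into a `False` return value that the caller can log.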

Known issues and limitations

  • This Beta release of SambaNova Model Zoo allows you to run popular open source models on RDU. You can fine-tune any model with your own data, and make other changes to configuration and source code. However, this Beta release doesn’t yet include all the performance knobs required for high-performance production training or inference. We recommend that you use this release primarily for model experimentation.

  • Running in data parallel or tensor parallel mode is supported in preview mode with limited functionality.

    • Data parallel. Because host memory on each node is fixed, data parallel can run on at most 4 sockets for models similar in size to Llama-2 7B without running out of memory. The larger the model, the fewer replicas you can run. The out-of-memory issue occurs during checkpoint loading, when each worker loads its own checkpoint simultaneously. We are actively working on a sharded checkpoint loading API that will avoid this issue.

    • Tensor parallel. In this release of Model Zoo, we support tensor parallel only with Llama-2 70B. For that model, tensor parallel is required.

  • We have tested each model included in this release with a set of configuration parameter combinations. If you run with different parameters, the validator signals an error. You can set validate_config=False and continue experimenting at your own discretion. See Model Zoo Best Practices for details.

  • You can load checkpoints that are pretrained or fine-tuned with the training app into the inference app. Ensure that the inference app has the same model config as the training app that generated the checkpoint. Otherwise, the inference app’s model config takes precedence, and accuracy issues can result.
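
One way to guard against the config mismatch described in the last item is to compare the two model configs before loading a checkpoint. The sketch below is a generic illustration and not part of Model Zoo: `diff_model_configs` is a hypothetical helper, the keys in the usage example are illustrative, and it assumes both configs are available as plain dictionaries.

```python
# Hypothetical helper; not part of Model Zoo.
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a key present in only one config


def diff_model_configs(train_cfg: Dict[str, Any],
                       infer_cfg: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to (training value, inference value)."""
    diffs: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(train_cfg) | set(infer_cfg)):
        train_value = train_cfg.get(key, MISSING)
        infer_value = infer_cfg.get(key, MISSING)
        if train_value != infer_value:
            diffs[key] = (train_value, infer_value)
    return diffs


# Illustrative values only.
train_cfg = {"hidden_size": 4096, "num_hidden_layers": 32}
infer_cfg = {"hidden_size": 4096, "num_hidden_layers": 16}
mismatches = diff_model_configs(train_cfg, infer_cfg)
if mismatches:
    print("Model config mismatch:", mismatches)
```

An empty result means the two configs agree on every key. Any non-empty result is worth resolving before running inference with the checkpoint.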

Caveats

  • This is a Beta release. You may encounter bugs or limitations. To report issues, open a support case by emailing help@sambanova.ai or through the support portal at support.sambanova.ai.

  • We appreciate your patience and your feedback as we work towards a more polished experience.

  • Your input will help shape the final product. We’re grateful for your participation in this early stage of development.

Documentation

This doc set includes some conceptual background, best practices, and troubleshooting information for Model Zoo. Step-by-step instructions for container setup, running training and inference, etc. are in the GitHub repo.

The SambaNova Model Zoo GitHub repo includes:

  • A top-level README file that gives an overview and points to other README files.

  • A document with instructions for setup of the container environment and the Devbox container.

  • An /examples README with step-by-step instructions for running inference with a Hugging Face checkpoint, running fine-tuning with a dataset and a Hugging Face checkpoint, and running data parallel training.

  • README files for /text_generation and /training that include Quick Run commands and discussions of differences between RDU and CPU example apps and workflows.

  • A README in the /models directory that discusses each file included for each model.

  • A model card README for our Llama, Gemma, and Mistral implementations.

In addition, this doc set includes the following documents:

  • Model Zoo architecture and workflows. Explores the Model Zoo architecture and workflows.

  • Get started with Model Zoo. Introduces the architecture and walks through the steps for running a modified Llama model on RDU hardware.

  • Model Zoo best practices. Explains how to pass in arguments, make changes to the example apps, and examine information about a training run.

  • Model Zoo troubleshooting. Troubleshooting information for Model Zoo users.

  • The SambaFlow API Reference has details about the classes, methods, and operators used by Model Zoo. NOTE: In some cases, the code contains operators (e.g. gather and scatter) that map to a corresponding sn_* operator (e.g. sn_gather and sn_scatter).