Model Zoo best practices

How to pass in arguments

Model Zoo uses the Hydra framework for managing app-level parameters. You manage arguments through the YAML config file and on the command line. Our example YAML files have detailed comments. In most cases, you specify only the input and output paths and use the defaults for the other parameters.

  • The YAML file includes a set of default parameters for compiling and running a model. You can add and modify parameters in the YAML file.

  • When you run Python scripts for an example app to compile your model:

    • The app uses parameters and their values from the YAML file. See the text generation /config folder for some examples.

    • If a parameter doesn’t have a value in the YAML file, you have to specify it on the command line.

    • On the command line, you can override values that are specified in the YAML file.

    • If you want to specify an argument that is not in the YAML file at all, you precede it with a + sign.

    • For more information on overriding arguments, refer to https://hydra.cc/docs/advanced/override_grammar/basic/.

In the following example, the command and model checkpoint path are in the YAML file, but we want to specify their values explicitly. The target SambaFlow version is not specified in the YAML file, so it must be added as an additional argument. Optionally, you can also supply a custom name for the PEF:

python rdu_generate_text.py \
  command=compile \
  checkpoint.model_name_or_path=PATH_TO_DOWNLOADED_MODEL \
  +samba_compile.target_sambaflow_version=MAJOR.MINOR.PATCH \
  +samba_compile.pef_name=mypef

Recommended checkpoints

This section lists all recommended checkpoints.

Model Zoo is not limited to these checkpoints; it is compatible with any bfloat16 or float32 precision Hugging Face checkpoint. Try other compatible checkpoints, or use the list below as a starting point.
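
To check whether a local checkpoint uses one of these precisions, you can inspect the torch_dtype field in its config.json. A minimal Python sketch (the checkpoint path is a placeholder):

import json
from pathlib import Path

# Placeholder path to a locally downloaded Hugging Face checkpoint.
ckpt_dir = Path("/path/to/downloaded/checkpoint")

# Hugging Face checkpoints record their precision in config.json.
config = json.loads((ckpt_dir / "config.json").read_text())
dtype = config.get("torch_dtype")

if dtype in ("bfloat16", "float32"):
    print(f"OK: checkpoint precision is {dtype}")
else:
    print(f"Warning: precision {dtype} may not be compatible with Model Zoo")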

  • Llama 2 7B: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, https://huggingface.co/meta-llama/Llama-2-7b-hf

  • Llama 2 13B: https://huggingface.co/meta-llama/Llama-2-13b-chat-hf, https://huggingface.co/meta-llama/Llama-2-13b-hf

  • Llama 2 70B: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf, https://huggingface.co/meta-llama/Llama-2-70b-hf

  • Llama 3 8B: https://huggingface.co/meta-llama/Meta-Llama-3-8B

  • Gemma 7B: https://huggingface.co/google/gemma-7b-it, https://huggingface.co/google/gemma-7b

  • Mistral 7B: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1, https://huggingface.co/mistralai/Mistral-7B-v0.1

For the best text generation quality, use the chat version of a checkpoint when one is available.
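
For example, you can download a recommended checkpoint with the huggingface_hub Python package (a sketch; the local directory is a placeholder, and gated repos such as Llama 2 require approved access and a Hugging Face token):

from huggingface_hub import snapshot_download

# Download the chat variant of Llama 2 7B to a local directory (placeholder).
# Gated repos require approved access and a token, e.g. via `huggingface-cli login`.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="/opt/ckpt_llama7b",
)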

Making changes to Model Zoo models

You can experiment with Model Zoo parameters and make some changes to Model Zoo source code. In summary:

  • base_config.yaml: Model-specific parameters such as max_seq_length or tensor_parallel. Experiment with these parameters to see how they affect your model. Recompile? Usually no, with some exceptions. See Making changes to base_config.yaml parameters.

  • config.json: Configuration associated with a checkpoint, downloaded along with the checkpoint. Recompile? Yes. See Making changes to config.json parameters.

  • Source code: Source code that has been customized to work efficiently on RDU is included in the modelzoo repo. Experiment, for example, by using a different operator supported by SambaFlow. Recompile? Yes. See Making changes to source code.

Making changes to base_config.yaml parameters

Model Zoo uses Hydra and Pydantic for argument management (see How to pass in arguments). That means you can set SambaNova-specific parameter values either in base_config.yaml or on the command line.

The following parameter value ranges are recommended. You can experiment with other parameter values.

For information about each of the parameters, see the commented base_config_*.yaml file in the public GitHub repo, for example, base_config_rdu.yaml.

model:
  use_segmented_softmax_attn: [false, true]
  max_seq_length: [4096, 8192]
samba_compile:
  tensor_parallel: [none]
  n_chips: [1]
  run_early_tp: [false]
generation:
  batch_size: [1, 2, 4, 8] # needs recompile

All models were also tested with the default values in the base_config.yaml file.
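
For example, to compile with a longer sequence length and a larger batch size, you can override these parameters on the command line instead of editing the YAML file (a sketch; the parameter paths mirror the base_config.yaml structure shown above):

python rdu_generate_text.py \
  command=compile \
  checkpoint.model_name_or_path=PATH_TO_DOWNLOADED_MODEL \
  model.max_seq_length=8192 \
  generation.batch_size=4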

For Llama 2 70B base configuration models, you must use Tensor Parallel mode to ensure the model fits on the RDU. Use these settings in the samba_compile section:

samba_compile:
  tensor_parallel: weight
  n_chips: 2
  num_tiles: 8
  early_tp: true
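
If you prefer the command line, the same settings can be passed as overrides when compiling (a sketch; PATH_TO_LLAMA_70B_MODEL is a placeholder):

python rdu_generate_text.py \
  command=compile \
  checkpoint.model_name_or_path=PATH_TO_LLAMA_70B_MODEL \
  samba_compile.tensor_parallel=weight \
  samba_compile.n_chips=2 \
  samba_compile.num_tiles=8 \
  samba_compile.early_tp=true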

Making changes to config.json parameters

You download a config.json file for the model when you download the Hugging Face checkpoint. If changes to base_config.yaml aren't comprehensive enough for what you intend to do, you can experiment with making changes to config.json parameters.

Any changes to the config.json require a recompile.

We support all the base JSON configurations associated with recommended checkpoints (see Recommended checkpoints). If you use other parameter values, our validator displays an error.
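
As a simple illustration, here is one way to script a config.json edit in Python (a sketch; the path is a placeholder and the parameter change is a hypothetical example, subject to the validator check described above):

import json

# Placeholder path to the config.json downloaded with the checkpoint.
config_path = "/opt/ckpt_llama7b/fp32/config.json"

with open(config_path) as f:
    config = json.load(f)

# Hypothetical example change; values outside the supported base
# configurations will trigger a validator error.
config["max_position_embeddings"] = 8192

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)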

See the Model Zoo Troubleshooting document for details on how to experiment with parameters we have not yet tested. Let us know which changes result in improvements, or if changes result in a failure to compile.

Making changes to source code

Model Zoo includes source code that has been customized to work efficiently on RDU. You can experiment with changes to the source code. In particular, consider exploring using a different operator in certain situations. Supported SambaFlow operators are in the SambaFlow API Reference.

A change to the source code always requires a recompile.

Let us know which changes result in improvements, or which result in a failure to compile.

Example: Changing the attention module

Here’s a simple example for making source code changes. The actual file paths depend in part on how you’re defining certain directories in your environment.

  1. Make the following changes to the source files:

    1. $HOME/sambanova_modelzoo/modelzoo/models/llama/modeling_llama.py

      self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
      self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
      self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
      self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
      + self.o_proj_2 = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
    2. $HOME/sambanova_modelzoo/modelzoo/models/llama/patch_llama.py

      if self.pretraining_tp > 1:
          attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
          o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
          attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
      else:
          attn_output = self.o_proj(attn_output)
      +    attn_output = self.o_proj_2(attn_output)
  2. Compile to generate a new PEF with these changes:

    cd /opt/modelzoo/example/nlp/text_generation/
    
    python rdu_generate_text.py \
      command=compile \
      checkpoint.model_name_or_path=/opt/ckpt_llama7b/fp32/ \
      samba_compile.output_folder=/opt/out/ \
      +samba_compile.pef_name=llama7b_Source_change \
      +samba_compile.target_sambaflow_version=1.19.1
  3. Verify that the changes are reflected in the PEF:

    PEF_PATH=/opt/out/llama7b_Source_change/llama7b_Source_change.pef /opt/sambanova/bin/python -c "import os; from pypefapi import PyPefApi; pef = PyPefApi(os.environ.get('PEF_PATH')); print(pef.pypef.metadata)"

    In the metadata that is sent to stdout, you should see the additional Linear operator:

    (o_proj_2): Linear(in_features=4096, out_features=4096, bias=False)
  4. You can now run the model with a command like the following:

    python rdu_generate_text.py \
     command=run \
     checkpoint.model_name_or_path=/opt/ckpt_llama7b/fp32/ \
     samba_run.pef=/opt/out/llama7b_Source_change/llama7b_Source_change.pef

Information about model runs

After a model run, you can look at information about the run.

Training

When you run training, the app generates summary.txt and per_step_metrics.csv files.

The summary.txt file includes information like the following about the model run:

Number of epochs: 1
Per worker batch size: 2
Per worker number of batches (steps): 2
Number of DP workers: 2
Total tokens seen: 4914
Tokens per second: 120.8163
Average time per step: 20.3309s
The following are the model params used to train this model using Model Zoo:{"fp32_ln":false,"fp32_logits":true,"fp32_skip_add":true,"mixedp_attn":true,"max_seq_length":4096,"use_plugin_heuristics":false,"use_segmented_softmax_attn":false}
The per_step_metrics.csv file includes information like the following:

    Tokens in Step,Step Loss,Learning Rate,Time per Step
    tensor(2691),tensor(0.9211),1e-05,20.194304943084717
    tensor(2223),tensor(0.2960),1e-05,20.467589616775513
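
The token and loss values are written as tensor(...) strings. A minimal Python sketch for reading the file and stripping that wrapper (assuming the CSV layout shown above):

import csv
import re

def to_float(value: str) -> float:
    # Convert "tensor(0.9211)" or a plain number string to a float.
    match = re.match(r"tensor\((.+)\)", value.strip())
    return float(match.group(1) if match else value)

with open("per_step_metrics.csv") as f:
    for row in csv.DictReader(f):
        tokens = to_float(row["Tokens in Step"])
        loss = to_float(row["Step Loss"])
        step_time = to_float(row["Time per Step"])
        print(f"tokens={tokens:.0f}  loss={loss:.4f}  time={step_time:.2f}s")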

Generation (inference)

When you run text generation, information about the run is sent to standard output. Here's an example:

latencies
 time to first token 1.2131s
 tokens,  excluding first token 0.3460s
 tokens,  overall 0.3731s
 Total Latency 1.5592s
throughputs
 tokens/second excluding first token 2.8899
 tokens/second overall 2.6800
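
Note that the throughput figures are the reciprocals of the corresponding per-token latencies: 1/0.3460 s ≈ 2.8899 tokens/second excluding the first token, and 1/0.3731 s ≈ 2.6800 tokens/second overall.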