Create a training job
This document describes how to train a model by creating a new train model job, evaluate the job’s metrics, save a checkpoint to the model hub, and download logs of your training session.
Training jobs
You can fine-tune existing models by creating a Train model job. The platform performs the required data processing, such as tokenization, behind the scenes. You can select either a platform-provided dataset or your own dataset.
Create a train model job using the GUI
Create a training job using the GUI for fine-tuning by following the steps below.
- Create a new project or use an existing one.
- From a project window, click New job. The Create a new job window (Figure 1) will appear.
- Select Train model under Create a new job, as shown in Figure 1.
- Enter a name for the job into the Job name field, as shown in Figure 1.
- Select the ML App from the ML App drop-down, as shown in Figure 1.
  The ML App selected will refine the models displayed, by corresponding model type, in the Select model drop-down.
- From the Select model drop-down, choose My models, Shared models, SambaNova models, or Select from Model Hub.
  The available models displayed are determined by the ML App selected previously. If you wish to view models that are not related to the selected ML App, select Clear from the ML App drop-down. Selecting a model while the ML App drop-down is cleared will auto-populate the ML App field with the corresponding ML App for the model.
  - My models displays a list of models that you have previously added to the Model Hub.
  - Shared models displays a list of models that have been shared with the selected active tenant.
  - SambaNova models displays a list of models provided by SambaNova.
  Figure 1. Train model job
  - Select from Model Hub displays a window with a list of downloaded models that correspond to a selected ML App, as shown in Figure 2, or a list of all the downloaded models if an ML App is not selected. The list can be filtered by selecting options under Field of application, ML App, Architecture, and Owner. Additionally, you can enter a term or value into the Search field to refine the model list by that input. Choose the model you wish to use and confirm your choice by clicking Use model.
  Figure 2. Select from Model Hub
  Selecting a generative tuning model, such as GPT_13B_Generic_Human_Aligned_v2, will display the No of RDUs drop-down. This drop-down provides the ability to select the number of SambaNova Systems' Reconfigurable Dataflow Units™ (RDUs) to utilize, resulting in faster training. The information statement describes the required and currently available RDUs.
- From the Select dataset drop-down, choose My datasets, SambaNova datasets, or Select from datasets.
  Be sure to select a dataset that is prepared with the appropriate max_seq_length for your chosen model. 13B 8K SS models are compatible with datasets using max_seq_length=8192. 13B 2K SS models are compatible with datasets using max_seq_length=2048.
  - My datasets displays a list of datasets that you have added to the platform and can be used for a selected ML App.
  - SambaNova datasets displays a list of platform-provided datasets for a selected ML App.
  - Select from datasets displays the Dataset Hub window with a detailed list of datasets that can be used for a selected ML App, as shown in Figure 3. The My datasets and SambaNova checkboxes filter the dataset list by their respective group. The ML App drop-down filters the dataset list by the corresponding ML App. Choose the dataset you wish to use and confirm your choice by clicking Use dataset.
  Figure 3. Dataset hub
- Set the hyperparameters to govern your training job or use the default values. Expand the Hyperparameters & settings pane by clicking the blue double arrows to set hyperparameters and adjust settings, as shown in Figure 4.
  To generate evaluation metrics for your checkpoints, the eval_steps and save_steps hyperparameters must be set to the same value. This ensures that the evaluation is performed on the saved checkpoints.
  Figure 4. Hyperparameters & Settings
- Click Run job to queue the training job, as shown in Figure 4.
  Training jobs are long-running jobs. The completion time is heavily dependent on the hyperparameters set and the frequency of evaluation.
Create a training job using the CLI
The example below demonstrates how to create a training job using the snapi job create command. You will need to specify the following:
- A project to assign the job to. Create a new project or use an existing one.
- A name for your new job.
- train as the job type. This designates the job as a training job.
- The model to be used for the training job.
- The dataset you wish to use for the training job.
  The dataset must be compatible with the model you choose.
$ snapi job create \
--project <project-name> \
--job <your-new-job-name> \
--type train \
--model-checkpoint <model-name> \
--dataset <dataset-name>
You can check the job status by running the snapi job info command several times. The reported status changes as the job progresses. TRAINING indicates that the job is currently performing the training task. You will need to specify the following:
- The name of the project the job is assigned to.
- The name of the job you wish to view.
$ snapi job info \
--project <project-name> \
--job <job-name>
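Instead of re-running the status command by hand, the check can be wrapped in a small polling loop. The sketch below is illustrative only: wait_while_training is a hypothetical helper (not part of snapi), the interval is arbitrary, and it assumes that the output of snapi job info contains the word TRAINING while the job is running. The exact output format is not shown in this document, so adjust the pattern to match what your installation prints.

```shell
# wait_while_training is a hypothetical helper, not part of snapi.
# Usage: wait_while_training <seconds-between-checks> <status command...>
# Assumption: the status command prints the word TRAINING while the job runs.
# Note: the loop also ends if the status command fails or prints nothing,
# so confirm the final state with one more manual status check.
wait_while_training() {
    _interval="$1"   # seconds to sleep between checks
    shift            # the remaining arguments are the status command
    while "$@" | grep -q "TRAINING"; do
        sleep "$_interval"
    done
}

# Example (substitute your own project and job names):
# wait_while_training 60 snapi job info --project my-project --job my-job
```

Passing the status command as arguments keeps the helper independent of any particular snapi flags, so the same loop works with whatever form of the command you normally run.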
Evaluate the job using the GUI
Navigate to a Training job’s detail page during the job run (or after its completion) to view job information, generated checkpoints, and metrics. You can evaluate a checkpoint's accuracy, loss, and other metrics to determine if the checkpoint is of sufficient quality to deploy.
Navigate to a Training job’s detail page from the Dashboard or from its associated Project page.
View information and metrics
You can view the following information and metrics about your training job.
- Model
  Displays the model name and architecture used for training.
- Dataset
  Displays the dataset used, including its size.
- Details & Hyperparameters
  Displays the number of RDUs utilized and batch size. Click More to view a list of the hyperparameters and settings used during training. Click Less to hide the hyperparameters and settings list.
  Figure 5. Expanded Details & Hyperparameters
- Progress bar
  The progress bar displays the state of the training job as well as the percentage completed of the training run.
- Metrics graph
  Displays the various metrics generated during the training run. GPT 1.5B models, such as GPT_1.5B_NER_FINETUNED, generate additional metrics. Click Expand to view the additional metrics. Click Collapse to hide the additional metrics.
  Figure 6. Expanded additional metrics
- Checkpoints table
  The Checkpoints table displays generated checkpoints of your training run.
  - You can customize your view of the Checkpoints table by enabling/disabling columns, from the Columns drop-down, to help you focus on comparing metrics that are relevant to you.
  - Download a CSV file of your checkpoints by clicking Export and selecting Download as CSV from the drop-down. The CSV file will be downloaded to the location configured by your browser.
  - From the Actions column drop-down, you can select Save to Model Hub or Delete for a checkpoint.
  Figure 7. Checkpoints table
  - For GPT 1.5B models, you can view the Confusion matrix for a checkpoint that can be used to further understand checkpoint performance. From the Actions drop-down, click Checkpoint metrics. The Confusion matrix window will open.
  Figure 8. Confusion matrix
  All labels listed in your labels file must be represented in the validation dataset. This ensures that the confusion matrix does not generate errors associated with missing labels or incorrectly attributed metrics.
Evaluate the job using the CLI
Similar to the GUI, the SambaNova API (snapi) provides feedback on job performance via the CLI. The example below demonstrates the snapi job metrics command. You will need to specify the following:
- The project that contains, or is assigned to, the job whose metrics you wish to view.
- The name of the job whose metrics you wish to view.
If a Confusion Matrix can be generated for the job, the path to the generated matrix will be displayed in the output.
$ snapi job metrics \
--project <project-name> \
--job <job-name>
TRAINING
INDEX TRAIN_LOSS TRAIN_STEPS
0 0.0 0.0
1 2.47 10.0
2 2.17 20.0
3 2.02 30.0
4 2.06 40.0
5 2.01 50.0
6 2.0 60.0
7 1.93 70.0
8 2.0 80.0
9 1.95 90.0
10 2.0 100.0
VALIDATION
INDEX VAL_STEPS VAL_LOSS VAL_STEPS_PER_SECOND
0 0.0 2.04 0.13
1 50.0 2.03 0.13
2 100.0 2.03 0.13
Confusion Matrix generated here -> <path-to-generated-confusion-matrix-jpeg>
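When the validation table grows long, it can help to pull out the row with the lowest validation loss automatically. The following is a minimal sketch, not an official tool: best_val_step is a hypothetical helper, and it assumes the output keeps the whitespace-separated layout shown above (a VALIDATION line, a header row starting with INDEX, then one data row per evaluation with VAL_STEPS in the second column and VAL_LOSS in the third).

```shell
# best_val_step is a hypothetical helper, not part of snapi.
# Reads `snapi job metrics` output on stdin and prints the VAL_STEPS and
# VAL_LOSS of the validation row with the lowest loss (first row wins ties).
# Assumption: the output layout matches the example shown in this document.
best_val_step() {
    awk '
        # The VALIDATION line marks the start of the validation table.
        $1 == "VALIDATION"         { in_val = 1; next }
        # Skip the column header row.
        in_val && $1 == "INDEX"    { next }
        # Any line that does not start with a row index ends the table.
        in_val && $1 !~ /^[0-9]+$/ { in_val = 0 }
        # Track the row with the smallest VAL_LOSS (third column).
        in_val {
            if (!found || $3 + 0 < best_loss + 0) {
                best_loss = $3; best_step = $2; found = 1
            }
        }
        END { if (found) print best_step, best_loss }
    '
}

# Example (substitute your own project and job names):
# snapi job metrics --project my-project --job my-job | best_val_step
```

Applied to the example output above, the helper prints 50.0 2.03, since step 50 is the first row to reach the lowest validation loss. Lowest loss is only one possible selection criterion; compare the other metrics before deciding which checkpoint to keep.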
Save a checkpoint to the Model Hub
Once you’ve identified a checkpoint to use for inference or further fine-tuning, follow the steps below to create a new model card and save it to the Model Hub.
- Click the three dots in the Actions column of the checkpoint.
- Select Save to Model Hub from the drop-down. The Add to Model Hub box will open.
  Figure 9. Save to model hub
- Enter a name in the Model name field. This name will be the new model card name.
- From the Type drop-down, choose the model type you wish to create.
- The Share settings drop-down provides options for which tenant to share your model with.
  - Share with <current-tenant> allows the model to be shared with the current tenant you are using, identified by its name in the drop-down.
  - Share with all tenants allows the model to be shared across all tenants.
  - Model will be shared with all users in <current-tenant> indicates that the model will be shared with other users in the tenant you are using.
  If the Model will be shared with all users in <current-tenant> option is displayed, the Share with <current-tenant> and Share with all tenants options described above will not be available. Share with all tenants is an optional feature of SambaStudio. Please contact your administrator or SambaNova representative for more information.
- Click Add to Model Hub to create the model card and save the checkpoint.

View and download logs using the GUI
The Logs section allows you to preview and download logs of your training session. Logs can help you track progress, identify errors, and determine the cause of potential errors.
Logs can be visible in the platform earlier than other data, such as metrics, checkpoints, and job progress.
- From the Preview drop-down, select the log file you wish to preview.
  - The Preview window displays the latest 50 lines of the log.
  - To view more than 50 lines of the log, use the Download all feature to download the log file.
- Click Download all to download a compressed file of your logs. The file will be downloaded to the location configured by your browser.

View logs using the CLI
Similar to viewing logs using the GUI, you can use the SambaNova API (snapi) to preview and download logs of your training session.
View the job log file names
The example below demonstrates the snapi job list-logs command. Use this command to view the log file names of your training job. This is similar to using the Preview drop-down menu in the GUI to view and select your job log file names. You will need to specify the following:
- The project that contains, or is assigned to, the job whose log file names you wish to view.
- The name of the job whose log file names you wish to view.
$ snapi job list-logs \
--project <project-name> \
--job <job-name>
train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-runner.log
train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-model.log
Preview a log file
After you have viewed the log file names for your training job, you can use the snapi job preview-log command to preview the contents of a selected log file. The example below demonstrates the snapi job preview-log command. You will need to specify the following:
- The project that contains, or is assigned to, the job whose log file you wish to preview.
- The name of the job whose log file you wish to preview.
- The name of the job log file you wish to preview. This file name is returned by running the snapi job list-logs command, which is described above.
$ snapi job preview-log \
--project <project-name> \
--job <job-name> \
--file train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-runner.log
2023-08-10 20:28:46 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Runner starting...
2023-08-10 20:28:46 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Runner successfully started
2023-08-10 20:28:46 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Received new train request
2023-08-10 20:28:46 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Connecting to modelbox at localhost:50061
2023-08-10 20:28:54 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Running training
2023-08-10 20:28:54 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Staging dataset
2023-08-10 20:28:54 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - initializing metrics for modelbox:0
2023-08-10 20:28:54 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - initializing checkpoint path for modelbox:0
2023-08-10 20:31:35 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Preparing training for modelbox:0
2023-08-10 20:31:35 - INFO - 20be599f-9ea7-44ea-9dc5-b97294d97529 - Running training for modelbox
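The two commands above can be chained so that every log file of a job is previewed in one pass. The sketch below is a hedged example: preview_all_logs is a hypothetical helper (not part of snapi), and it assumes that snapi job list-logs prints only log file names, one per line, as in the example output shown earlier.

```shell
# preview_all_logs is a hypothetical helper, not part of snapi.
# Usage: <command that prints log file names> | preview_all_logs <project> <job>
# Assumption: stdin contains one log file name per line and nothing else.
preview_all_logs() {
    _project="$1"
    _job="$2"
    while IFS= read -r _log_file; do
        # Skip blank lines in the listing.
        [ -n "$_log_file" ] || continue
        # Print a separator so the previews are easy to tell apart.
        printf '===== %s =====\n' "$_log_file"
        # Redirect stdin so the command cannot consume the remaining file names.
        snapi job preview-log \
            --project "$_project" \
            --job "$_job" \
            --file "$_log_file" < /dev/null
    done
}

# Example (substitute your own project and job names):
# snapi job list-logs --project my-project --job my-job | preview_all_logs my-project my-job
```

Each preview is preceded by a separator line containing the file name, which makes it easier to tell the runner log from the model log when scanning the combined output.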
Download the logs
Use the snapi job download-logs command to download a compressed file of your training job’s logs. The example below demonstrates the snapi job download-logs command. You will need to provide the following:
- The project that contains, or is assigned to, the job whose compressed log file you wish to download.
- The name of the job whose compressed log file you wish to download.
$ snapi job download-logs \
--project <project-name> \
--job <job-name>
Successfully Downloaded: <job-name> logs
The default destination for the compressed file download is the current directory. To specify a destination directory, use the |