Create a training job

Training jobs

You can fine-tune existing models by creating a training job. The platform performs the required data processing, such as tokenization, behind the scenes. You can select either a platform-provided dataset or your own dataset.

Create a train model job using the GUI

Follow the steps below to create a training job for fine-tuning using the GUI.

Training jobs are long-running jobs. The completion time depends heavily on the hyperparameters set and the frequency of evaluation.

  1. Create a new project or use an existing one.

  2. From a project window, click New job. The Create a new job window (Figure 1) will appear.

  3. Select Train model under Create a new job, as shown in Figure 1.

  4. Enter a name for the job into the Job name field, as shown in Figure 1.

  5. Select the ML App from the ML App drop-down, as shown in Figure 1.

    The selected ML App filters the models displayed in the Select model drop-down to those of the corresponding model type.

  6. From the Select model drop-down, choose My models, Shared models, SambaNova models, or Select from Model Hub.

    The available models displayed are defined by the previously selected ML App. If you wish to view models that are not related to the selected ML App, select Clear from the ML App drop-down. Selecting a model with the ML App drop-down cleared will auto-populate the ML App field with the corresponding ML App for that model.

    1. My models displays a list of models that you have previously added to the Model Hub.

    2. Shared models displays a list of models that have been shared with the selected active tenant.

    3. SambaNova models displays a list of models provided by SambaNova.

      Train model
      Figure 1. Train model job
    4. Select from Model Hub displays a window with a list of downloaded models that correspond to a selected ML App, as shown in Figure 2, or a list of all the downloaded models if an ML App is not selected. The list can be filtered by selecting options under Field of application, ML APP, Architecture, and Owner. Additionally, you can enter a term or value into the Search field to refine the model list by that input. Choose the model you wish to use and confirm your choice by clicking Use model.

      Model Hub
      Figure 2. Select from Model Hub

      Selecting a generative tuning model, such as GPT_13B_Generic_Human_Aligned_v2, will display the No of RDUs drop-down. This drop-down provides the ability to select the number of SambaNova Systems' Reconfigurable Dataflow Units™ (RDUs) to utilize, resulting in faster training. The information statement describes the required and currently available RDUs.

      Generative tuning RDUs
  7. From the Select dataset drop-down, choose My datasets, SambaNova datasets, or Select from datasets.

    Be sure to select a dataset that is prepared with the appropriate max_seq_length for your chosen model. For example, 13B 8K SS models are compatible with datasets using max_seq_length=8192, while 13B 2K SS models are compatible with datasets using max_seq_length=2048.

    1. My datasets displays a list of datasets that you have added to the platform and can be used for a selected ML App.

    2. SambaNova datasets displays a list of downloaded SambaStudio-provided datasets that correspond to a selected ML App.

    3. Select from datasets displays the Dataset Hub window with a detailed list of downloaded datasets that can be used for a selected ML App, as shown in Figure 3. The My datasets and SambaNova checkboxes filter the dataset list by their respective group. The ML App drop-down filters the dataset list by the corresponding ML App. Choose the dataset you wish to use and confirm your choice by clicking Use dataset.

      Dataset Hub
      Figure 3. Dataset hub
  8. Set the hyperparameters to govern your training job or use the default values. Expand the Hyperparameters & settings pane by clicking the blue double arrows to set hyperparameters and adjust settings, as shown in Figure 4.

    To generate evaluation metrics for your checkpoints, the eval_steps and save_steps hyperparameters must be set to the same value. This ensures that the evaluation is performed on the saved checkpoints.

    Hyperparameters and settings
    Figure 4. Hyperparameters & Settings
  9. Click Run job to submit the training job, as shown in Figure 4.

    1. If the required amount of storage space is not available to create the job, the Insufficient storage message (Figure 5) will appear, describing the Available space and the Required space to create the job. You will need to free up storage space or contact your administrator. Please choose one of the following options.

      1. Click Cancel to stop the job from being created. Please free up storage space and then restart the Create a train model job using the GUI process.

      2. Click Proceed anyway to submit the job. Please free up storage space, otherwise the job will fail to be created and will not train.

        A minimum of 10 minutes is required after sufficient storage space has been cleared before the job creation will successfully start.

        Insufficient storage message
        Figure 5. Example insufficient storage message for a training job

Create a training job using the CLI

The example below demonstrates how to create a training job using the snapi job create command. You will need to specify the following:

  • A project to assign the job to. Create a new project or use an existing one.

  • A name for your new job.

  • Use train as the job type. This designates the job to be a training job.

  • The model to be used for the training job.

  • The dataset you wish to use for the training job.

    The dataset must be compatible with the model you choose.

Example snapi job create command
$ snapi job create \
   --project <project-name> \
   --job <your-new-job-name> \
   --type train \
   --model-checkpoint <model-name> \
   --dataset <dataset-name>

Run snapi job create --help to display additional usage and options.

You can check the job status by running the snapi job info command several times; the status changes as the job progresses (a polling sketch follows the example below). TRAINING indicates the job is performing the task. You will need to specify the following:

  • The name of the project the job is assigned to.

  • The name of the job you wish to view.

Example snapi job info command
$ snapi job info \
   --project <project-name> \
   --job <job-name>

Run snapi job info --help to display additional usage and options.
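
If you would rather not re-run the command by hand, you can poll it from the shell. The loop below is a minimal sketch using only the command above and standard shell tools; it assumes the Status line format shown in the Job Info output later in this section and checks once a minute until the job leaves the TRAINING state.

Example status polling loop (illustrative)
$ while snapi job info \
     --project <project-name> \
     --job <job-name> | grep -q 'TRAINING'; do
    sleep 60    # wait one minute between checks
  done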

Evaluate jobs

Evaluate your job by viewing detailed information about its performance using the GUI or CLI.

Evaluate the job using the GUI

Navigate to a Training job’s detail page during the job run (or after its completion) to view job information, generated checkpoints, and metrics. You can evaluate a checkpoint’s accuracy, loss, and other metrics to determine whether the checkpoint is of sufficient quality to deploy.

Navigate to a Training job’s detail page from the Dashboard or from its associated Project page.

View information and metrics using the GUI

You can view the following information and metrics about your training job.

Model

Displays the model name and architecture used for training.

Dataset

Displays the dataset used, including its size.

Details & Hyperparameters

Displays the number of RDUs utilized and batch size. Click More to view a list of the hyperparameters and settings used during training. Click Less to hide the hyperparameters and settings list.

Generative tuning RDUs
Figure 6. Expanded Details & Hyperparameters
Progress bar

The progress bar displays the state of the training job as well as the percentage completed of the training run.

Metrics graph

Displays the various metrics generated during the training run. GPT 1.5B models, such as GPT_1.5B_NER_FINETUNED, generate additional metrics. Click Expand to view the additional metrics. Click Collapse to hide the additional metrics.

Expanded metrics
Figure 7. Expanded additional metrics
Checkpoints table

The Checkpoints table displays generated checkpoints of your training run.

  • You can customize your view of the Checkpoints table by enabling or disabling columns from the Columns drop-down, helping you focus on comparing the metrics that are relevant to you.

  • Download a CSV file of your checkpoints by clicking Export and selecting Download as CSV from the drop-down. The CSV file will be downloaded to the location configured by your browser.

  • From the Actions column drop-down, you can click Create new job, Save to Model Hub, or Delete for any checkpoint.

    Checkpoints table
    Figure 8. Checkpoints table
  • For GPT 1.5B model checkpoints, you can click Checkpoint metrics (Figure 8) to view the Confusion matrix (Figure 9), which can be used to further understand checkpoint performance.

    Confusion matrix
    Figure 9. Confusion matrix

    All labels listed in your labels file must be represented in the validation dataset. This ensures that the confusion matrix does not generate errors associated with missing labels or incorrectly attributed metrics.

Evaluate the job using the CLI

Similar to the GUI, the SambaNova API (snapi) provides feedback on job performance via the CLI.

View job information using the CLI

The example below demonstrates the snapi job info command used to provide information about your job, including its Job and Project IDs, Status, and training settings. You will need to specify the following:

  • The project that contains, or is assigned to, the job you wish to view information about.

  • The name of the job you wish to view information about.

Example snapi job info command
$ snapi job info \
   --project <project-name> \
   --job <job-name>

               Job Info
             ============
Name               : <job-name>
Job ID             : 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a
Type               : train
Project ID         : ce7d60d8-81ae-44ac-a2a2-6dcb5001bf3a
Status             : EXIT_WITH_0
App                : Generative Tuning 1.5B
Dataset            : GPT_1.5B_Training_Dataset
Input path         : common/datasets/ggt_sentiment_analysis/hdf5_single_avoid_overflow/hdf5
Model Checkpoint   : GPT_1.5B_GT_Pretrained
Hyper Parameters   : [{'param_name': 'batch_size', 'value': '16', 'description': 'Number of samples in a batch'}, {'param_name': 'do_eval', 'value': 'true', 'description': 'whether or not to do final evaluation'}, {'param_name': 'eval_steps', 'value': '50', 'description': "Period of evaluating the model in number of training steps. This parameter is only effective when evaluation_strategy is set to 'steps'."}, {'param_name': 'evaluation_strategy', 'value': 'steps', 'description': 'Strategy to validate the model during training'}, {'param_name': 'learning_rate', 'value': '7.5e-06', 'description': 'learning rate to use in optimizer'}, {'param_name': 'logging_steps', 'value': '10', 'description': 'Period of logging training loss in number of training steps'}, {'param_name': 'lr_schedule', 'value': 'cosine_schedule_with_warmup', 'description': 'Type of learning rate scheduler to use'}, {'param_name': 'max_seq_length', 'value': '1024', 'description': 'Sequence length to pad or truncate the dataset'}, {'param_name': 'num_iterations', 'value': 10, 'description': 'number of iterations to run'}, {'param_name': 'precision', 'value': 'bf16_all', 'description': 'Controls which operators will use bf16 v.s. fp32 precision'}, {'param_name': 'prompt_loss_weight', 'value': '0.1', 'description': 'Loss scale for prompt tokens'}, {'param_name': 'save_optimizer_state', 'value': 'true', 'description': 'Whether to save the optimizer state when saving a checkpoint'}, {'param_name': 'save_steps', 'value': '50', 'description': 'Period of saving the model checkpoints in number of training steps'}, {'param_name': 'subsample_eval', 'value': '0.01', 'description': 'Subsample for the evaluation dataset'}, {'param_name': 'subsample_eval_seed', 'value': '123', 'description': 'Random seed to use for the subsample evaluation'}, {'param_name': 'use_token_type_ids', 'value': 'true', 'description': 'Whether to use token_type_ids to compute loss'}, {'param_name': 'warmup_steps', 'value': '0', 'description': 'warmup steps to use in learning rate scheduler in optimizer'}, {'param_name': 'weight_decay', 'value': '0.1', 'description': 'weight decay rate to use in optimizer'}, {'param_name': 'selected_rdus', 'value': 1, 'description': 'Number of RDUs each instance of the model uses'}]
RDUs Needed        : 1
Parallel Instances : 1
Created Time       : 2023-05-11T21:41:50.455326+00:00
Updated Time       : 2023-06-20T20:10:14.800945+00:00

Run snapi job info --help to display additional usage and options.

View metrics using the CLI

The example below demonstrates the snapi job metrics command used to provide job performance metrics. You will need to specify the following:

  • The project that contains, or is assigned to, the job you wish to view metrics for.

  • The name of the job you wish to view metrics for.

If a Confusion Matrix can be generated for the job, the path to the generated matrix will be displayed in the output.

Example snapi job metrics command
$ snapi job metrics \
   --project <project-name> \
   --job <job-name>

TRAINING
INDEX        TRAIN_LOSS        TRAIN_STEPS
  0             0.0                0.0
  1             2.54               10.0

VALIDATION
INDEX        VAL_STEPS        VAL_LOSS        VAL_STEPS_PER_SECOND
  0             0.0             1.82                  0.81

Confusion Matrix generated here -> <path-to-generated-confusion-matrix-jpeg>

Run snapi job metrics --help to display additional usage and options.
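
Because metrics accumulate as the run progresses, it can be helpful to capture timestamped snapshots of this output for later comparison. The sketch below is illustrative; the snapshot file name is arbitrary, and tee simply writes the output to the file while still printing it to the terminal.

Example metrics snapshot (illustrative)
$ snapi job metrics \
   --project <project-name> \
   --job <job-name> | tee metrics_$(date +%Y%m%d-%H%M).txt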

View the checkpoints using the CLI

The snapi checkpoint list command allows you to view the list of generated checkpoints from your job. You will need to specify the following:

  • The project that contains, or is assigned to, the job whose generated checkpoints you wish to view.

  • The name of the job whose generated checkpoints you wish to view.

Example snapi checkpoint list command
$ snapi checkpoint list \
   --project <project-name> \
   --job <job-name>
CHECKPOINT NAME                                STEPS        LABELS        VALIDATION LOSS        VALIDATION ACCURACY        CREATED TIME                            DEPENDENT JOBS
63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a-10        10           None          1.82                   None                       2023-05-11T21:44:39.964983+00:00

Run snapi checkpoint list --help to display additional usage and options.
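
To pick a candidate checkpoint programmatically, you can parse the tabular output shown above. The awk sketch below is illustrative and assumes the column layout shown (checkpoint name in column 1, validation loss in column 4); it skips checkpoints without a validation loss and prints the checkpoint with the lowest loss.

Example lowest-validation-loss filter (illustrative)
$ snapi checkpoint list \
   --project <project-name> \
   --job <job-name> \
   | awk 'NR > 1 && $4 != "None" {print $4, $1}' | sort -n | head -1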

View detailed checkpoint information using the CLI

The example below demonstrates the snapi checkpoint info command used to provide detailed information about a checkpoint. You will need to specify the name of the checkpoint, which you can obtain by running the snapi checkpoint list command.

'USER_MODIFIABLE': True indicates that the parameter is adjustable.

Example snapi checkpoint info command
$ snapi checkpoint info \
   --checkpoint-name 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a-10

             Checkpoint Info
             ===============
Name                : 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a-10
Application Field   : None
Architecture        : None
Time Created        : 2023-05-11T21:44:39.964983+00:00
Validation Loss     : 1.82
Validation Acc      : None
ML App              : Generative Tuning 1.5B
Labels              : None
Job ID              : 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a
Steps               : 10
Dependent Jobs      :
Hyperparameters     : [   {   'CONSTRAINTS': {'values': ['6', '16']},
        'DATATYPE': 'int',
        'DESCRIPTION': 'Number of samples in a batch',
        'FIELD_NAME': 'batch_size',
        'MESSAGE': 'Value must be one of (6, 16)',
        'TASK_TYPE': ['compile', 'infer', 'serve', 'train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '16',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'values': ['true', 'false']},
        'DATATYPE': 'bool',
        'DESCRIPTION': 'whether or not to do final evaluation',
        'FIELD_NAME': 'do_eval',
        'MESSAGE': 'Value must be one of (True, False)',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 'true',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '1'},
        'DATATYPE': 'int',
        'DESCRIPTION': 'Period of evaluating the model in number of training '
                       'steps. This parameter is only effective when '
                       "evaluation_strategy is set to 'steps'.",
        'FIELD_NAME': 'eval_steps',
        'MESSAGE': 'Value must be greater than or equal to 1',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '50',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'values': ['no', 'steps', 'epoch']},
        'DATATYPE': 'str',
        'DESCRIPTION': 'Strategy to validate the model during training',
        'FIELD_NAME': 'evaluation_strategy',
        'MESSAGE': 'Value must be one of (no, steps, epoch)',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 'steps',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '0'},
        'DATATYPE': 'float',
        'DESCRIPTION': 'learning rate to use in optimizer',
        'FIELD_NAME': 'learning_rate',
        'MESSAGE': 'Value must be greater than or equal to 0.0',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '7.5e-06',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '1'},
        'DATATYPE': 'int',
        'DESCRIPTION': 'Period of logging training loss in number of training '
                       'steps',
        'FIELD_NAME': 'logging_steps',
        'MESSAGE': 'Value must be greater than or equal to 1',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '10',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {   'values': [   'polynomial_decay_schedule_with_warmup',
                                         'cosine_schedule_with_warmup',
                                         'fixed_lr']},
        'DATATYPE': 'str',
        'DESCRIPTION': 'Type of learning rate scheduler to use',
        'FIELD_NAME': 'lr_schedule',
        'MESSAGE': 'Value must be one of '
                   '(polynomial_decay_schedule_with_warmup, '
                   'cosine_schedule_with_warmup, fixed_lr)',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 'cosine_schedule_with_warmup',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'values': ['1024']},
        'DATATYPE': 'int',
        'DESCRIPTION': 'Sequence length to pad or truncate the dataset',
        'FIELD_NAME': 'max_seq_length',
        'MESSAGE': 'Value must be one of (1024)',
        'TASK_TYPE': ['compile', 'infer', 'serve', 'train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '1024',
                                                   'USER_MODIFIABLE': False}}},
    {   'CONSTRAINTS': {'ge': '1'},
        'DATATYPE': 'int',
        'DESCRIPTION': 'number of iterations to run',
        'FIELD_NAME': 'num_iterations',
        'MESSAGE': 'Value must be greater than or equal to 1',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 10,
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {   'values': [   'bf16_all', 'mixp_safe', 'mixp_fast',
                                         'fp32_optimizer']},
        'DATATYPE': 'str',
        'DESCRIPTION': 'Controls which operators will use bf16 v.s. fp32 '
                       'precision',
        'FIELD_NAME': 'precision',
        'MESSAGE': 'Value must be one of (bf16_all, mixp_safe, mixp_fast, '
                   'fp32_optimizer)',
        'TASK_TYPE': ['compile', 'train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 'bf16_all',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '0'},
        'DATATYPE': 'float',
        'DESCRIPTION': 'Loss scale for prompt tokens',
        'FIELD_NAME': 'prompt_loss_weight',
        'MESSAGE': 'Value must be greater than or equal to 0.0',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '0.1',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'values': ['true', 'false']},
        'DATATYPE': 'bool',
        'DESCRIPTION': 'Whether to save the optimizer state when saving a '
                       'checkpoint',
        'FIELD_NAME': 'save_optimizer_state',
        'MESSAGE': 'Value must be one of (True, False)',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 'true',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '1'},
        'DATATYPE': 'int',
        'DESCRIPTION': 'Period of saving the model checkpoints in number of '
                       'training steps',
        'FIELD_NAME': 'save_steps',
        'MESSAGE': 'Value must be greater than or equal to 1',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '50',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '0'},
        'DATATYPE': 'float',
        'DESCRIPTION': 'Subsample for the evaluation dataset',
        'FIELD_NAME': 'subsample_eval',
        'MESSAGE': 'Value must be greater than or equal to 0.0',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '0.01',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '1'},
        'DATATYPE': 'int',
        'DESCRIPTION': 'Random seed to use for the subsample evaluation',
        'FIELD_NAME': 'subsample_eval_seed',
        'MESSAGE': 'Value must be greater than or equal to 1',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '123',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'values': ['true', 'false']},
        'DATATYPE': 'bool',
        'DESCRIPTION': 'Whether to use token_type_ids to compute loss',
        'FIELD_NAME': 'use_token_type_ids',
        'MESSAGE': 'Value must be one of (True, False)',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': 'true',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '0'},
        'DATATYPE': 'int',
        'DESCRIPTION': 'warmup steps to use in learning rate scheduler in '
                       'optimizer',
        'FIELD_NAME': 'warmup_steps',
        'MESSAGE': 'Value must be greater than or equal to 0',
        'TASK_TYPE': ['train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '0',
                                                   'USER_MODIFIABLE': True}}},
    {   'CONSTRAINTS': {'ge': '0'},
        'DATATYPE': 'float',
        'DESCRIPTION': 'weight decay rate to use in optimizer',
        'FIELD_NAME': 'weight_decay',
        'MESSAGE': 'Value must be greater than or equal to 0.0',
        'TASK_TYPE': ['infer', 'serve', 'train'],
        'TYPE_SPECIFIC_SETTINGS': {   'train': {   'DEFAULT': '0.1',
                                                   'USER_MODIFIABLE': True}}}]
Params              : {   'invalidates_checkpoint': {'max_seq_length': 1024},
    'modifiable': {   'adam_beta1': 0.9,
                      'adam_beta2': 0.999,
                      'adam_epsilon': 1e-08,
                      'batch_size': 16,
                      'doc_stride': 128,
                      'learning_rate': 1.5e-05,
                      'max_answer_length': 30,
                      'max_query_length': 64,
                      'max_source_length': 512,
                      'max_target_length': 114,
                      'n_best_size': 20,
                      'num_iterations': 7500,
                      'warmup_steps': 0,
                      'weight_decay': 0.1}}
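
The hyperparameter entries above are verbose. To scan just the hyperparameter names from the shell, you can filter the output; the one-liner below is illustrative and relies on the 'FIELD_NAME' keys shown in the format above.

Example hyperparameter name filter (illustrative)
$ snapi checkpoint info \
   --checkpoint-name 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a-10 | grep FIELD_NAME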

Create training jobs from checkpoints

Creating a new training job directly from a checkpoint enables faster development and experimentation. New training jobs can be created from existing checkpoints using the GUI or the CLI. Follow the instructions described in the corresponding section to learn how.

Create a training job from a checkpoint using the GUI

When creating a new training job directly from a checkpoint using the GUI, the ML App and Select model fields are auto-populated based on the original training job’s selections.

Jobs created from checkpoints using the GUI will always start at step 0. To start the job at the step of the checkpoint, please use the CLI procedure and include the --load-state option.

Follow the steps below to create a new training job from a checkpoint.

  1. Click the three dots in the Actions column of the checkpoint.

  2. Select Create new job from the drop-down. The Create a new job from checkpoint window will open.

    Checkpoints drop-down
    Figure 10. Checkpoints drop-down
  3. Enter a name for the job into the Job name field.

  4. Choose a dataset from the Select dataset drop-down, or use the original job’s dataset.

  5. Set the hyperparameters to govern your training job or use the default values. Expand the Hyperparameters & settings pane by clicking the blue double arrows to set hyperparameters and adjust settings.

  6. Click Run job to submit the new training job to be created.

    Create new training job from a checkpoint
    Figure 11. Create new training job from a checkpoint
    1. If the required amount of storage space is not available to create the job, the Insufficient storage message will appear, describing the Available space and the Required space to create the job. You will need to free up storage space or contact your administrator. Please choose one of the following options.

      1. Click Cancel to stop the job from being created. Please free up storage space and then restart the Create a training job from a checkpoint using the GUI process.

      2. Click Proceed anyway to submit the job. Please free up storage space, otherwise the job will fail to be created and will not train.

        A minimum of 10 minutes is required after sufficient storage space has been cleared before the job creation will successfully start.

        Insufficient storage message
        Figure 12. Example insufficient storage message for creating a job from a checkpoint
  7. The Details & Hyperparameters panel, in the new job’s details page, provides information associated with the new job. Click More to expand the panel and view additional information. Click Less to collapse the panel and hide the additional information.

    1. Job submitted displays the date and time the job was created.

    2. Source checkpoint indicates the checkpoint used to create the job.

    3. Source job identifies the original job used to train the source checkpoint. Click the source job name to navigate to the source job’s detail page.

      Train job from checkpoint details
      Figure 13. Train job from checkpoint details

Create a training job from a checkpoint using the CLI

To create a new training job from a checkpoint, you will first need to identify the checkpoint you wish to use by running the snapi checkpoint list command. You then use the snapi job create command to create the new job from the identified checkpoint. You will need to specify the following:

  • A project to assign the job to. Create a new project or use the project from the originating job.

  • A name for your new job.

  • Use train as the job type. This designates the job to be a training job.

  • The name of the identified checkpoint you want to start the training job from, used as the model-checkpoint input.

  • The dataset you wish to use for the new training job.

  • To start the new training job from the step of the identified checkpoint, include the --load-state option, which loads the entire state of the checkpoint.

    • If the --load-state option is not included, the training job will start at step 0.

Example snapi job create command from a selected checkpoint
$ snapi job create \
   --project <project-name> \
   --job <job-name> \
   --type train \
   --model-checkpoint 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a-10 \
   --dataset <dataset-name> \
   --load-state
Successfully created job id: 4682335f-469a-4409-92df-66d92466cc69.

Run snapi job create --help to display additional usage and options.

Save/add checkpoints to the Model Hub

Once you’ve identified a checkpoint to use for inference or further fine-tuning, follow the steps below to save a checkpoint to the Model Hub using the GUI, or add a checkpoint to the model list using the CLI.

Save a checkpoint to the Model Hub using the GUI

Follow the steps below to create a new model card and save it to the Model Hub.

  1. From the Checkpoints table, click the three dots in the Actions column of the checkpoint you wish to save.

  2. Select Save to Model Hub from the drop-down. The Add to Model Hub box will open.

    Checkpoints drop-down
    Figure 14. Checkpoints drop-down
  3. Enter a name in the Model name field. This name will be the new model card name.

  4. From the Type drop-down, choose the model type you wish to create.

  5. The Share settings drop-down provides options for which tenants to share your model with.

    1. Share with <current-tenant> allows the model to be shared with the current tenant you are using, identified by its name in the drop-down.

    2. Share with all tenants allows the model to be shared across all tenants.

    3. Model will be shared with all users in <current-tenant> identifies that the model will be shared with other users in the tenant you are using.

      If the Model will be shared with all users in <current-tenant> option is displayed, the Share with <current-tenant> and Share with all tenants options described above will not be available. Share with all tenants is an optional feature of SambaStudio. Please contact your administrator or SambaNova representative for more information.

  6. Click Add to Model Hub to create the model card and save the checkpoint.

    Add to model hub
    Figure 15. Add to model hub
    1. If the required amount of storage space is not available to save the checkpoint, the Insufficient storage message (Figure 16) will appear, describing the Available space and the Required space to save the checkpoint. You will need to free up storage space or contact your administrator. Please choose one of the following options.

      1. Click Cancel to stop saving the checkpoint. Please free up storage space and then restart the Save a checkpoint to the Model Hub using the GUI process.

      2. Click Proceed anyway to save the checkpoint. Please free up storage space, otherwise saving the checkpoint to the Model Hub will fail.

        A minimum of 10 minutes is required after sufficient storage space has been cleared before the checkpoint will start successfully saving to the Model Hub.

        Insufficient storage message
        Figure 16. Example insufficient storage message for saving a checkpoint

Add a checkpoint to the model list using the CLI

To add a checkpoint to the model list, you will first need to identify the checkpoint you wish to add by running the snapi checkpoint list command. You then use the snapi model add command to add the identified checkpoint to the model list. You will need to specify the following:

  • The project that contains, or is assigned to, the job and checkpoint you wish to add to the model list.

  • The name of the job that contains the checkpoint you wish to add to the model list.

  • The name of the identified checkpoint you want to add to the model list, used as the model-checkpoint input.

  • Enter a new name that will appear in the model list for the model-checkpoint-name input.

  • Provide the checkpoint type as either finetuned or pretrained.

Example snapi model add command
$ snapi model add \
   --project <project-of-checkpoint> \
   --job <job-of-checkpoint> \
   --model-checkpoint 63cc9ac3-2f1e-444e-9bfe-02a2ba5a381a-10 \
   --model-checkpoint-name <new-name-for-model-list> \
   --checkpoint-type pretrained
Successfully added <new-name-for-model-list>

Run snapi model add --help to display additional usage and options.
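
To confirm that the checkpoint now appears in the model list, you can list your models and filter for the new name. This sketch assumes your snapi version provides a snapi model list command; run snapi model --help to confirm before relying on it.

Example model list check (illustrative)
$ snapi model list | grep <new-name-for-model-list>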

View and download logs

Job logs can help you track progress, identify errors, and determine the cause of potential errors. You can view and download logs by using the GUI or CLI.

View and download logs using the GUI

The Logs section of the GUI allows you to preview and download logs of your training session.

Logs can be visible in the platform earlier than other data, such as metrics, checkpoints, and job progress.

  1. From the Preview drop-down, select the log file you wish to preview.

    1. The Preview window displays the latest 50 lines of the log.

    2. To view more than 50 lines of the log, use the Download all feature to download the log file.

  2. Click Download all to download a compressed file of your logs. The file will be downloaded to the location configured by your browser.

Logs
Figure 17. Logs

View and download logs using the CLI

Similar to viewing logs using the GUI, you can use the SambaNova API (snapi) to preview and download logs of your training session.

View the job log file names

The example below demonstrates the snapi job list-logs command. Use this command to view the job log file names of your training job. This is similar to using the Preview drop-down menu in the GUI to view and select your job log file names. You will need to specify the following:

  • The project that contains, or is assigned to, the job whose log file names you wish to view.

  • The name of the job whose log file names you wish to view.

Example snapi job list-logs command
$ snapi job list-logs \
   --project <project-name> \
   --job <job-name>
train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-runner.log
train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-model.log

Run snapi job list-logs --help to display additional usage and options.

Preview a log file

After you have viewed the log file names for your training job, you can use the snapi job preview-log command to preview the logs corresponding to a selected log file. The example below demonstrates the command. You will need to specify the following:

  • The project that contains, or is assigned to, the job whose log file you wish to preview.

  • The name of the job whose log file you wish to preview.

  • The name of the job log file whose logs you wish to preview. This file name is returned by running the snapi job list-logs command, which is described above.

Example snapi job preview-log command
$ snapi job preview-log \
   --project <project-name> \
   --job <job-name> \
   --file train-0fb0568c-ca8e-4771-b7cf-e6ef156d1347-1-ncc9n-runner.log
2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Runner starting...

2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Runner successfully started

2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Received new train request

2023-08-10 20:28:46  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Connecting to modelbox at localhost:50061

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Running training

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Staging dataset

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  initializing metrics for modelbox:0

2023-08-10 20:28:54  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  initializing checkpoint path for modelbox:0

2023-08-10 20:31:35  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Preparing training for modelbox:0

2023-08-10 20:31:35  -  INFO  -  20be599f-9ea7-44ea-9dc5-b97294d97529  -  Running training for modelbox

Run snapi job preview-log --help to display additional usage and options.
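
To preview every log file for a job in one pass, you can combine the two commands from this section. The loop below is a minimal sketch, assuming snapi job list-logs prints one file name per line as shown above.

Example preview of all log files (illustrative)
$ for f in $(snapi job list-logs --project <project-name> --job <job-name>); do
    echo "== $f =="
    snapi job preview-log --project <project-name> --job <job-name> --file "$f"
  done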

Download the logs

Use the snapi job download-logs command to download a compressed file of your training job’s logs. The example below demonstrates the command. You will need to provide the following:

  • The project that contains, or is assigned to, the job whose compressed log file you wish to download.

  • The name of the job whose compressed log file you wish to download.

Example snapi job download-logs command
$ snapi job download-logs \
   --project <project-name> \
   --job <job-name>
Successfully Downloaded: <job-name> logs

The default destination for the compressed file download is the current directory. To specify a destination directory, use the --dest option. Run snapi job download-logs --help to display additional usage and options.
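
For example, a download that places the compressed file into a specific directory using the --dest option might look like the following.

Example snapi job download-logs command with --dest
$ snapi job download-logs \
   --project <project-name> \
   --job <job-name> \
   --dest <destination-directory>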