Dataset Hub

SambaStudio provides several commonly used datasets to train your models. Additionally, you can add your own datasets and view information for the available datasets in the platform.

This document describes how to add, delete, and view datasets using the SambaStudio Dataset Hub GUI and the SambaNova API (snapi) CLI.

All paths used in SambaStudio are relative to the storage root directory <NFS_root>. Paths outside the storage root cannot be used, as SambaStudio does not have access to those directories.
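For example, a dataset stored at <NFS_root>/datasets/my_data would be referenced within SambaStudio as datasets/my_data.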

Add a dataset using the GUI

The Dataset Hub provides an interface for managing datasets by displaying a detailed list of datasets. The My datasets, Shared datasets, and SambaNova checkboxes filter the dataset list by their respective group. The ML App drop-down filters the dataset list by the corresponding ML App.

There are three options for adding datasets to the platform using the GUI. For all three options, you will first need to:

  1. Click Datasets from the left menu to navigate to the Dataset Hub window.

  2. Click the Add dataset button.

    Dataset hub
    Figure 1. Dataset Hub
    The Add a dataset window will open.

  3. In the Dataset name field, input a name for your dataset.

    Including the sequence length used to prepare your dataset (the max_seq_length value) in the dataset name will help you select the appropriate dataset when creating a training job.

  4. From the Job type dropdown, select whether the dataset is to be used for Train/Evaluation or Batch inference.

  5. The Share settings drop-down provides options for choosing which tenants to share your dataset with.

    1. Share with <current-tenant> allows the dataset to be shared with the current tenant you are using, identified by its name in the drop-down.

    2. Share with all tenants allows the dataset to be shared across all tenants.

    3. Dataset will be shared with all users in <current-tenant> indicates that the dataset will be shared with other users in the tenant you are using.

      If the Dataset will be shared with all users in <current-tenant> option is displayed, the Share with <current-tenant> and Share with all tenants options described above will not be available. Share with all tenants is an optional feature of SambaStudio. Please contact your administrator or SambaNova representative for more information.

  6. From the Applicable ML Apps drop-down, select the ML App(s) with which you wish the dataset to be associated. Multiple ML Apps can be selected.

    Be sure to select ML Apps that correspond with your dataset; the platform will not warn you if a selected ML App does not correspond with your dataset.

    Add dataset
    Figure 2. Add a dataset

Option 1: Upload from a local machine

Follow the steps below to upload a dataset from a local directory on your machine.

The recommended maximum dataset size for uploading from a local machine is 5 gigabytes (GB).
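To confirm that a local dataset directory is within this recommended limit before uploading, you can check its size from a terminal. A minimal sketch; the directory path is a placeholder:

$ du -sh ~/datasets/my_dataset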

  1. Select Local storage from the Source drop-down.

  2. Select the Upload new files radio button.

  3. Navigate to the folder on your local machine by using Choose directory.

  4. Click the Add dataset button to complete the operation.

New files from local
Figure 3. Upload new files from a local machine

Option 2: Upload from NFS using existing files

Follow the steps below to upload a dataset from NFS.

  1. Select Local storage from the Source drop-down.

  2. Select the Use existing files radio button.

  3. In the Dataset path field, provide the path to the dataset relative to the storage root directory <NFS_root>.

Existing files NFS
Figure 4. Upload from NFS using existing files

Option 3: Import from AWS S3

Follow the steps below to import your dataset from AWS S3.

  • The dataset is imported from AWS S3 only once during dataset creation.

  • AWS S3 credentials are not stored.

  1. Select AWS from the Source drop-down.

  2. In the Bucket field, input the name of your S3 bucket.

  3. In the Folder field, input the relative path to the dataset within the S3 bucket. This folder should include the required dataset files for the task (for example, the labels, training, and validation files).

  4. In the Access key ID field, input the unique ID provided by AWS IAM to manage access.

  5. In the Secret access key field, enter your secret access key. This authenticates access for the provided Access key ID.

  6. In the Region field, enter the AWS Region in which your S3 bucket resides.

    An Access key, Secret access key, and user access permissions are required for AWS S3 import. An optional way to verify these values is sketched after the figure below.

    Import AWS S3
    Figure 5. Import from AWS S3
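Before importing, you can optionally verify that the bucket, folder path, and credentials are correct using the AWS CLI. A minimal sketch, assuming the AWS CLI is installed and configured with the same access key pair; the bucket, folder, and region values are placeholders:

$ aws s3 ls s3://my-bucket/datasets/my_dataset/ --region us-west-2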

Add a dataset using the CLI

The example commands below demonstrate how to add a dataset using the SambaNova API (snapi) CLI.

Upload the dataset

The dataset will need to be uploaded to the NFS server using the following path. Ensure the permissions of the dataset directory are set to 755. Contact your administrator for more information.

<NFS_root>/daasdir/<user-directory>/<datasetdir>/

An example implementation of the above path is /home/daasdir/user1/GPT_DC/, where:

  • home = <NFS_root>

  • user1 = <user-directory>

  • GPT_DC = <datasetdir>
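As a minimal sketch, staging a dataset at the example path above might look as follows when run on the NFS server; the source directory is a placeholder:

$ mkdir -p /home/daasdir/user1/GPT_DC
$ cp -r /path/to/prepared_dataset/* /home/daasdir/user1/GPT_DC/
$ chmod -R 755 /home/daasdir/user1/GPT_DC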

Dataset add command

The example below demonstrates how to add a dataset to the Dataset Hub using the snapi add dataset command. The following can be specified:

  • The name of the dataset.

  • The path to the dataset.

  • The App (ML App) with which the dataset will be associated. Multiple Apps can be specified. Run snapi app list to display a list of the available apps.

  • The field of application in which the dataset will be used, such as language.

  • The language of an NLP dataset, such as english.

  • An optional metadata filename. The metadata file must be a .json or .yaml file.

  • The type of job for the dataset, specified as either train (train/evaluation) or batch_predict.

Example snapi dataset add command
$ snapi dataset add \
    --dataset-name <your-dataset-name> \
    --dataset-path 'user1/<directory-name>' \
    --apps '<name-of-ML-App>' \
    --application_field <type of app> \
    --language <specific language> \
    --metadata-file <file-name>.yaml \
    --job_type train

When running the snapi dataset add command for all GPT language models, ensure that --dataset-path points to the output directory from the generative_data_prep command, that is, the file path you passed as the argument for the --output_path flag during data preparation, as sketched below. See the Generative data preparation README for more information.
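The sketch below illustrates this correspondence. The paths, dataset name, and data preparation flags are placeholders (consult the Generative data preparation README for the authoritative invocation), and some snapi dataset add options are omitted for brevity:

$ python3 -m generative_data_prep pipeline \
    --input_file_path /home/daasdir/user1/raw_data.jsonl \
    --output_path /home/daasdir/user1/GPT_DC

$ snapi dataset add \
    --dataset-name gpt_dataset_2048 \
    --dataset-path 'user1/GPT_DC' \
    --apps '<name-of-ML-App>' \
    --job_type train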

Validation, train, and test datasets

The example below demonstrates a metadata file used for validation, train, and test datasets.

Example metadata file for validation, train, and test datasets
$ cat dc_data.yaml

validation_filepath: validation.csv
train_filepath: train.csv
labels_file: labels_file.txt
test_filepath: test.csv
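Assuming the files referenced in the metadata file live alongside it in the dataset directory (an assumption; confirm the expected layout with your administrator), the dataset directory would contain:

$ ls <NFS_root>/daasdir/user1/<datasetdir>/
dc_data.yaml  labels_file.txt  test.csv  train.csv  validation.csv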

Delete a dataset using the GUI

Follow the steps below to delete a dataset using the SambaStudio GUI.

  1. From the Dataset Hub window, click the three dots under the Actions column for the dataset you wish to delete.

    Delete dataset
    Figure 6. Delete dataset actions menu
    The You are about to delete a dataset box will open, displaying a warning message that you are about to delete a dataset.

  2. Click Yes to confirm that you want to delete the dataset.

    Delete dataset box
    Figure 7. Delete dataset box

View a list of datasets using the CLI

Run the snapi dataset list command to view the list of datasets by name. The example below shows the GPT_13B_Training_Dataset and GPT_1.5B_Training_Dataset datasets and their associated attributes.

Example snapi dataset list command
$ snapi dataset list

GPT_13B_Training_Dataset
========================

PATH              : common/datasets/squad_clm/ggt_2048/hdf5
APPS              : ['57f6a3c8-1f04-488a-bb39-3cfc5b4a5d7a']
USER              : None
STATUS            : Available
TIME CREATED      : 2023-01-16T00:00:00


GPT_1.5B_Training_Dataset
=========================

PATH              : common/datasets/ggt_sentiment_analysis/hdf5_single_avoid_overflow/hdf5
APPS              : ['e681c226-86be-40b2-9380-d2de11b19842']
USER              : None
STATUS            : Available
TIME CREATED      : 2021-08-26T00:00:00

View information for a dataset using the CLI

Run the snapi dataset info command to view detailed information for a specific dataset, including its Dataset ID. The example below shows detailed information for the GPT_13B_Training_Dataset dataset.

Example snapi dataset info command
$ snapi dataset info \
    --dataset GPT_13B_Training_Dataset

             Dataset Info
             ============
Dataset ID           : 894dd158-9552-11ed-a1eb-0242ac120002
Name                 : GPT_13B_Training_Dataset
Path                 : common/datasets/squad_clm/ggt_2048/hdf5
Apps                 : ['57f6a3c8-1f04-488a-bb39-3cfc5b4a5d7a']
Created Time         : 2023-01-16T00:00:00
Metadata             : None
Dataset Source       : SambaStudio
Status               : Available
Job Type             : ['train']
Field of Application : language
Description          : A SambaNova curated collection of datasets that cover Q&A and structured data
File Size (MB)       : 1.0

Delete a dataset using the CLI

The example below demonstrates how to delete a dataset using the snapi dataset remove command. You will need to specify the dataset name or dataset ID.

Example snapi dataset remove command
$ snapi dataset remove \
    --dataset <your-dataset-name> OR <your-dataset-id>
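
For example, to remove the GPT_13B_Training_Dataset shown above using the Dataset ID returned by snapi dataset info:

$ snapi dataset remove \
    --dataset 894dd158-9552-11ed-a1eb-0242ac120002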