Dataset Hub
SambaStudio provides several commonly used datasets to train your models. Additionally, you can add your own datasets and view information for the available datasets in the platform.
This document describes how to add, delete, and view datasets using the SambaStudio Dataset Hub GUI and the SambaNova API (snapi) CLI.
All paths used in SambaStudio are relative to the storage root directory.
Add a dataset using the GUI
The Dataset Hub provides an interface for managing datasets by displaying a detailed list of datasets. The My datasets, Shared datasets, and SambaNova checkboxes filter the dataset list by their respective group. The ML App drop-down filters the dataset list by the corresponding ML App.
There are three options for adding datasets to the platform using the GUI. For all three options, you will first need to:
- Click Datasets from the left menu to navigate to the Dataset Hub window.
- Click the Add dataset button.
Figure 1. Dataset Hub
- The Add a dataset window will open.
- In the Dataset name field, input a name for your dataset.
- From the Job type drop-down, select whether the dataset is to be used for Train/Evaluation or Batch inference.
- The Share settings drop-down provides options for which tenant to share your dataset with:
  - Share with <current-tenant> allows the dataset to be shared with the current tenant you are using, identified by its name in the drop-down.
  - Share with all tenants allows the dataset to be shared across all tenants.
  - Dataset will be shared with all users in <current-tenant> identifies that the dataset will be shared with other users in the tenant you are using.
If the Dataset will be shared with all users in <current-tenant> option is displayed, the Share with <current-tenant> and Share with all tenants options described above will not be available. Share with all tenants is an optional feature of SambaStudio. Contact your administrator or SambaNova representative for more information.
- From the Applicable ML Apps drop-down, select the ML App(s) that you wish the dataset to be associated with. Multiple ML Apps can be selected.
Be sure to select the appropriate ML Apps that correspond with your dataset, as the platform will not warn you if a selected ML App does not correspond with your dataset.
Figure 2. Add a dataset
Option 1: Upload from a local machine
Follow the steps below to upload a dataset from a local directory on your machine.
The recommended maximum dataset size for uploading from a local machine is 5 gigabytes (GB).
- Select Local storage from the Source drop-down.
- Select the Upload new files radio button.
- Navigate to the folder on your local machine by using Choose directory.
- Click the Add dataset button to complete the operation.

Option 2: Upload from NFS using existing files
Follow the steps below to upload a dataset from NFS.
- Select Local storage from the Source drop-down.
- Select the Use existing files radio button.
- In the Dataset path field, provide the path to the dataset relative to the storage root directory <NFS_root>.
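Because the Dataset path field expects a path relative to the storage root, it can help to derive that value from the dataset's absolute location on the NFS server. The sketch below shows one way to do this; the /nfs/storage_root mount point and the dataset location are hypothetical placeholders, not values from your environment:

```python
from pathlib import Path

# Hypothetical values -- substitute your actual NFS storage root
# and the absolute location of your dataset on the server.
storage_root = Path("/nfs/storage_root")
dataset_dir = Path("/nfs/storage_root/daasdir/user1/my_dataset")

# The Dataset path field expects the path relative to the storage root.
relative_path = dataset_dir.relative_to(storage_root)
print(relative_path)  # daasdir/user1/my_dataset
```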

Option 3: Import from AWS S3
Follow the steps below to import your dataset from AWS S3.
- Select AWS from the Source drop-down.
- In the Bucket field, input the name of your S3 bucket.
- In the Folder field, input the relative path to the dataset in the S3 bucket. This folder should include the required dataset files for the task (for example, the labels, training, and validation files).
- In the Access key ID field, input the unique ID provided by AWS IAM to manage access.
- In the Secret access key field, enter your secret access key. This allows authentication access for the provided Access key ID.
- In the Region field, enter the AWS Region that your S3 bucket resides in.
An Access key, Secret access key, and user access permissions are required for AWS S3 import.
Figure 5. Import from AWS S3
Add a dataset using the CLI
The example commands below demonstrate how to add a dataset using the SambaNova API (snapi) CLI.
Upload the dataset
The dataset will need to be uploaded to the NFS server using the following path. Ensure that the permissions of the dataset directory are set to 755. Contact your administrator for more information.
<NFS_root>/daasdir/<user-directory>/<datasetdir>/
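The 755 permission requirement can be applied and verified programmatically. A minimal sketch is shown below; it uses a temporary directory as a stand-in for the actual dataset directory on the NFS server:

```python
import os
import stat
import tempfile

# Hypothetical dataset directory; in practice this would be
# <NFS_root>/daasdir/<user-directory>/<datasetdir>/
dataset_dir = tempfile.mkdtemp(prefix="datasetdir_")

# Set rwxr-xr-x (755): owner read/write/execute, group/others read/execute.
os.chmod(dataset_dir, 0o755)

# Confirm the mode is exactly 755.
mode = stat.S_IMODE(os.stat(dataset_dir).st_mode)
print(oct(mode))  # 0o755
```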
Dataset add command
The example below demonstrates how to add a dataset to the Dataset Hub using the snapi dataset add command. The following can be specified:
- The name of the dataset.
- The path to the dataset.
- The App (ML App) the dataset will be associated with. Multiple Apps can be specified. Run snapi app list to display a list of the available apps.
- The field of application that the dataset will be used for, such as language.
- The language of an NLP dataset, such as english.
- A metadata filename. This optional metadata file must be a .json or .yaml file.
- The type of job for the dataset, specified as either train & evaluation or batch_predict.
$ snapi dataset add \
--dataset-name <your-dataset-name> \
--dataset-path 'user1/<directory-name>' \
--apps '<name-of-ML-App>' \
--application_field <type of app> \
--language <specific language> \
--metadata-file <file-name>.yaml \
--job_type train
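If you register datasets from a script, assembling the argument list programmatically avoids shell-quoting mistakes. The sketch below only builds the argv for the snapi dataset add command shown above; it does not invoke snapi, and all values (dataset name, path, App name) are hypothetical placeholders:

```python
def build_dataset_add_cmd(name, path, apps, application_field,
                          language, metadata_file, job_type):
    """Assemble the argv for `snapi dataset add` from keyword values."""
    return ["snapi", "dataset", "add",
            "--dataset-name", name,
            "--dataset-path", path,
            "--apps", apps,
            "--application_field", application_field,
            "--language", language,
            "--metadata-file", metadata_file,
            "--job_type", job_type]

# Hypothetical example values; pass the list to subprocess.run() to execute.
cmd = build_dataset_add_cmd("my-dataset", "user1/my_dataset",
                            "<name-of-ML-App>", "language",
                            "english", "dc_data.yaml", "train")
print(" ".join(cmd))
```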
Validation, train, and test datasets
The example below demonstrates a metadata file used for validation, train, and test datasets.
$ cat dc_data.yaml
validation_filepath: validation.csv
train_filepath: train.csv
labels_file: labels_file.txt
test_filepath: test.csv
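Since the metadata file may also be supplied as .json, the dc_data.yaml example above maps directly to JSON. The sketch below writes the equivalent .json file and reads it back; the dc_data.json filename is a hypothetical stand-in:

```python
import json

# Same fields as the dc_data.yaml example, expressed as JSON.
metadata = {
    "validation_filepath": "validation.csv",
    "train_filepath": "train.csv",
    "labels_file": "labels_file.txt",
    "test_filepath": "test.csv",
}

# Hypothetical filename for the JSON form of the metadata file.
with open("dc_data.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Round-trip check.
with open("dc_data.json") as f:
    loaded = json.load(f)
print(loaded["train_filepath"])  # train.csv
```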
Delete a dataset using the GUI
Follow the steps below to delete a dataset using the SambaStudio GUI.
- From the Dataset Hub window, click the three dots under the Actions column for the dataset you wish to delete.
Figure 6. Delete dataset actions menu
- The You are about to delete a dataset box will open, displaying a warning message informing you that you are about to delete a dataset.
- Click Yes to confirm that you want to delete the dataset.
Figure 7. Delete dataset box
View a list of datasets using the CLI
Run the snapi dataset list command to view the list of datasets by name. The example below shows the GPT_13B_Training_Dataset and GPT_1.5B_Training_Dataset datasets and their associated attributes.
$ snapi dataset list
GPT_13B_Training_Dataset
========================
PATH : common/datasets/squad_clm/ggt_2048/hdf5
APPS : ['57f6a3c8-1f04-488a-bb39-3cfc5b4a5d7a']
USER : None
STATUS : Available
TIME CREATED : 2023-01-16T00:00:00
GPT_1.5B_Training_Dataset
=========================
PATH : common/datasets/ggt_sentiment_analysis/hdf5_single_avoid_overflow/hdf5
APPS : ['e681c226-86be-40b2-9380-d2de11b19842']
USER : None
STATUS : Available
TIME CREATED : 2021-08-26T00:00:00
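If you need the dataset list in a script, the plain-text output above can be parsed into records. The sketch below is based only on the sample output shown here; the exact output format of snapi dataset list may vary, so treat the parsing rules as an assumption:

```python
def parse_dataset_list(text):
    """Split `snapi dataset list` output into {name: {field: value}} records."""
    datasets, current = {}, None
    for line in text.splitlines():
        if line.strip() and set(line.strip()) == {"="}:
            continue                      # "====" underline marks the previous line as a name
        if " : " in line:
            key, value = line.split(" : ", 1)
            datasets[current][key.strip()] = value.strip()
        elif line.strip():
            current = line.strip()        # a bare line starts a new dataset record
            datasets[current] = {}
    return datasets

# Sample taken from the output shown above.
sample = """GPT_13B_Training_Dataset
========================
PATH : common/datasets/squad_clm/ggt_2048/hdf5
APPS : ['57f6a3c8-1f04-488a-bb39-3cfc5b4a5d7a']
USER : None
STATUS : Available
TIME CREATED : 2023-01-16T00:00:00
"""
info = parse_dataset_list(sample)
print(info["GPT_13B_Training_Dataset"]["STATUS"])  # Available
```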
View information for a dataset using the CLI
Run the snapi dataset info command to view detailed information for a specific dataset, including its Dataset ID. The example below shows detailed information for the GPT_13B_Training_Dataset dataset.
$ snapi dataset info \
--dataset GPT_13B_Training_Dataset
Dataset Info
============
Dataset ID : 894dd158-9552-11ed-a1eb-0242ac120002
Name : GPT_13B_Training_Dataset
Path : common/datasets/squad_clm/ggt_2048/hdf5
Apps : ['57f6a3c8-1f04-488a-bb39-3cfc5b4a5d7a']
Created Time : 2023-01-16T00:00:00
Metadata : None
Dataset Source : SambaStudio
Status : Available
Job Type : ['train']
Field of Application : language
Description : A SambaNova curated collection of datasets that cover Q&A and structured data
File Size (MB) : 1.0
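Because snapi dataset remove accepts a dataset ID, you may want to extract the Dataset ID from the snapi dataset info output in a script. A minimal sketch over the sample output above (the line format is assumed to match that sample):

```python
import re

# Excerpt of the `snapi dataset info` output shown above.
sample_info = """Dataset Info
============
Dataset ID : 894dd158-9552-11ed-a1eb-0242ac120002
Name : GPT_13B_Training_Dataset
Status : Available
"""

# Look for the "Dataset ID : <uuid>" line in the info output.
match = re.search(r"Dataset ID\s*:\s*([0-9a-f-]+)", sample_info)
dataset_id = match.group(1) if match else None
print(dataset_id)  # 894dd158-9552-11ed-a1eb-0242ac120002
```

The extracted ID can then be passed to snapi dataset remove in place of the dataset name.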
Delete a dataset using the CLI
The example below demonstrates how to delete a dataset using the snapi dataset remove command. You will need to specify either the dataset name or the dataset ID.
$ snapi dataset remove \
--dataset <your-dataset-name> OR <your-dataset-id>