Generative data preparation

The generative data preparation package prepares data for training SambaNova’s generative NLP text models. It converts input text files into tokenized sequences packed to a fixed sequence length, and the resulting output directory can be used directly for training. The package supports several styles of packing text of any length into tokenized sequences, compressed HDF5 output files, efficient multiprocessing, shuffling of datasets of any size, splitting data into train/dev/test sets, and specifying which tokens are attended to during training.
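To make the idea of packing concrete, the following is a minimal sketch of one possible packing style: greedy packing with padding. It is not the package's own implementation or API. The function name `pack_sequences`, the token IDs `PAD_ID` and `EOS_ID`, and the truncation behavior are all assumptions chosen for illustration. The package's actual packing styles and options are documented in its repo.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token ID (assumption, not from the package)
EOS_ID = 1  # hypothetical end-of-text token ID (assumption, not from the package)


def pack_sequences(tokenized_docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Greedily pack tokenized documents into rows of exactly seq_len tokens.

    Each document gets an end-of-text token appended. A document that does
    not fit in the remaining space of the current row starts a new row, and
    the unused tail of the previous row is filled with padding.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    for doc in tokenized_docs:
        tokens = doc + [EOS_ID]
        if len(tokens) > seq_len:
            # Illustrative choice: truncate overlong documents, keeping the EOS.
            tokens = tokens[: seq_len - 1] + [EOS_ID]
        if len(current) + len(tokens) > seq_len:
            # Close the current row by padding it to the fixed length.
            rows.append(current + [PAD_ID] * (seq_len - len(current)))
            current = []
        current.extend(tokens)
    if current:
        rows.append(current + [PAD_ID] * (seq_len - len(current)))
    return rows


if __name__ == "__main__":
    docs = [[5, 6], [7, 8, 9], [10]]
    for row in pack_sequences(docs, seq_len=6):
        print(row)
    # [5, 6, 1, 0, 0, 0]
    # [7, 8, 9, 1, 10, 1]
```

In this sketch every output row has the same length, which is what allows the packed data to be stored as a fixed-shape array. Truncating overlong documents is only one option. Another is to split such documents across several rows.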

Accessing the package

To access the generative data preparation package and its documentation, visit the Generative data preparation repo on the SambaNova public GitHub:
https://github.com/sambanova/generative_data_prep

Figure 1. Generative data preparation repo on SambaNova public GitHub