Create a dataset object from sequences#

This section explains the use of create_dataset_bio.py. It generates a HuggingFace dataset object from a fasta file containing sequences, a fasta file containing control sequences, and a pretrained tokeniser from tokenise.py. The resulting dataset can then enter the genomicBERT pipeline.

Source data#

Any fasta file can be used, with each fasta file representing a collection of sequences belonging to one category. Sample input data files will be available in data/. If needed, control data can be generated with generate_synthetic.py, and a tokeniser can be generated with tokenise.py.
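
As a rough illustration of how these inputs map onto a labelled dataset, the sketch below reads a test and a control fasta file and assigns one label per file. The file names, the use of Biopython for parsing, and the column names are assumptions made for this example only; they are not taken from create_dataset_bio.py.

from Bio import SeqIO
from datasets import Dataset

# Hypothetical input files; replace with your own test and control fasta files.
test_seqs = [str(record.seq) for record in SeqIO.parse("seqs.fasta", "fasta")]
ctrl_seqs = [str(record.seq) for record in SeqIO.parse("control.fasta", "fasta")]

# Test sequences are labelled 1 and control sequences 0 (illustrative column names).
dataset = Dataset.from_dict({
    "sequence": test_seqs + ctrl_seqs,
    "labels": [1] * len(test_seqs) + [0] * len(ctrl_seqs),
})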

Results#

Note

Entry points are available if the package was installed using the automated conda method. You can then call the command directly, for example: create_dataset_bio. Otherwise, you will need to run the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.

Run the code as shown below:

python create_dataset_bio.py <INFILE_SEQS_1> <INFILE_SEQS_2> <TOKENISER_PATH> -c CHUNK -o OUTFILE_DIR
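
For example, a hypothetical invocation (all file names below are placeholders) that splits sequences into 512-unit chunks and writes the dataset to hf_dataset/ would look like:

python create_dataset_bio.py seqs.fasta control.fasta tokeniser.json -c 512 -o hf_dataset/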

HuggingFace-like dataset files will be written to disk. These can be loaded directly into a “conventional” deep learning pipeline.
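
If the output directory was saved in the native Arrow format (the dir/ option), it may be reloaded with datasets.load_from_disk, as in the minimal sketch below, which assumes the default output path hf_out/. If csv, json or parquet files were written instead, datasets.load_dataset with the matching format and data_files argument can be used in the same way.

from datasets import load_from_disk

# Reload the dataset directory written by create_dataset_bio.py.
dataset = load_from_disk("hf_out/")
print(dataset)  # shows the available splits and their columns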

Notes#

It is possible to split the dataset into chunks of length n. This is useful when individual sequences become too long to fit in memory; a sequence length of 256-512 units fits comfortably on most modern GPUs. Sequence chunks are treated as independent samples of the same class, and no merging of weights is performed in this implementation. Note that the create_dataset_bio.py and create_dataset_nlp.py workflows are structured differently to account for the differences between conventional biological and human language corpora, but the processes are conceptually identical.
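
A minimal sketch of the chunking idea, independent of the script itself: each sequence is cut into consecutive fixed-length blocks, and every block inherits the label of its parent sequence.

def chunk_sequence(seq, n):
    """Split a sequence into consecutive blocks of length n (the last block may be shorter)."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

# A 1200-base sequence chunked at length 512 yields blocks of 512, 512 and 176 bases,
# each treated as an independent sample of the same class.
chunks = chunk_sequence("ACGT" * 300, 512)
labels = [1] * len(chunks)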

More information on the HuggingFace 🤗 Dataset object is available online.

Usage#

python create_dataset_bio.py -h
usage: create_dataset_bio.py [-h] [-o OUTFILE_DIR] [-s SPECIAL_TOKENS [SPECIAL_TOKENS ...]] [-c CHUNK]
                             [--split_train SPLIT_TRAIN] [--split_test SPLIT_TEST]
                             [--split_val SPLIT_VAL] [--no_reverse_complement] [--no_shuffle]
                             infile_path control_dist tokeniser_path

Take control and test fasta files, tokeniser and convert to HuggingFace🤗 dataset object. Fasta files
can be .gz. Sequences are reverse complemented by default.

positional arguments:
  infile_path           path to fasta/gz file
  control_dist          supply control seqs
  tokeniser_path        load tokeniser file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE_DIR, --outfile_dir OUTFILE_DIR
                        write 🤗 dataset to directory as [ csv | json | parquet | dir/ ] (DEFAULT:
                        "hf_out/")
  -s SPECIAL_TOKENS [SPECIAL_TOKENS ...], --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                        assign special tokens, eg space and pad tokens (DEFAULT: ["<s>", "</s>",
                        "<unk>", "<pad>", "<mask>"])
  -c CHUNK, --chunk CHUNK
                        split seqs into n-length blocks (DEFAULT: None)
  --split_train SPLIT_TRAIN
                        proportion of training data (DEFAULT: 0.90)
  --split_test SPLIT_TEST
                        proportion of testing data (DEFAULT: 0.05)
  --split_val SPLIT_VAL
                        proportion of validation data (DEFAULT: 0.05)
  --no_reverse_complement
                        turn off reverse complement (DEFAULT: ON)
  --no_shuffle          turn off shuffle for data split (DEFAULT: ON)
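
The split proportions default to 0.90/0.05/0.05 with shuffling enabled. As a sketch of what such a three-way split looks like with the datasets API (not necessarily how the script implements it), starting from a Dataset object named dataset:

# Carve off 10% for evaluation, then halve it into test and validation portions.
split = dataset.train_test_split(test_size=0.10, shuffle=True, seed=42)
holdout = split["test"].train_test_split(test_size=0.50, shuffle=True, seed=42)

train_ds = split["train"]    # ~0.90 of the data
test_ds = holdout["train"]   # ~0.05
valid_ds = holdout["test"]   # ~0.05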