Create a dataset object from sequences#
This section describes the use of create_dataset_bio.py. We generate a HuggingFace dataset object given a fasta file containing sequences, a fasta file containing control sequences, and a pretrained tokeniser from tokeniser.py. The dataset can then enter the genomicBERT pipeline.
Source data#
Any fasta file can be used, with each fasta file representing a sequence collection of one category. Sample input data files will be available in data/. If needed, control data can be generated with generate_synthetic.py. A tokeniser can be generated with tokenise.py.
Results#
Note

Entry points are available if this is installed using the automated conda method. You can then use the command line argument directly, for example: create_dataset_bio. If not, you will need to use the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
Running the code as below:

python create_dataset_bio.py <INFILE_SEQS_1> <INFILE_SEQS_2> <TOKENISER_PATH> -c CHUNK -o OUTFILE_DIR

HuggingFace-like dataset files will be written to disk. These can be loaded directly into a “conventional” deep learning pipeline.
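The written data is partitioned into train/test/validation subsets according to the --split_train, --split_test and --split_val options shown in the usage below. Conceptually, the split behaves like this sketch (a hypothetical helper for illustration, not the script's actual code):

```python
import random

def split_dataset(samples, train=0.90, test=0.05, val=0.05, shuffle=True, seed=0):
    """Partition samples into train/test/validation subsets by proportion."""
    assert abs(train + test + val - 1.0) < 1e-9, "proportions must sum to 1"
    items = list(samples)
    if shuffle:
        # deterministic shuffle for reproducibility in this sketch
        random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train)
    n_test = int(n * test)
    return {
        "train": items[:n_train],
        "test": items[n_train:n_train + n_test],
        "valid": items[n_train + n_test:],
    }
```

With the defaults, 100 samples would be partitioned 90/5/5; passing --no_shuffle corresponds to `shuffle=False`, keeping the original sequence order.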
Notes#
It is possible to split the dataset into chunks of n-length. This is useful when the length of individual sequences becomes too large to fit in memory. A sequence length of 256-512 units can effectively fit on most modern GPUs. Sequence chunks are treated as independent samples of the same class, and no merging of weights is performed in this implementation. Note that the create_dataset_bio.py and create_dataset_nlp.py workflows are structured differently to account for the differences between conventional biological and human language corpora, but the processes are conceptually identical.
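The chunking described above can be sketched as follows (`chunk_sequence` is a hypothetical helper for illustration, not the script's function):

```python
def chunk_sequence(seq, n):
    """Split a sequence into non-overlapping blocks of length n.

    Each block is treated as an independent sample of the same class;
    the final block may be shorter than n.
    """
    return [seq[i:i + n] for i in range(0, len(seq), n)]
```

For example, a 10-unit sequence chunked with `-c 4` yields two full blocks and one 2-unit remainder, all labelled with the parent sequence's class.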
More information on the HuggingFace 🤗 Dataset object is available online.
Usage#
python create_dataset_bio.py -h
usage: create_dataset_bio.py [-h] [-o OUTFILE_DIR] [-s SPECIAL_TOKENS [SPECIAL_TOKENS ...]] [-c CHUNK]
[--split_train SPLIT_TRAIN] [--split_test SPLIT_TEST]
[--split_val SPLIT_VAL] [--no_reverse_complement] [--no_shuffle]
infile_path control_dist tokeniser_path
Take control and test fasta files, tokeniser and convert to HuggingFace🤗 dataset object. Fasta files
can be .gz. Sequences are reverse complemented by default.
positional arguments:
infile_path path to fasta/gz file
control_dist supply control seqs
tokeniser_path load tokeniser file
optional arguments:
-h, --help show this help message and exit
-o OUTFILE_DIR, --outfile_dir OUTFILE_DIR
write 🤗 dataset to directory as [ csv | json | parquet | dir/ ] (DEFAULT:
"hf_out/")
-s SPECIAL_TOKENS [SPECIAL_TOKENS ...], --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
assign special tokens, eg space and pad tokens (DEFAULT: ["<s>", "</s>",
"<unk>", "<pad>", "<mask>"])
-c CHUNK, --chunk CHUNK
split seqs into n-length blocks (DEFAULT: None)
--split_train SPLIT_TRAIN
proportion of training data (DEFAULT: 0.90)
--split_test SPLIT_TEST
proportion of testing data (DEFAULT: 0.05)
--split_val SPLIT_VAL
proportion of validation data (DEFAULT: 0.05)
--no_reverse_complement
turn off reverse complement (DEFAULT: ON)
--no_shuffle turn off shuffle for data split (DEFAULT: ON)
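As noted in the help text, sequences are reverse complemented by default (disable with --no_reverse_complement). The operation can be sketched as (a hypothetical helper for illustration, not the script's implementation):

```python
# translation table mapping each base to its complement, preserving case
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]
```

Including the reverse complement effectively doubles the samples per input sequence, since both strands are treated as members of the same class.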