Generate synthetic sequences for use in classification#
This page explains how to use generate_synthetic.py, which generates synthetic sequences from a given fasta file.
Source data#
Any fasta file can be used.
Results#
Note
Entry points are available if the package is installed using the automated conda method. You can then call the command directly, for example: create_dataset_bio. If not, you will need to run the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
Run the script as shown below:
python generate_synthetic.py \
path/to/infile.fa \
-o path/to/outfile.fa
You will obtain a fasta file containing synthetic sequences generated according to your settings. By default, the dinucleotide frequency of each input sequence is calculated and used to generate a corresponding null sequence. Reverse complementing can be turned on if needed (--do_reverse_complement). The output can serve as the control set in two-step classification when you do not have one.
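To make the default behaviour more concrete, here is a minimal sketch of one way a frequency-based null sequence could be built. It is an assumption about the general approach, not the code in generate_synthetic.py: the function name block_frequency_null, the non-overlapping block split, and the trimming to the input length are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def block_frequency_null(seq, block_size=2, rng=None):
    """Return a synthetic sequence the same length as seq.

    Hypothetical sketch: split seq into non-overlapping blocks, count
    how often each block occurs, then sample blocks in proportion to
    those counts. The real script may count or sample differently.
    """
    rng = rng or random.Random()
    # Non-overlapping blocks; a trailing partial block is ignored.
    blocks = [
        seq[i:i + block_size]
        for i in range(0, len(seq) - block_size + 1, block_size)
    ]
    if not blocks:
        # Sequence shorter than one block: nothing to estimate from.
        return seq
    counts = Counter(blocks)
    kmers = list(counts)
    weights = [counts[k] for k in kmers]
    # Ceiling division so the joined blocks cover the full length.
    n_blocks = -(-len(seq) // block_size)
    sampled = rng.choices(kmers, weights=weights, k=n_blocks)
    return "".join(sampled)[:len(seq)]


if __name__ == "__main__":
    # Seeded generator so repeated runs give the same null sequence.
    print(block_frequency_null("ACGTACGTTTGACA", rng=random.Random(0)))
```

With block_size=2 this samples dinucleotides by their observed frequency, so the null sequence keeps the local composition of the input while losing its longer-range order.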
Notes#
The input file can be provided in gzip format. However, the output will be a plain text file, as sequences are read and written line by line.
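The pattern described in this note (gzip or plain input, plain-text output, processed line by line) might look roughly like the sketch below. The helper names open_fasta and stream_fasta are hypothetical, and detecting compression by the .gz extension is an assumption; the script itself may do this differently.

```python
import gzip


def open_fasta(path):
    """Open a fasta file for text reading; .gz files are decompressed.

    Assumption: compression is detected from the file extension.
    """
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def stream_fasta(in_path, out_path):
    """Copy a fasta file line by line into a plain-text file.

    Returns the number of header lines (records) seen. A real tool
    would transform each sequence line here instead of copying it.
    """
    n_records = 0
    with open_fasta(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                n_records += 1
            dst.write(line)
    return n_records
```

Because only one line is held in memory at a time, this approach works for large inputs, at the cost of writing uncompressed output.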
Usage#
python generate_synthetic.py -h
usage: generate_synthetic.py [-h] [-b BLOCK_SIZE] [-c CONTROL_DIST] [-o OUTFILE]
                             [--do_reverse_complement]
                             infile_path

Take fasta files, generate synthetic sequences. Accepts .gz files.

positional arguments:
  infile_path           path to fasta/gz file

options:
  -h, --help            show this help message and exit
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        size of block to generate synthetic sequences from as
                        negative control (DEFAULT: 2)
  -c CONTROL_DIST, --control_dist CONTROL_DIST
                        generate control distribution by [ bootstrap | frequency
                        | /path/to/file ] (DEFAULT: frequency)
  -o OUTFILE, --outfile OUTFILE
                        write synthetic sequences (DEFAULT: "out.fa")
  --do_reverse_complement
                        turn on reverse complement (DEFAULT: OFF)
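
For reference, the reverse complement operation named by --do_reverse_complement can be sketched as below. This is the standard definition for DNA, not the script's own code. The function name is hypothetical, and how the script treats lowercase bases or ambiguity codes is not documented here, so this sketch handles only A, C, G, T and N in either case.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Characters outside A, C, G, T, N are passed through unchanged.
    """
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGC"))
```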