Generate synthetic sequences for use in classification¶
This explains the use of generate_synthetic.py. Generates synthetic sequences given a fasta file.
Source data¶
Any fasta file can be used.
Results¶
Note
Entry points are available if this is installed using the automated conda method. You can then use the command line argument directly, for example: create_dataset_bio. If not, you will need to use the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
Running the code as below:
python generate_synthetic.py \
path/to/infile.fa \
-o path/to/outfile.fa
You will obtain a fasta file with synthetic sequences generated according to your settings. By default, dinucleotide frequency is calculated for each sequence and used to generate a corresponding null sequence. Reverse complement is possible if needed. This can be used in two-step classification in cases where you do not have a control set.
Notes¶
The input file can be provided in gzip format. However, output will be a plain text file as sequences are read and written line by line.
Usage¶
python generate_synthetic.py -h
usage: generate_synthetic.py [-h] [-b BLOCK_SIZE] [-c CONTROL_DIST] [-o OUTFILE]
[--do_reverse_complement]
infile_path
Take fasta files, generate synthetic sequences. Accepts .gz files.
positional arguments:
infile_path path to fasta/gz file
options:
-h, --help show this help message and exit
-b BLOCK_SIZE, --block_size BLOCK_SIZE
size of block to generate synthetic sequences from as
negative control (DEFAULT: 2)
-c CONTROL_DIST, --control_dist CONTROL_DIST
generate control distribution by [ bootstrap | frequency
| /path/to/file ] (DEFAULT: frequency)
-o OUTFILE, --outfile OUTFILE
write synthetic sequences (DEFAULT: "out.fa")
--do_reverse_complement
turn on reverse complement (DEFAULT: OFF)