Generate synthetic sequences for use in classification#

This page explains how to use generate_synthetic.py, which generates synthetic sequences from a given fasta file.

Source data#

Any fasta file can be used.

Results#

Note

Entry points are available if the package was installed using the automated conda method. You can then call the command directly, for example: create_dataset_bio. Otherwise, you will need to run the script itself, which follows the same naming pattern, for example: python create_dataset_bio.py.

Run the script as follows:

python generate_synthetic.py \
  path/to/infile.fa \
  -o path/to/outfile.fa

You will obtain a fasta file of synthetic sequences generated according to your settings. By default (block size 2), the dinucleotide frequency of each input sequence is calculated and used to generate a corresponding null sequence. Reverse complementing can be turned on with --do_reverse_complement if needed. The output can serve as the control set in two-step classification when you do not have one.
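To illustrate the idea of a frequency-based null sequence, here is a minimal sketch. It is not the code in generate_synthetic.py: the function names `block_frequencies` and `synthetic_from_frequencies` are hypothetical, and the choice to count overlapping blocks and sample them independently is an assumption about what "frequency" could mean here.

```python
import random
from collections import Counter


def block_frequencies(seq, block_size=2):
    """Count overlapping blocks of length block_size (hypothetical helper)."""
    return Counter(
        seq[i:i + block_size] for i in range(len(seq) - block_size + 1)
    )


def synthetic_from_frequencies(seq, block_size=2, rng=None):
    """Build a null sequence of the same length by sampling blocks
    in proportion to how often they occur in seq (assumed behaviour)."""
    rng = rng or random.Random()
    counts = block_frequencies(seq, block_size)
    if not counts:
        return seq  # sequence is shorter than one block; nothing to sample
    blocks = list(counts)
    weights = [counts[b] for b in blocks]
    n_blocks = -(-len(seq) // block_size)  # ceiling division
    sampled = rng.choices(blocks, weights=weights, k=n_blocks)
    return "".join(sampled)[:len(seq)]  # trim to the original length


original = "ACGTACGTTTGACCA"
null_seq = synthetic_from_frequencies(original, rng=random.Random(0))
print(len(original), len(null_seq))
```

The synthetic sequence keeps the length and approximate dinucleotide composition of its source while losing the original ordering, which is what makes it usable as a stand-in control.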

Notes#

The input file can be gzip-compressed. The output, however, is always a plain text file, because sequences are read and written line by line.
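The line-by-line pattern described in this note can be sketched as below. This is an illustration only, not the script's own code: `open_fasta` and `copy_records` are hypothetical names, and detecting gzip input by the .gz extension is an assumption.

```python
import gzip


def open_fasta(path):
    """Open a fasta file for text reading; .gz files are decompressed
    on the fly (extension-based detection is an assumption)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def copy_records(infile_path, outfile_path, transform=lambda s: s):
    """Stream the input one line at a time and write plain text output.
    Header lines pass through unchanged; sequence lines go through
    transform (a hypothetical hook for the synthetic-sequence step)."""
    with open_fasta(infile_path) as src, open(outfile_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">"):
                dst.write(line + "\n")
            elif line:
                dst.write(transform(line) + "\n")
```

Because only one line is held in memory at a time, this approach works for large inputs, at the cost of writing uncompressed output.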

Usage#

python generate_synthetic.py -h
usage: generate_synthetic.py [-h] [-b BLOCK_SIZE] [-c CONTROL_DIST] [-o OUTFILE]
                             [--do_reverse_complement]
                             infile_path

Take fasta files, generate synthetic sequences. Accepts .gz files.

positional arguments:
  infile_path           path to fasta/gz file

options:
  -h, --help            show this help message and exit
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        size of block to generate synthetic sequences from as
                        negative control (DEFAULT: 2)
  -c CONTROL_DIST, --control_dist CONTROL_DIST
                        generate control distribution by [ bootstrap | frequency
                        | /path/to/file ] (DEFAULT: frequency)
  -o OUTFILE, --outfile OUTFILE
                        write synthetic sequences (DEFAULT: "out.fa")
  --do_reverse_complement
                        turn on reverse complement (DEFAULT: OFF)
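
For readers unfamiliar with the operation behind --do_reverse_complement, the following is a minimal sketch of reverse complementing a nucleotide sequence. The `reverse_complement` function and `COMPLEMENT` table are hypothetical and are not taken from the script; how the script combines this step with synthetic generation is not shown here.

```python
# Translation table for DNA bases; N is left unchanged (assumed convention).
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AAGC"))  # prints GCTT
```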