Generate synthetic sequences for use in classification

This document explains how to use generate_synthetic.py, which generates synthetic sequences from a fasta file.

Source data

Any fasta file can be used.

Results

Note

Entry points are available if the package was installed using the automated conda method. You can then call the command directly, for example: create_dataset_bio. Otherwise, run the script itself, which follows the same naming pattern, for example: python create_dataset_bio.py.

Run the script as shown below:

python generate_synthetic.py \
  path/to/infile.fa \
  -o path/to/outfile.fa

You will obtain a fasta file of synthetic sequences generated according to your settings. By default, dinucleotide frequencies (block size 2) are calculated for each input sequence and used to generate a corresponding null sequence. Reverse complementing can be turned on with --do_reverse_complement. The output can serve as the control set in two-step classification when you do not have one.
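To make the idea of a frequency-based null sequence concrete, here is a minimal sketch in Python. It is not the code in generate_synthetic.py: the function names (block_frequencies, synthetic_from_frequency, reverse_complement) are hypothetical, and it assumes that blocks are counted in overlapping windows and then sampled independently in proportion to their counts until the original length is reached. The actual script may count or sample differently.

```python
import random
from collections import Counter


def block_frequencies(seq, block_size=2):
    """Count overlapping blocks of length block_size (assumed counting scheme)."""
    return Counter(
        seq[i:i + block_size] for i in range(len(seq) - block_size + 1)
    )


def synthetic_from_frequency(seq, block_size=2, rng=None):
    """Build a null sequence of the same length as seq by drawing blocks
    in proportion to how often they occur in seq (illustrative only)."""
    rng = rng or random.Random()
    counts = block_frequencies(seq, block_size)
    if not counts:
        return seq  # sequence shorter than one block: nothing to sample
    blocks = list(counts)
    weights = [counts[b] for b in blocks]
    n_blocks = -(-len(seq) // block_size)  # ceiling division
    drawn = rng.choices(blocks, weights=weights, k=n_blocks)
    # Trim in case the last block overshoots the original length.
    return "".join(drawn)[:len(seq)]


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only, case preserved)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]
```

In this sketch each input sequence gets its own null counterpart, so per-sequence composition is roughly preserved while longer-range structure is destroyed, which is the property a control set for classification needs.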

Notes

The input file can be gzip-compressed. However, the output is always a plain text file, as sequences are read and written line by line.
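If you want to inspect the input or output yourself, the following sketch reads a fasta file line by line whether or not it is gzip-compressed. It is an illustration, not the script's own reader: open_fasta and read_fasta are hypothetical helper names, and gzip detection by magic bytes is an assumption made here (the script may simply check the file extension).

```python
import gzip


def open_fasta(path):
    """Open a fasta file as text, detecting gzip by its two magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) pairs, reading one line at a time."""
    header, chunks = None, []
    with open_fasta(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because records are yielded one at a time, memory use stays small even for large files, which matches the line-by-line behaviour described above.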

Usage

python generate_synthetic.py -h
usage: generate_synthetic.py [-h] [-b BLOCK_SIZE] [-c CONTROL_DIST] [-o OUTFILE]
                             [--do_reverse_complement]
                             infile_path

Take fasta files, generate synthetic sequences. Accepts .gz files.

positional arguments:
  infile_path           path to fasta/gz file

options:
  -h, --help            show this help message and exit
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        size of block to generate synthetic sequences from as
                        negative control (DEFAULT: 2)
  -c CONTROL_DIST, --control_dist CONTROL_DIST
                        generate control distribution by [ bootstrap | frequency
                        | /path/to/file ] (DEFAULT: frequency)
  -o OUTFILE, --outfile OUTFILE
                        write synthetic sequences (DEFAULT: "out.fa")
  --do_reverse_complement
                        turn on reverse complement (DEFAULT: OFF)