Generate synthetic sequences for use in classification
======================================================

This explains the use of ``generate_synthetic.py``. Generates synthetic sequences given a ``fasta`` file.

Source data
-----------

Any ``fasta`` file can be used.

Results
-------

.. NOTE::

  Entry points are available if this is installed using the automated conda method. You can then use the command line argument directly, for example: ``create_dataset_bio``. If not, you will need to use the script directly, which follows the same naming pattern, for example: ``python create_dataset_bio.py``.

Running the code as below::

  python generate_synthetic.py \
    path/to/infile.fa \
    -o path/to/outfile.fa

You will obtain a ``fasta`` file with synthetic sequences generated according to your settings. By default, dinucleotide frequency is calculated **for each sequence** and used to generate a corresponding null sequence. Reverse complement is possible if needed. This can be used in two-step classification in cases where you do not have a control set.

Notes
-----

The input file can be provided in ``gzip`` format. However, output will be a plain ``text`` file as sequences are read and written line by line.

Usage
-----

::

  python generate_synthetic.py -h
  usage: generate_synthetic.py [-h] [-b BLOCK_SIZE] [-c CONTROL_DIST] [-o OUTFILE]
                               [--do_reverse_complement]
                               infile_path

  Take fasta files, generate synthetic sequences. Accepts .gz files.

  positional arguments:
    infile_path           path to fasta/gz file

  options:
    -h, --help            show this help message and exit
    -b BLOCK_SIZE, --block_size BLOCK_SIZE
                          size of block to generate synthetic sequences from as
                          negative control (DEFAULT: 2)
    -c CONTROL_DIST, --control_dist CONTROL_DIST
                          generate control distribution by [ bootstrap | frequency
                          | /path/to/file ] (DEFAULT: frequency)
    -o OUTFILE, --outfile OUTFILE
                          write synthetic sequences (DEFAULT: "out.fa")
    --do_reverse_complement
                          turn on reverse complement (DEFAULT: OFF)