Create embeddings from a tokenised dataset#

This section explains the use of create_embedding_bio_sp.py and create_embedding_bio_kmers.py. Only use these scripts if you plan to use the embeddings directly.

Source data#

Use CSV files created by either create_dataset_bio.py or kmerise_bio.py.
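Before generating embeddings, it can be worth sanity-checking the input. The snippet below is a minimal sketch: the file name train.csv is hypothetical, and the column names are the defaults referenced by the scripts further down (-c/-l); adjust both if yours differ.

# inspect a hypothetical CSV produced by create_dataset_bio.py
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path
# "input_str" and "labels" are the scripts' default column names
print(df[["input_str", "labels"]].head())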

Results#

Note

Entry points are available if the package is installed using the automated conda method. You can then run the command directly, for example: create_embedding_bio_sp. If not, you will need to call the script itself, which follows the same naming pattern, for example: python create_embedding_bio_sp.py.

Empirical tokenisation#

create_embedding_bio_sp.py -i [INFILE_PATH ... ] -t TOKENISER_PATH -o OUTFILE_DIR
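For example, with hypothetical paths (a CSV produced upstream and its matching tokeniser file):

create_embedding_bio_sp.py -i train.csv -t tokeniser.json -o embed/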

Conventional k-mers#

create_embedding_bio_kmers.py -i [INFILE_PATH ... ] -t TOKENISER_PATH -o OUTFILE_DIR
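For example, with hypothetical paths and a k-mer size of 5:

create_embedding_bio_kmers.py -i kmers.csv -t tokeniser.json -o embed/ -k 5 -w 1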

The resulting output is then used downstream by embedding_pipeline.py.

Notes#

Embeddings are generated for each individual token. For example:

# original seq of category X
AAAAACCCCCTTTTTGGGGG

# split into tokens using desired method
[AAAAA]
[CCCCC]
...

# each token gets projected onto an embedding
[0.1 0.2 0.3 ...]
[0.3 0.4 0.5 ...]
...
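The --w2v_* options in the help output below suggest a gensim Word2Vec backend. As a minimal sketch of this per-token idea (assuming gensim; the scripts' actual internals may differ, and the sequences here are toy data), non-overlapping 5-mers can be embedded like this:

# sketch only: toy sequences, non-overlapping 5-mers, assumed gensim backend
from gensim.models import Word2Vec

seqs = ["AAAAACCCCCTTTTTGGGGG", "CCCCCGGGGGAAAAATTTTT"]
k = 5
# split each sequence into non-overlapping k-mer tokens, as in the example above
corpus = [[s[i:i + k] for i in range(0, len(s) - k + 1, k)] for s in seqs]

# sg=1 (skip-gram), vector_size=100 and window=10 mirror the script defaults
model = Word2Vec(corpus, vector_size=100, window=10, min_count=1, sg=1)

# each token is now projected onto a 100-dimensional vector
print(model.wv["AAAAA"][:3])  # values will vary between runs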

Usage#

Empirical tokenisation#

python create_embedding_bio_sp.py -h
usage: create_embedding_bio_sp.py [-h] [-i INFILE_PATH [INFILE_PATH ...]]
                                  [-o OUTPUT_DIR] [-c COLUMN_NAMES]
                                  [-l LABELS] [-x COLUMN_NAME] [-m MODEL]
                                  [-t TOKENISER_PATH]
                                  [-s SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                                  [-n NJOBS] [--w2v_min_count W2V_MIN_COUNT]
                                  [--w2v_sg W2V_SG]
                                  [--w2v_vector_size W2V_VECTOR_SIZE]
                                  [--w2v_window W2V_WINDOW]
                                  [--no_reverse_complement]
                                  [--sample_seq SAMPLE_SEQ]

Take fasta files, tokeniser and generate embedding. Fasta files can be .gz.
Sequences are reverse complemented by default.

options:
  -h, --help            show this help message and exit
  -i INFILE_PATH [INFILE_PATH ...], --infile_path INFILE_PATH [INFILE_PATH ...]
                        path to fasta/gz file
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        write embeddings to disk (DEFAULT: "embed/")
  -c COLUMN_NAMES, --column_names COLUMN_NAMES
                        column name for sp tokenised data (DEFAULT:
                        input_str)
  -l LABELS, --labels LABELS
                        column name for data labels (DEFAULT: labels)
  -x COLUMN_NAME, --column_name COLUMN_NAME
                        column name for extracting embeddings (DEFAULT:
                        input_str)
  -m MODEL, --model MODEL
                        load existing model (DEFAULT: None)
  -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                        load tokenised data
  -s SPECIAL_TOKENS [SPECIAL_TOKENS ...], --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                        assign special tokens, eg space and pad tokens
                        (DEFAULT: ["<s>", "</s>", "<unk>", "<pad>",
                        "<mask>"])
  -n NJOBS, --njobs NJOBS
                        set number of threads to use
  --w2v_min_count W2V_MIN_COUNT
                        set minimum count for w2v (DEFAULT: 1)
  --w2v_sg W2V_SG       0 for bag-of-words, 1 for skip-gram (DEFAULT: 1)
  --w2v_vector_size W2V_VECTOR_SIZE
                        set w2v matrix dimensions (DEFAULT: 100)
  --w2v_window W2V_WINDOW
                        set context window size for w2v (DEFAULT: -/+10)
  --no_reverse_complement
                        turn off reverse complement (DEFAULT: ON)
  --sample_seq SAMPLE_SEQ
                        project sample sequence on embedding (DEFAULT: None)
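To inspect a trained embedding afterwards, and assuming the output is a standard gensim Word2Vec model (the -m/--model option and --w2v_* flags suggest this, but the file name below is hypothetical):

from gensim.models import Word2Vec

model = Word2Vec.load("embed/w2v.model")  # hypothetical file name
print(model.wv.most_similar("AAAAA", topn=3))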

Conventional k-mers#

python create_embedding_bio_kmers.py -h
usage: create_embedding_bio_kmers.py [-h] [-i INFILE_PATH [INFILE_PATH ...]]
                                     [-o OUTPUT_DIR] [-m MODEL] [-k KSIZE]
                                     [-w SLIDE] [-c CHUNK] [-n NJOBS]
                                     [-s SAMPLE_SEQ] [-v VOCAB_SIZE]
                                     [--w2v_min_count W2V_MIN_COUNT]
                                     [--w2v_sg W2V_SG]
                                     [--w2v_vector_size W2V_VECTOR_SIZE]
                                     [--w2v_window W2V_WINDOW]
                                     [--no_reverse_complement]

Take tokenised data, parameters and generate embedding. Note that this takes
output of kmerise_bio.py, and NOT raw fasta files.

options:
  -h, --help            show this help message and exit
  -i INFILE_PATH [INFILE_PATH ...], --infile_path INFILE_PATH [INFILE_PATH ...]
                        path to input tokenised data file
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        write embeddings to disk (DEFAULT: "embed/")
  -m MODEL, --model MODEL
                        load existing model (DEFAULT: None)
  -k KSIZE, --ksize KSIZE
                        set size of k-mers
  -w SLIDE, --slide SLIDE
                        set length of sliding window on k-mers (min 1)
  -c CHUNK, --chunk CHUNK
                        split seqs into n-length blocks (DEFAULT: None)
  -n NJOBS, --njobs NJOBS
                        set number of threads to use
  -s SAMPLE_SEQ, --sample_seq SAMPLE_SEQ
                        set sample sequence to test model (DEFAULT: None)
  -v VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model config (DEFAULT: all)
  --w2v_min_count W2V_MIN_COUNT
                        set minimum count for w2v (DEFAULT: 1)
  --w2v_sg W2V_SG       0 for bag-of-words, 1 for skip-gram (DEFAULT: 1)
  --w2v_vector_size W2V_VECTOR_SIZE
                        set w2v matrix dimensions (DEFAULT: 100)
  --w2v_window W2V_WINDOW
                        set context window size for w2v (DEFAULT: -/+10)
  --no_reverse_complement
                        turn off reverse complement (DEFAULT: ON)
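As with the empirical tokeniser, a fuller invocation might override the word2vec defaults and test the model with a sample sequence. All paths and values below are illustrative only:

create_embedding_bio_kmers.py -i kmers.csv -o embed/ -k 5 -w 1 --w2v_vector_size 128 --w2v_window 5 -s AAAAACCCCC -n 4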