utils module

utils.bootstrap_seq(seq: str, block_size: int = 2)

Take a string and reshuffle it in blocks of N length.

Shuffles a sequence in the user-defined block size. Joins the sequence back together at the end.

Compare generate_from_freq().

Parameters:
  • seq (str) – A string of biological sequence data.

  • block_size (int) – An integer specifying the size of block to shuffle.

Returns:

A reshuffled string of the same length as the original input

Input: ACGT

Output: GTAC

If the reconstructed seq exceeds seq length it will be truncated.

Return type:

str
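
The block shuffle described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the name shuffle_blocks and the seed argument are hypothetical, added only so the example is reproducible.

```python
import random


def shuffle_blocks(seq: str, block_size: int = 2, seed=None) -> str:
    """Hypothetical stand-in for bootstrap_seq: shuffle fixed-size blocks."""
    rng = random.Random(seed)
    # Cut the sequence into consecutive blocks; the last one may be shorter.
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    # Join the blocks and trim to the original length.
    return "".join(blocks)[:len(seq)]


shuffled = shuffle_blocks("ACGTACGT", block_size=2, seed=0)
print(shuffled)
```

The output always has the same length as the input and contains the same blocks in a different order.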

utils.build_kmers(sequence: str, ksize: int) -> str

Generator that takes a fasta sequence and kmer size to return kmers

Parameters:
  • sequence (str) – an instance of a dna sequence.

  • ksize (int) – size of the k-mer

Returns:

Individual k-mers from the input sequence. If you want to control the sliding window size, you can slice the resulting output of this, e.g.

[i for i in build_kmers('ACTGACTGA', 3)]
['ACT', 'CTG', 'TGA', 'GAC', 'ACT', 'CTG', 'TGA']

[i for i in build_kmers('ACTGACTGA', 3)][::3]
['ACT', 'GAC', 'TGA']

Return type:

str
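
A sliding-window k-mer generator of the kind described here might look like the sketch below. The name sliding_kmers is hypothetical, and the sketch assumes a window that advances one base at a time; the actual build_kmers() may differ in details.

```python
from typing import Iterator


def sliding_kmers(sequence: str, ksize: int) -> Iterator[str]:
    """Hypothetical sketch: yield every k-mer with a step of one base."""
    for i in range(len(sequence) - ksize + 1):
        yield sequence[i:i + ksize]


kmers = list(sliding_kmers("ACTGACTGA", 3))
print(kmers)       # all overlapping 3-mers
print(kmers[::3])  # every third k-mer, i.e. non-overlapping windows
```

Slicing the materialised list with a step equal to ksize gives non-overlapping windows without changing the generator.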

utils.calculate_auc(run, group_name=None)

Calculate AUC for a wandb run. This assumes you logged a ROC curve.

Parameters:
  • run (wandb.Run) – an instance of a wandb.Run.

  • group_name (str) – a label for the specified group name

Returns:

A pandas.DataFrame containing AUC scores per class.

Return type:

pandas.DataFrame
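
For orientation, the area under a ROC curve can be approximated from its logged points with the trapezoidal rule. The sketch below is an assumption about the general technique only; it does not use wandb, and how calculate_auc() computes its scores internally is not stated here. The name trapezoid_auc and the sample points are hypothetical.

```python
def trapezoid_auc(fpr, tpr):
    """Hypothetical sketch: trapezoidal area under (fpr, tpr) points."""
    points = sorted(zip(fpr, tpr))
    area = 0.0
    # Sum the area of the trapezoid between each pair of adjacent points.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up ROC points for a single class.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]
auc = trapezoid_auc(fpr, tpr)
print(auc)
```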

utils.chunk_text(infile_path: str, outfile_path: str, title: str, labels: str, content: str, chunk: int = 512)

Take a csv-like file of text, process and stream to csv-like file.

Parameters:
  • infile_path (str) – A path to a file containing natural language data

  • outfile_path (str) – A path to a file containing the output

  • title (str) – Title of column containing titles (can be an identifier)

  • labels (str) – Title of column containing labels

  • content (str) – Title of column containing content

  • chunk (int) – Chunk the data into seqs of n length (DEFAULT: 512)

Returns:

The file is written directly to disk and the sequences are not returned.

Input: /path/to/infile /path/to/outfile title labels content chunk_size

Output: None

Note that this is specific for natural language data and will not work on biological sequences directly (which have specific formatting). Here we assume there are the columns: index, title, content, labels.

Return type:

None

utils.csv_to_hf(infile_neg: str, infile_pos: str, outfile_path: str)

Add hf formatting to an existing csv-like file and stream to csv-like file. Used downstream of process_seqs().

Parameters:
  • infile_neg (str) – Path to file containing negative / condition 0 data

  • infile_pos (str) – Path to file containing positive / condition 1 data

  • outfile_path (str) – Write huggingface dataset compatible output

Returns:

The file is written directly to disk and the sequences are not returned.

Input: /path/to/infile_one /path/to/infile_two /path/to/output

Output: None

This is intended to be used after process_seqs(). If used directly, it may not work as intended as some things are hardcoded.

Return type:

None

utils.dataset_to_disk(dataset: Dataset, outfile_dir: str, name: str)

Take a 🤗 dataset object, path as output and write files to disk

Parameters:
  • dataset (Dataset) – A HuggingFace datasets.Dataset object

  • outfile_dir (str) – Write the dataset files to this path

  • name (str) – The name of the split, i.e. train, test or validation. The file names will correspond to these. The validation set is optional.

Returns:

Nothing is returned; files are written directly to outfile_dir.

This is normally called by split_datasets() but can be used directly if needed. Files are written directly to disk in multiple formats for use in downstream operations, e.g. model training.

Return type:

None

utils.embed_seqs_kmers(infile_path: str, ksize: int = 5, slide: int = 1, rc: bool = True, chunk: int | None = None, outfile_path: str | None = None)

Take a file of biological sequences, process and stream to generator. Calls build_kmers() and reverse_complement(). Used to generate word2vec embeddings.

Parameters:
  • infile_path (str) – A path to a file containing biological sequence data

  • ksize (int) – size of the k-mer (DEFAULT: 5)

  • slide (int) – size of the sliding window (DEFAULT: 1). If you want no sliding to be performed, set slide equal to ksize.

  • rc (bool) – reverse complement the data (DEFAULT: True)

  • chunk (int) – chunk the data into seqs of n length (DEFAULT: None)

  • outfile_path (str) – A path to outfile (DEFAULT: None)

Returns:

Sequences are returned as a generator object for input into word2vec

Input: /path/to/infile

Output: list

Note that no sequence cleaning is performed, ‘N’ gets mapped to itself. Uppercase is assumed. Does not work on RNA!

Return type:

list
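
The effect of slide relative to ksize can be illustrated with a small sketch. The helper kmers_with_slide is hypothetical and works on a single in-memory string; it does not read files or reverse complement anything.

```python
def kmers_with_slide(sequence: str, ksize: int = 5, slide: int = 1) -> list:
    """Hypothetical sketch: k-mers taken every `slide` positions."""
    last_start = len(sequence) - ksize + 1
    return [sequence[i:i + ksize] for i in range(0, last_start, slide)]


seq = "ACGTACGTAC"
overlapping = kmers_with_slide(seq, ksize=5, slide=1)  # windows overlap
tiled = kmers_with_slide(seq, ksize=5, slide=5)        # slide == ksize, no overlap
print(overlapping)
print(tiled)
```

With slide equal to ksize each base appears in at most one k-mer, which is the "no sliding" case mentioned in the parameter description.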

utils.embed_seqs_sp(infile_path: str, outfile_path: str, chunksize: int = 1, tokeniser_path: str | None = None, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'], columns: list = ['idx', 'feature', 'labels', 'input_ids', 'token_type_ids', 'attention_mask', 'input_str'], column: str = 'input_str', labels: str | None = None)

Take a file of SP tokenised sequences, process and stream to generator. Used to generate word2vec embeddings. See also parse_sp_tokenised().

Parameters:
  • infile_path (str) – Path to csv file containing tokenised data.

  • outfile_path (str) – Path to the output csv file.

  • chunksize (int) – How many rows of the dataframe to iterate at a time.

  • tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)

  • special_tokens (list[str]) – Special tokens to substitute for. This should match the list of special tokens used in the original tokeniser (which defaults to the five special tokens shown here).

  • columns (list) – List of column headings (in infile_path)

  • column (str) – The column header with the input_str (to extract tokens)

  • labels (str) – If specified, return label column (to extract tokens)

Returns:

Sequences are returned as a generator object for input into word2vec

Input: /path/to/infile

Output: list

Return type:

list

utils.generate_from_freq(seq: str, block_size: int = 2, alphabet: list = ['A', 'C', 'G', 'T'], offset: float = 0.01)

Take a string and sample from freq distribution to fill up seq length.

Compare bootstrap_seq().

Parameters:
  • seq (str) – A string of biological sequence data

  • block_size (int) – Size of block to shuffle

  • alphabet (list[str]) – Biological alphabet present in input sequences

  • offset (float) – Adding offset avoids 0 division errors in small datasets

Returns:

Resampled sequence of the same length as the original input, with a matching frequency distribution. The frequency distribution is sampled as n-length blocks (e.g. [AA, AC, ...] or [AAA, AAC, ...]).

Input: AAAACGT

Output: ACGTAAA

If the reconstructed seq exceeds seq length it will be truncated.

Return type:

str
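
One way to resample a sequence from its block frequencies is sketched below. It rests on several assumptions that are not confirmed by this documentation: blocks are counted without overlap, the offset is added to every possible block's count as a pseudocount, and the name resample_from_block_freq and the seed argument are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def resample_from_block_freq(seq, block_size=2, alphabet=("A", "C", "G", "T"),
                             offset=0.01, seed=None):
    """Hypothetical sketch: sample blocks in proportion to observed counts."""
    rng = random.Random(seed)
    # Count non-overlapping blocks observed in the input (an assumption).
    observed = Counter(
        seq[i:i + block_size]
        for i in range(0, len(seq) - block_size + 1, block_size)
    )
    # Enumerate every possible block so unseen blocks keep a small weight.
    blocks = ["".join(p) for p in product(alphabet, repeat=block_size)]
    weights = [observed[b] + offset for b in blocks]
    # Draw enough blocks to cover the input, then trim to its length.
    n_blocks = -(-len(seq) // block_size)  # ceiling division
    sampled = rng.choices(blocks, weights=weights, k=n_blocks)
    return "".join(sampled)[:len(seq)]


resampled = resample_from_block_freq("AAAACGT", block_size=2, seed=0)
print(resampled)
```

In this sketch the offset keeps every weight above zero, so sampling still works when the input is too short to contain any complete block.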

utils.get_feature_importance_mdi(clf, features, model_type, show_features: int = 50, output_dir: str = '.') -> Series

Calculate feature importance by Gini scores. This is more effective when there are fewer classes. See also get_feature_importance_per().

Parameters:
  • clf (sklearn.ensemble) – a trained sklearn tree-like model.

  • features (np.ndarray) – the output of get_feature_names_out.

  • model_type (str) – Random Forest “rf” or XGBoost “xg”.

  • show_features (int) – number of features to plot (text export unaffected)

  • output_dir (str) – figure and list of feature importances go here.

Returns:

pandas Series object with feature importance scores mapped to features.

Return type:

pd.Series

utils.get_feature_importance_per(clf, x_test, y_test, features, model_type, show_features: int = 50, output_dir: str = '.', n_repeats: int = 10, n_jobs: int = 1) -> Series

Calculate feature importance by permutation. This tests feature importance in the context of the model only. See also get_feature_importance_mdi().

Parameters:
  • clf (sklearn.ensemble) – a trained sklearn tree-like model.

  • x_test (np.ndarray) – test data.

  • y_test (np.ndarray) – test labels.

  • features (np.ndarray) – the output of get_feature_names_out.

  • model_type (str) – Random Forest “rf” or XGBoost “xg”.

  • show_features (int) – number of features to plot (text export unaffected)

  • output_dir (str) – figure and list of feature importances go here.

  • n_repeats (int) – number of repeats for the permutation to run.

  • n_jobs (int) – number of threads for the permutation to run on.

Returns:

pandas Series object with feature importance scores mapped to features.

Return type:

pd.Series

utils.get_run_metrics(runs, group_name=None)

Get metrics for the specified runs as a pandas.DataFrame

This does not obtain the runs itself. Call wandb.Api first, select the runs you want, then pass them in here.

Parameters:
  • runs (wandb.Api.runs) – a wandb.Api.runs() object

  • group_name (str) – a label for the specified group name

Returns:

Writes the metrics obtained from wandb.Api.runs directly to disk.

Return type:

pandas.DataFrame

utils.get_tokens_from_sp(tokeniser_path: str, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'])

Take path to SentencePiece tokeniser + special tokens, return tokens

The input tokeniser_path is a json file generated from the HuggingFace implementation of SentencePiece. Compare parse_sp_tokenised().

Parameters:
  • tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)

  • special_tokens (list[str]) – Special tokens to substitute for. This should match the list of special tokens used in the original tokeniser (which defaults to the five special tokens shown here).

Returns:

A list of cleaned tokens corresponding to variable length k-mers.

Return type:

list

utils.html_to_pdf(infile_path: str, outfile_path: str | None = None, options: dict | None = None)

Convert the output of transformers interpret to pdf and write to disk.

Parameters:
  • infile_path (str) – path to transformers-interpret html output

  • outfile_path (str) – path to transformers-interpret pdf output

  • options (dict) – html to pdf conversion options

Returns:

Both pdfkit and wkhtmltopdf are required. Mainly used with interpret. Please refer to https://github.com/JazzCore/python-pdfkit:

import pdfkit
pdfkit.from_file("input.html", "output.pdf", options={...})

Return type:

None

utils.load_args_cmd(args)

Helper function to load a HfArgumentParser into TrainingArguments

Loads a HfArgumentParser class of arguments into a transformers.training_args.TrainingArguments object.

Parameters:

args (class) – A HfArgumentParser object

Returns:

For more information please refer to the huggingface documentation directly: https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/trainer#transformers.TrainingArguments

Return type:

transformers.training_args.TrainingArguments

utils.load_args_json(args_json: str)

Helper function to load a json file into TrainingArguments

Loads a json file of arguments into a transformers.training_args.TrainingArguments object.

Parameters:

args_json (str) – Path to json file with training arguments

Returns:

For more information please refer to the huggingface documentation directly: https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/trainer#transformers.TrainingArguments

Return type:

transformers.training_args.TrainingArguments

utils.parse_sp_tokenised(infile_path: str, outfile_path: str, tokeniser_path: str | None = None, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'], chunksize: int = 100, columns: list = ['idx', 'feature', 'labels', 'input_ids', 'token_type_ids', 'attention_mask', 'input_str'])

Extract entries tokenised by SentencePiece into a pandas.DataFrame object

The input infile_path is a csv file containing tokenised data as positional ordinal encodings. The data should have been tokenised using the HuggingFace implementation of SentencePiece. Writes file to disk. Compare get_tokens_from_sp(). See also embed_seqs_sp().

Parameters:
  • infile_path (str) – Path to csv file containing tokenised data.

  • outfile_path (str) – Path to the output csv file.

  • tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)

  • special_tokens (list[str]) – Special tokens to substitute for. This should match the list of special tokens used in the original tokeniser (which defaults to the five special tokens shown here).

  • chunksize (int) – How many rows of the dataframe to iterate at a time.

  • columns (list) – List of column headings

Returns:

The pandas.DataFrame contains the contents of the csv file, but numeric columns are correctly formatted as numpy.array. The remap_file argument is useful if you want to extract the k-mers directly for use in different workflows.

Return type:

None

utils.plot_hist(compare: list, outfile_path: str | None = None)

Plot histogram of alphas. Writes plot directly to disk. See also plot_scatter().

Parameters:
  • compare (list[pd.DataFrame]) – Paths to pandas dataframes with model info

  • outfile_path (str) – Write the plot to this path

Returns:

Smaller alpha is better [2, 4]. Computer Vision best models are ~2. If at least 1 layer has a score approaching 0, this indicates scale collapse. NLP models in the HuggingFace transformers library are deliberately overparameterised as they are intended as a base for fine tuning and are not a complete model. You will see values of [2, 6] before these are fine tuned, this is expected behaviour.

If you want to compare your models against existing ones in HuggingFace as a quick comparison, you can download a model to disk, substituting out your model of interest as needed in the example below, then you can pass the path to the model as an argument to compare:

from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.save_pretrained("/path/to/distilbert")

Return type:

None

utils.plot_scatter(compare: list, outfile_path: str | None = None)

Plot scatterplot of alphas. Writes plot directly to disk. See also plot_hist().

Parameters:
  • compare (list[pd.DataFrame]) – Paths to pandas dataframes with model info

  • outfile_path (str) – Write the plot to this path

Returns:

Smaller alpha is better [2, 4]. Computer Vision best models are ~2. If at least 1 layer has a score approaching 0, this indicates scale collapse. NLP models in the HuggingFace transformers library are deliberately overparameterised as they are intended as a base for fine tuning and are not a complete model. You will see values of [2, 6] before these are fine tuned, this is expected behaviour.

If you want to compare your models against existing ones in HuggingFace as a quick comparison, you can download a model to disk, substituting out your model of interest as needed in the example below, then you can pass the path to the model as an argument to compare:

from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.save_pretrained("/path/to/distilbert")

Return type:

None

utils.plot_token_dist(tokeniser_path: str, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'], outfile_dir: str = './')

Plot distribution of token lengths. Calls get_tokens_from_sp()

The input tokeniser_path is a json file generated from the HuggingFace implementation of SentencePiece.

Parameters:
  • tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)

  • special_tokens (list[str]) – Special tokens to substitute for

  • outfile_dir (str) – Path to output plots

Returns:

Token histogram plots are written to outfile_dir in png and pdf formats.

Return type:

matplotlib.pyplot

utils.process_seqs(infile_path: str, outfile_path: str, rc: bool = True, chunk: int | None = None)

Take a file of biological sequences, process and stream to csv-like file. Calls reverse_complement(). Used before csv_to_hf().

Parameters:
  • infile_path (str) – A path to a file containing biological sequence data

  • outfile_path (str) – A path to a file containing the output

  • rc (bool) – reverse complement the data (DEFAULT: True)

  • chunk (int) – chunk the data into seqs of n length (DEFAULT: None)

Returns:

The file is written directly to disk and the sequences are not returned.

Input: /path/to/infile

Output: None

Note that no sequence cleaning is performed, ‘N’ gets mapped to itself. Uppercase is assumed. Does not work on RNA!

Return type:

None

utils.remove_stopwords(dataset: str, column: str | None = None, highmem: bool = True)

Remove English language stopwords from text. Stopwords are obtained from SpaCy 3.2.4.

Parameters:
  • dataset (str) – A path to a comma separated .csv file

  • column (str) – The name of the column to be cleaned. If no column text is provided (default), parses all columns. This option is disabled if highmem is set to False!

  • highmem (bool) – If True (default), uses pandas to operate on the file. If False, parses the file line by line, overriding column selection!

Returns:

New file path with removed stopwords, named dataset.CLEAN. Note that stopwords with leading uppercase are also removed. For example “the” and “The” will be treated the same and removed. To obtain the stopwords list for English used in this function:

#!/bin/bash
python -m spacy download en_core_web_sm

#!/usr/bin/python
import spacy
sp = spacy.load('en_core_web_sm')
stopwords_en = sp.Defaults.stop_words

Return type:

str
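
The case-insensitive matching described above ("the" and "The" are both removed) can be sketched without spaCy. The stopword set below is a small hypothetical subset used only for illustration, and strip_stopwords is not part of this module.

```python
# Hypothetical subset; the real list comes from spaCy's English defaults.
STOPWORDS = {"the", "a", "of", "and"}


def strip_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Hypothetical sketch: drop whitespace-separated stopwords, ignoring case."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


cleaned = strip_stopwords("The structure of the genome and a gene")
print(cleaned)
```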

utils.reverse_complement(dna: str)

Take a nucleic acid string as input and return reverse complement.

Parameters:

dna (str) – A string of nucleic acid sequence data.

Returns:

Reverse complemented DNA/RNA string.

Input: AACG

Output: CGTT

Note that no sequence cleaning is performed, ‘N’ gets mapped to itself. Uppercase is assumed. If U is detected, RNA is assumed automatically. Letters YRKMSW are supported. BDHV are converted to N.

Return type:

str
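
A reverse complement with the ambiguity-code behaviour described in the note could be written as below. This is a sketch under stated assumptions, not the module's code: the name revcomp and the two translation tables are hypothetical, with Y and R swapped, K and M swapped, S and W unchanged, and B, D, H, V collapsed to N.

```python
# Hypothetical translation tables following the behaviour in the note above.
DNA_MAP = str.maketrans("ACGTYRKMSWBDHVN", "TGCARYMKSWNNNNN")
RNA_MAP = str.maketrans("ACGUYRKMSWBDHVN", "UGCARYMKSWNNNNN")


def revcomp(seq: str) -> str:
    """Hypothetical sketch: complement each base, then reverse the string."""
    # Treat the input as RNA if it contains U, otherwise as DNA.
    table = RNA_MAP if "U" in seq else DNA_MAP
    return seq.translate(table)[::-1]


print(revcomp("AACG"))   # DNA input
print(revcomp("AACGU"))  # RNA input, detected by the presence of U
print(revcomp("ANBY"))   # N stays N, B becomes N, Y becomes R
```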

utils.split_datasets(dataset: DatasetDict, outfile_dir: str, train: float, test: float = 0, val: float = 0, shuffle: bool = False)

Split data into training | testing | validation sets

Parameters:
  • dataset (DatasetDict) – A HuggingFace DatasetDict object

  • outfile_dir (str) – Write the dataset files to this path

  • train (float) – Proportion of dataset for training

  • test (float) – Proportion of dataset for testing

  • val (float) – Proportion of dataset for validation

  • shuffle (bool) – Shuffle the dataset before splitting

Returns:

Returns a datasets.DatasetDict object with corresponding train | test | valid splits. Writes files to outfile_dir.

Specifying the validation set is optional. However, note that train + test + validation proportions must sum to 1! This calls dataset_to_disk() to write files to disk. File names will match the corresponding split: train | test | valid

Return type:

DatasetDict
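
The requirement that the three proportions sum to 1 can be checked before any data is touched. The sketch below splits row indices only and does not use the datasets library; split_indices is a hypothetical helper, and rounding the split sizes is an assumption.

```python
import math


def split_indices(n_rows: int, train: float, test: float = 0.0, val: float = 0.0) -> dict:
    """Hypothetical sketch: validate proportions, then split row indices."""
    # Compare with a tolerance, since e.g. 0.8 + 0.1 + 0.1 is not exactly 1.0.
    if not math.isclose(train + test + val, 1.0):
        raise ValueError("train + test + val must sum to 1")
    n_train = round(n_rows * train)
    n_test = round(n_rows * test)
    idx = list(range(n_rows))
    return {
        "train": idx[:n_train],
        "test": idx[n_train:n_train + n_test],
        "valid": idx[n_train + n_test:],  # remaining rows go to validation
    }


splits = split_indices(10, train=0.8, test=0.1, val=0.1)
print({name: len(rows) for name, rows in splits.items()})
```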