utils module¶
- utils.bootstrap_seq(seq: str, block_size: int = 2)¶
Take a string and reshuffle it in blocks of N length.
Shuffles a sequence in the user-defined block size. Joins the sequence back together at the end.
Compare generate_from_freq().
- Parameters:
seq (str) – A string of biological sequence data.
block_size (int) – An integer specifying the size of block to shuffle.
- Returns:
A reshuffled string of the same length as the original input
Input: ACGT
Output: GTAC
If the reconstructed seq exceeds seq length it will be truncated.
- Return type:
str
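The block shuffling described above can be sketched as follows. This is a minimal illustration, not the module's own implementation: the name `shuffle_blocks` and the `seed` argument are hypothetical and exist only to make the example reproducible.

```python
import random
from typing import Optional


def shuffle_blocks(seq: str, block_size: int = 2, seed: Optional[int] = None) -> str:
    """Shuffle `seq` in blocks of `block_size` and rejoin (illustrative only)."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    rng = random.Random(seed)  # hypothetical seed argument for reproducibility
    # Cut the sequence into consecutive blocks; the last block may be shorter.
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    # Rejoin and truncate so the result never exceeds the input length.
    return "".join(blocks)[:len(seq)]


shuffled = shuffle_blocks("ACGTACGT", block_size=2, seed=0)
print(shuffled)
```

Because whole blocks are moved, the composition within each block is preserved while the order of blocks changes.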
- utils.build_kmers(sequence: str, ksize: int) → str¶
Generator that takes a FASTA sequence and a k-mer size and yields k-mers.
- Parameters:
sequence (str) – a DNA sequence.
ksize (int) – size of the k-mer
- Returns:
Individual k-mers from the input sequence. If you want to control the sliding window size, you can slice the resulting output, e.g.
[i for i in build_kmers('ACTGACTGA', 3)]
['ACT', 'CTG', 'TGA', 'GAC', 'ACT', 'CTG', 'TGA']
[i for i in build_kmers('ACTGACTGA', 3)][::3]
['ACT', 'GAC', 'TGA']
- Return type:
str
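A sliding-window k-mer generator of the kind described above might look like the sketch below. The name `sliding_kmers` is hypothetical, and the sketch assumes a step of one base between windows, which may differ from the module's own code.

```python
from typing import Iterator


def sliding_kmers(sequence: str, ksize: int) -> Iterator[str]:
    """Yield every k-mer of length `ksize`, moving one base at a time."""
    for start in range(len(sequence) - ksize + 1):
        yield sequence[start:start + ksize]


kmers = list(sliding_kmers("ACTGACTGA", 3))
print(kmers)       # all overlapping 3-mers
print(kmers[::3])  # every third k-mer, i.e. non-overlapping windows
```

Slicing the materialised list with a step equal to `ksize` keeps only windows that do not overlap.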
- utils.calculate_auc(run, group_name=None)¶
Calculate AUC for a wandb run. This assumes you logged a ROC curve.
- Parameters:
run (wandb.Run) – an instance of a wandb.Run.
group_name (str) – a label for the specified group name
- Returns:
A pandas.DataFrame containing AUC scores per class.
- Return type:
pandas.DataFrame
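For orientation, the area under a logged ROC curve can be approximated from its (false positive rate, true positive rate) points with the trapezoidal rule. The sketch below is a standard-library illustration with hypothetical names and data; it does not use wandb and is not necessarily how this function computes AUC.

```python
from typing import Sequence


def trapezoid_auc(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Area under a ROC curve from (fpr, tpr) points via the trapezoidal rule."""
    if len(fpr) != len(tpr) or len(fpr) < 2:
        raise ValueError("need at least two matching (fpr, tpr) points")
    # Sort by false positive rate so the curve is integrated left to right.
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points for a single class.
auc = trapezoid_auc([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
print(auc)
```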
- utils.chunk_text(infile_path: str, outfile_path: str, title: str, labels: str, content: str, chunk: int = 512)¶
Take a csv-like file of text, process and stream to csv-like file.
- Parameters:
infile_path (str) – A path to a file containing natural language data
outfile_path (str) – A path to a file containing the output
title (str) – Title of column containing titles (can be an identifier)
labels (str) – Title of column containing labels
content (str) – Title of column containing content
chunk (int) – Chunk the data into seqs of n length (DEFAULT: 512)
- Returns:
The file is written directly to disk and the sequences are not returned.
Input: /path/to/infile /path/to/outfile title labels content chunk_size
Output: None
Note that this is specific for natural language data and will not work on biological sequences directly (which have specific formatting). Here we assume there are the columns: index, title, content, labels.
- Return type:
None
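The read, chunk and stream pattern described above can be sketched with the standard library. The helper names are hypothetical, and the sketch assumes that "length" means a count of whitespace-separated words, which the documentation does not specify. In-memory buffers stand in for the input and output files.

```python
import csv
import io
from typing import List


def chunk_words(text: str, chunk: int = 512) -> List[str]:
    """Split `text` into pieces of at most `chunk` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk]) for i in range(0, len(words), chunk)]


def stream_chunks(infile, outfile, title: str, labels: str, content: str, chunk: int = 512) -> None:
    """Read rows one at a time and write one output row per chunk."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=[title, labels, content])
    writer.writeheader()
    for row in reader:
        for piece in chunk_words(row[content], chunk):
            writer.writerow({title: row[title], labels: row[labels], content: piece})


# In-memory stand-ins for the input and output files.
source = io.StringIO("index,title,content,labels\n0,doc1,a b c d e,pos\n")
target = io.StringIO()
stream_chunks(source, target, "title", "labels", "content", chunk=2)
print(target.getvalue())
```

Each chunk inherits the title and label of the row it came from, so downstream steps can still group chunks by document.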
- utils.csv_to_hf(infile_neg: str, infile_pos: str, outfile_path: str)¶
Add hf formatting to an existing csv-like file and stream to csv-like file. Used downstream of process_seqs().
- Parameters:
infile_neg (str) – Path to file containing negative / condition 0 data
infile_pos (str) – Path to file containing positive / condition 1 data
outfile_path (str) – Write huggingface dataset compatible output
- Returns:
The file is written directly to disk and the sequences are not returned.
Input: /path/to/infile_one /path/to/infile_two /path/to/output
Output: None
This is intended to be used after process_seqs(). If used directly, it may not work as intended because some values are hardcoded.
- Return type:
None
- utils.dataset_to_disk(dataset: Dataset, outfile_dir: str, name: str)¶
Take a 🤗 dataset object, path as output and write files to disk
- Parameters:
dataset (Dataset) – A HuggingFace datasets.Dataset object
outfile_dir (str) – Write the dataset files to this path
name (str) – The name of the split, i.e. train, test, validation. The file names will correspond to these. Validation set is optional.
- Returns:
Nothing is returned; this writes files directly to outfile_dir.
This is normally called by split_datasets() but can be used directly if needed. Files are written directly to disk in multiple formats for use in downstream operations, e.g. model training.
- Return type:
None
- utils.embed_seqs_kmers(infile_path: str, ksize: int = 5, slide: int = 1, rc: bool = True, chunk: int | None = None, outfile_path: str | None = None)¶
Take a file of biological sequences, process and stream to generator. Calls build_kmers() and reverse_complement(). Used to generate word2vec embeddings.
- Parameters:
infile_path (str) – A path to a file containing biological sequence data
ksize (int) – size of the k-mer (DEFAULT: 5)
slide (int) – size of the sliding window (DEFAULT: 1). If you want no sliding to be performed, set slide equal to ksize.
rc (bool) – reverse complement the data (DEFAULT: TRUE)
chunk (int) – chunk the data into seqs of n length (DEFAULT: None)
outfile_path (str) – A path to outfile (DEFAULT: None)
- Returns:
Sequences are returned as a generator object for input into word2vec
Input: /path/to/infile
Output: list
Note that no sequence cleaning is performed, ‘N’ gets mapped to itself. Uppercase is assumed. Does not work on RNA!
- Return type:
list
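How `ksize`, `slide` and `rc` interact can be sketched as below. This is an assumption-laden illustration: `kmer_sentences` and `COMPLEMENT` are hypothetical names, sequences are passed as an in-memory list rather than read from a file, and only the DNA letters A, C, G, T and N are handled.

```python
from typing import Iterator, List

# Minimal DNA complement table for this sketch; N maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def kmer_sentences(seqs: List[str], ksize: int = 5, slide: int = 1, rc: bool = True) -> Iterator[List[str]]:
    """Yield one list of k-mers per sequence, plus one per reverse complement if rc."""
    for seq in seqs:
        strands = [seq, seq.translate(COMPLEMENT)[::-1]] if rc else [seq]
        for strand in strands:
            yield [strand[i:i + ksize] for i in range(0, len(strand) - ksize + 1, slide)]


sentences = list(kmer_sentences(["AACGT"], ksize=3, slide=1, rc=True))
print(sentences)
```

Each yielded list plays the role of a "sentence" of k-mer "words", which is the iterable-of-token-lists shape that word2vec implementations generally expect. Setting `slide` equal to `ksize` yields non-overlapping k-mers.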
- utils.embed_seqs_sp(infile_path: str, outfile_path: str, chunksize: int = 1, tokeniser_path: str | None = None, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'], columns: list = ['idx', 'feature', 'labels', 'input_ids', 'token_type_ids', 'attention_mask', 'input_str'], column: str = 'input_str', labels: str | None = None)¶
Take a file of SP tokenised sequences, process and stream to generator. Used to generate word2vec embeddings. See also parse_sp_tokenised().
- Parameters:
infile_path (str) – Path to csv file containing tokenised data.
outfile_path (str) – Path to csv file containing tokenised data.
chunksize (int) – How many rows of the dataframe to iterate at a time.
tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)
special_tokens (list[str]) – Special tokens to substitute for. This should match the list of special tokens used in the original tokeniser (which defaults to the five special tokens shown here).
columns (list) – List of column headings (in infile_path)
column (str) – The column header with the input_str (to extract tokens)
labels (str) – If specified, return label column (to extract tokens)
- Returns:
Sequences are returned as a generator object for input into word2vec
Input: /path/to/infile
Output: list
- Return type:
list
- utils.generate_from_freq(seq: str, block_size: int = 2, alphabet: list = ['A', 'C', 'G', 'T'], offset: float = 0.01)¶
Take a string and sample from freq distribution to fill up seq length.
Compare bootstrap_seq().
- Parameters:
seq (str) – A string of biological sequence data
block_size (int) – Size of block to shuffle
alphabet (list[str]) – Biological alphabet present in input sequences
offset (float) – Adding offset avoids 0 division errors in small datasets
- Returns:
Resampled sequence with matching frequency distribution of the same length as the original input. Frequency distribution is sampled as n-length blocks (e.g. [AA, AC, ...] or [AAA, AAC, ...]).
Input: AAAACGT
Output: ACGTAAA
If the reconstructed seq exceeds seq length it will be truncated.
- Return type:
str
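Frequency-based resampling of this kind can be sketched as follows. The function name and the `seed` argument are hypothetical, and counting overlapping blocks is an assumption; the module may count blocks differently. The `offset` is added to every block count so that blocks absent from a short input still have a small non-zero sampling weight.

```python
import itertools
import random
from collections import Counter
from typing import List, Optional


def sample_from_block_freq(seq: str, block_size: int = 2, alphabet: Optional[List[str]] = None,
                           offset: float = 0.01, seed: Optional[int] = None) -> str:
    """Resample a sequence of the same length from observed block frequencies."""
    alphabet = alphabet or ["A", "C", "G", "T"]
    rng = random.Random(seed)  # hypothetical seed argument for reproducibility
    # Count overlapping blocks observed in the input (an assumption of this sketch).
    observed = Counter(seq[i:i + block_size] for i in range(len(seq) - block_size + 1))
    # Enumerate every possible block, e.g. AA, AC, ... for block_size=2.
    blocks = ["".join(p) for p in itertools.product(alphabet, repeat=block_size)]
    # The offset keeps unseen blocks at a small non-zero weight.
    weights = [observed[b] + offset for b in blocks]
    n_blocks = -(-len(seq) // block_size)  # ceiling division
    sampled = rng.choices(blocks, weights=weights, k=n_blocks)
    # Truncate in case the joined blocks overshoot the input length.
    return "".join(sampled)[:len(seq)]


resampled = sample_from_block_freq("AAAACGT", block_size=2, seed=1)
print(resampled)
```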
- utils.get_feature_importance_mdi(clf, features, model_type, show_features: int = 50, output_dir: str = '.') → Series¶
Calculate feature importance by Gini scores. This is more effective when there are fewer classes. See also get_feature_importance_per().
- Parameters:
clf (sklearn.ensemble) – a trained sklearn tree-like model.
features (np.ndarray) – the output of get_feature_names_out.
model_type (str) – Random Forest “rf” or XGBoost “xg”.
show_features (int) – number of features to plot (text export unaffected)
output_dir (str) – figure and list of feature importances go here.
- Returns:
pandas Series object with feature importance scores mapped to features.
- Return type:
pd.Series
- utils.get_feature_importance_per(clf, x_test, y_test, features, model_type, show_features: int = 50, output_dir: str = '.', n_repeats: int = 10, n_jobs: int = 1) → Series¶
Calculate feature importance by permutation. This tests feature importance in the context of the model only. See also get_feature_importance_mdi().
- Parameters:
clf (sklearn.ensemble) – a trained sklearn tree-like model.
x_test (np.ndarray) – test data.
y_test (np.ndarray) – test labels.
features (np.ndarray) – the output of get_feature_names_out.
model_type (str) – Random Forest “rf” or XGBoost “xg”.
show_features (int) – number of features to plot (text export unaffected)
output_dir (str) – figure and list of feature importances go here.
n_repeats (int) – number of repeats for the permutation to run.
n_jobs (int) – number of threads for the permutation to run on.
- Returns:
pandas Series object with feature importance scores mapped to features.
- Return type:
pd.Series
- utils.get_run_metrics(runs, group_name=None)¶
Get metrics for the specified runs as a pandas.DataFrame.
This does not directly obtain the runs; you will need to call wandb.Api first and specify the runs you want before passing them in here.
- Parameters:
runs (wandb.Api.runs) – a wandb.Api.runs() object
group_name (str) – a label for the specified group name
- Returns:
Writes the metrics obtained from wandb.Api.runs directly to disk.
- Return type:
pandas.DataFrame
- utils.get_tokens_from_sp(tokeniser_path: str, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'])¶
Take path to SentencePiece tokeniser + special tokens, return tokens.
The input tokeniser_path is a json file generated from the HuggingFace implementation of SentencePiece. Compare parse_sp_tokenised().
- Parameters:
tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)
special_tokens (list[str]) – Special tokens to substitute for. This should match the list of special tokens used in the original tokeniser (which defaults to the five special tokens shown here).
- Returns:
A list of cleaned tokens corresponding to variable length k-mers.
- Return type:
list
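Extracting cleaned tokens from a tokeniser json file might look like the sketch below. It assumes the vocabulary is stored under `model.vocab` as a list of `[token, score]` pairs and that `▁` marks word boundaries; both are assumptions about the file layout, and `tokens_from_json` is a hypothetical name. A json string stands in for the file contents.

```python
import json
from typing import List, Optional

DEFAULT_SPECIAL = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]


def tokens_from_json(text: str, special_tokens: Optional[List[str]] = None) -> List[str]:
    """Return vocabulary tokens with special tokens and boundary markers removed."""
    special = set(DEFAULT_SPECIAL if special_tokens is None else special_tokens)
    vocab = json.loads(text)["model"]["vocab"]  # assumed layout: [[token, score], ...]
    tokens = []
    for token, _score in vocab:
        if token in special:
            continue
        cleaned = token.replace("\u2581", "")  # drop the word-boundary marker
        if cleaned:
            tokens.append(cleaned)
    return tokens


# Hypothetical tokeniser contents for illustration.
example = json.dumps({"model": {"vocab": [["<s>", 0.0], ["\u2581ACG", -1.5], ["TT", -2.0]]}})
tokens = tokens_from_json(example)
print(tokens)
print([len(t) for t in tokens])  # token lengths, i.e. variable-length k-mers
```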
- utils.html_to_pdf(infile_path: str, outfile_path: str | None = None, options: dict | None = None)¶
Convert the output of transformers interpret to pdf and write to disk.
- Parameters:
infile_path (str) – path to transformers-interpret html output
outfile_path (str) – path to transformers-interpret pdf output
options (dict) – html to pdf conversion options
- Returns:
Both pdfkit and wkhtmltopdf are required. Mainly used with interpret. Please refer to https://github.com/JazzCore/python-pdfkit:
import pdfkit
pdfkit.from_file("input.html", "output.pdf", options={...})
- Return type:
None
- utils.load_args_cmd(args)¶
Helper function to load a HfArgumentParser into TrainingArguments
Loads a HfArgumentParser class of arguments into a transformers.training_args.TrainingArguments object.
- Parameters:
args (class) – A HfArgumentParser object
- Returns:
For more information please refer to the huggingface documentation directly: https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/trainer#transformers.TrainingArguments
- Return type:
transformers.training_args.TrainingArguments
- utils.load_args_json(args_json: str)¶
Helper function to load a json file into TrainingArguments
Loads a json file of arguments into a transformers.training_args.TrainingArguments object.
- Parameters:
args_json (str) – Path to json file with training arguments
- Returns:
For more information please refer to the huggingface documentation directly: https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/trainer#transformers.TrainingArguments
- Return type:
transformers.training_args.TrainingArguments
- utils.parse_sp_tokenised(infile_path: str, outfile_path: str, tokeniser_path: str | None = None, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'], chunksize: int = 100, columns: list = ['idx', 'feature', 'labels', 'input_ids', 'token_type_ids', 'attention_mask', 'input_str'])¶
Extract entries tokenised by SentencePiece into a pandas.DataFrame object
The input infile_path is a csv file containing tokenised data as positional ordinal encodings. The data should have been tokenised using the HuggingFace implementation of SentencePiece. Writes file to disk. Compare get_tokens_from_sp(). See also embed_seqs_sp().
- Parameters:
infile_path (str) – Path to csv file containing tokenised data.
outfile_path (str) – Path to csv file containing tokenised data.
tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)
special_tokens (list[str]) – Special tokens to substitute for. This should match the list of special tokens used in the original tokeniser (which defaults to the five special tokens shown here).
chunksize (int) – How many rows of the dataframe to iterate at a time.
columns (list) – List of column headings
- Returns:
The pandas.DataFrame contains the contents of the csv file, but numeric columns are correctly formatted as numpy.array. The remap_file argument is useful if you want to extract the k-mers directly for use in different workflows.
- Return type:
None
- utils.plot_hist(compare: list, outfile_path: str | None = None)¶
Plot histogram of alphas. Writes plot directly to disk. Also see plot_scatter().
- Parameters:
compare (list[pd.DataFrame]) – Paths to pandas dataframes with model info
outfile_path (str) – Write the plot to this path
- Returns:
Smaller alpha is better, within the range [2, 4]. Computer Vision best models are ~2. If at least 1 layer has a score approaching 0, this indicates scale collapse. NLP models in the HuggingFace transformers library are deliberately overparameterised as they are intended as a base for fine tuning and are not a complete model. You will see values of [2, 6] before these are fine tuned; this is expected behaviour.
If you want to compare your models against existing ones in HuggingFace as a quick comparison, you can download a model to disk, substituting your model of interest as needed in the example below, then pass the path to the model as an argument to compare:
from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.save_pretrained("/path/to/distilbert")
- Return type:
None
- utils.plot_scatter(compare: list, outfile_path: str | None = None)¶
Plot scatterplot of alphas. Writes plot directly to disk. Also see plot_hist().
- Parameters:
compare (list[pd.DataFrame]) – Paths to pandas dataframes with model info
outfile_path (str) – Write the plot to this path
- Returns:
Smaller alpha is better, within the range [2, 4]. Computer Vision best models are ~2. If at least 1 layer has a score approaching 0, this indicates scale collapse. NLP models in the HuggingFace transformers library are deliberately overparameterised as they are intended as a base for fine tuning and are not a complete model. You will see values of [2, 6] before these are fine tuned; this is expected behaviour.
If you want to compare your models against existing ones in HuggingFace as a quick comparison, you can download a model to disk, substituting your model of interest as needed in the example below, then pass the path to the model as an argument to compare:
from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.save_pretrained("/path/to/distilbert")
- Return type:
None
- utils.plot_token_dist(tokeniser_path: str, special_tokens: list = ['<s>', '</s>', '<unk>', '<pad>', '<mask>'], outfile_dir: str = './')¶
Plot distribution of token lengths. Calls get_tokens_from_sp().
The input tokeniser_path is a json file generated from the HuggingFace implementation of SentencePiece.
- Parameters:
tokeniser_path (str) – Path to sequence tokens file (from SentencePiece)
special_tokens (list[str]) – Special tokens to substitute for
outfile_dir (str) – Path to output plots
- Returns:
Token histogram plots are written to outfile_dir in png and pdf formats.
- Return type:
matplotlib.pyplot
- utils.process_seqs(infile_path: str, outfile_path: str, rc: bool = True, chunk: int | None = None)¶
Take a file of biological sequences, process and stream to csv-like file. Calls reverse_complement(). Used before csv_to_hf().
- Parameters:
infile_path (str) – A path to a file containing biological sequence data
outfile_path (str) – A path to a file containing the output
rc (bool) – reverse complement the data (DEFAULT: TRUE)
chunk (int) – chunk the data into seqs of n length (DEFAULT: None)
- Returns:
The file is written directly to disk and the sequences are not returned.
Input: /path/to/infile
Output: None
Note that no sequence cleaning is performed, ‘N’ gets mapped to itself. Uppercase is assumed. Does not work on RNA!
- Return type:
None
- utils.remove_stopwords(dataset: str, column: str | None = None, highmem: bool = True)¶
Remove English language stopwords from text. Stopwords are obtained from SpaCy 3.2.4.
- Parameters:
dataset (str) – A path to a comma separated .csv file
column (str) – The name of the column to be cleaned. If no column text is provided (default), parses all columns. This option is disabled if highmem is set to False!
highmem (bool) – If True (default), uses pandas to operate on the file. If False, parses the file line by line, overriding column selection!
- Returns:
New file path with removed stopwords, named dataset.CLEAN. Note that stopwords with leading uppercase are also removed. For example “the” and “The” will be treated the same and removed. To obtain the stopwords list for English used in this function:
#!/bin/bash
python -m spacy download en_core_web_sm
#!/usr/bin/python
import spacy
sp = spacy.load('en_core_web_sm')
stopwords_en = sp.Defaults.stop_words
- Return type:
str
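The case-insensitive matching described above can be illustrated with a short sketch. `STOPWORDS` here is a tiny hypothetical subset, not the SpaCy list, and the sketch splits on whitespace only, so punctuation attached to a word is not handled.

```python
from typing import Iterable

# Hypothetical subset of stopwords for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and"})


def strip_stopwords(text: str, stopwords: Iterable[str] = STOPWORDS) -> str:
    """Drop words whose lowercase form appears in `stopwords`."""
    stopset = set(stopwords)
    return " ".join(word for word in text.split() if word.lower() not in stopset)


cleaned = strip_stopwords("The structure of the genome")
print(cleaned)
```

Lowercasing each word before the lookup is what makes “the” and “The” equivalent.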
- utils.reverse_complement(dna: str)¶
Take a nucleic acid string as input and return reverse complement.
- Parameters:
dna (str) – A string of nucleic acid sequence data.
- Returns:
Reverse complemented DNA/RNA string.
Input: ACGT
Output: ACGT
ACGT is palindromic, so it is its own reverse complement. Note that no sequence cleaning is performed; ‘N’ gets mapped to itself. Uppercase is assumed. If U is detected, the input is automatically assumed to be RNA. Supports letters YRKMSW. BDHV get converted to N.
- Return type:
str
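A translation-table approach to reverse complementing, following the rules stated above, might look like this sketch. The names `revcomp`, `DNA_TABLE` and `RNA_TABLE` are hypothetical, and the pairings used for the ambiguity codes (Y with R, K with M, S and W unchanged) are the standard IUPAC complements rather than anything confirmed from the module's source.

```python
# Complement tables: N, S and W map to themselves; B, D, H and V collapse to N.
DNA_TABLE = str.maketrans("ACGTNYRKMSWBDHV", "TGCANRYMKSWNNNN")
RNA_TABLE = str.maketrans("ACGUNYRKMSWBDHV", "UGCANRYMKSWNNNN")


def revcomp(seq: str) -> str:
    """Complement each base, then reverse the string. Uppercase input is assumed."""
    table = RNA_TABLE if "U" in seq else DNA_TABLE  # any U is taken to mean RNA
    return seq.translate(table)[::-1]


print(revcomp("AAGC"))   # DNA example
print(revcomp("AAGCU"))  # RNA example, detected from the U
```

Characters outside the tables pass through unchanged, which matches the statement that no sequence cleaning is performed.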
- utils.split_datasets(dataset: DatasetDict, outfile_dir: str, train: float, test: float = 0, val: float = 0, shuffle: bool = False)¶
Split data into training | testing | validation sets
- Parameters:
dataset (DatasetDict) – A HuggingFace DatasetDict object
outfile_dir (str) – Write the dataset files to this path
train (float) – Proportion of dataset for training
test (float) – Proportion of dataset for testing
val (float) – Proportion of dataset for validation
shuffle (bool) – Shuffle the dataset before splitting
- Returns:
Returns a datasets.DatasetDict object with corresponding train | test | valid splits. Writes files to outfile_dir.
Specifying the validation set is optional. However, note that train + test + validation proportions must sum to 1! This calls dataset_to_disk() to write files to disk. File names will match the corresponding split: train | test | valid
- Return type:
DatasetDict
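The proportion check and three-way split can be sketched on a plain list. This is an illustration only: `split_rows` and its `seed` argument are hypothetical, a list stands in for the DatasetDict, and nothing is written to disk.

```python
import math
import random
from typing import Dict, List, Optional, Sequence


def split_rows(rows: Sequence, train: float, test: float = 0, val: float = 0,
               shuffle: bool = False, seed: Optional[int] = None) -> Dict[str, List]:
    """Split rows into train | test | valid by proportion."""
    # Tolerate floating point error when checking that proportions sum to 1.
    if not math.isclose(train + test + val, 1.0):
        raise ValueError("train + test + val must sum to 1")
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_test = round(len(rows) * test)
    splits = {"train": rows[:n_train]}
    if val > 0:
        splits["test"] = rows[n_train:n_train + n_test]
        splits["valid"] = rows[n_train + n_test:]
    else:
        # Without a validation set, everything after train goes to test.
        splits["test"] = rows[n_train:]
    return splits


splits = split_rows(range(10), train=0.8, test=0.1, val=0.1)
print({name: len(part) for name, part in splits.items()})
```

The validation split is only created when `val` is greater than zero, mirroring the optional validation set described above.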