Get class attribution for deep learning models#

This page explains how to use interpret.py to compute class attributions for deep learning models trained with genomicBERT.

Source data#

Source data is the path to a trained PyTorch classifier model directory or a wandb run.

Results#

Note

Entry points are available if the package was installed using the automated conda method. You can then call the command directly, for example: create_dataset_bio. Otherwise, run the script itself, which follows the same naming pattern, for example: python create_dataset_bio.py.

Run the code as shown below:

Deep learning#

Input sequences can be provided as multiple strings and/or fasta files. If a string is provided, the output file name will be the first 16 characters of the string followed by a unique string. If a fasta file is provided, the output file name(s) will be the fasta header(s). Label names must be listed in the same order as the labels, e.g. category 1, category 2.

python interpret.py <MODEL_PATH> <INPUT_SEQS ...> [-t TOKENISER_PATH] [-o OUTPUT_DIR] [-l LABEL_NAMES ...]
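
The file-naming rules described above can be illustrated with a minimal sketch. This is not the code used by interpret.py: both helper functions are hypothetical, the random hex suffix is only an assumed stand-in for the "unique string", and using the whole header line (rather than its first word) as the name is also an assumption.

```python
import uuid


def output_name_for_string(seq: str, prefix_len: int = 16) -> str:
    """Hypothetical helper: derive an output name from a raw input sequence."""
    # First 16 characters of the sequence, as described in the text.
    prefix = seq[:prefix_len]
    # Assumption: a short random hex suffix plays the role of the unique string.
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}_{suffix}"


def output_name_for_fasta(header_line: str) -> str:
    """Hypothetical helper: derive an output name from a fasta header line."""
    # Drop the leading '>' marker and any surrounding whitespace.
    return header_line.lstrip(">").strip()


string_name = output_name_for_string("ACGTACGTACGTACGTACGTTTGA")
fasta_name = output_name_for_fasta(">seq1")
print(string_name)
print(fasta_name)
```

Because the suffix is random, two runs on the same string produce different names, which avoids overwriting earlier output for identical sequences.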

Notes#

More information on transformers interpretability is available here.

Usage#

genomicBERT: Deep learning#

Sequences to test for class attribution can be provided directly or as fasta files.
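
As a rough illustration of how a mixed list of raw sequences and fasta files might be expanded into individual records, here is a minimal sketch. The helpers read_fasta and collect_sequences are hypothetical and not part of interpret.py. Treating an argument as a fasta file whenever it is an existing file path is an assumption about the behaviour, and the demo file contents are made up.

```python
import os
import tempfile


def read_fasta(path):
    """Minimal fasta reader returning {header: sequence}."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; the header is the text after '>'.
                header = line[1:].strip()
                records[header] = ""
            elif header is not None:
                # Sequence lines may be wrapped, so concatenate them.
                records[header] += line
    return records


def collect_sequences(inputs):
    """Hypothetical helper: expand raw sequences and fasta paths into records."""
    sequences = {}
    for item in inputs:
        if os.path.isfile(item):
            # Assumption: any existing file path is treated as a fasta file.
            sequences.update(read_fasta(item))
        else:
            # Raw sequence given directly; keyed by itself in this sketch.
            sequences[item] = item
    return sequences


# Demo with a temporary fasta file (made-up contents).
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as fh:
    fh.write(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
    demo_path = fh.name

result = collect_sequences(["AAAA", demo_path])
print(result)
os.remove(demo_path)
```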

python interpret.py -h
usage: interpret.py [-h] [-t TOKENISER_PATH] [-o OUTPUT_DIR] [-l LABEL_NAMES [LABEL_NAMES ...]]
                    model_path input_seqs [input_seqs ...]

Take complete classifier and calculate feature attributions.

positional arguments:
    model_path            path to local model directory OR wandb run
    input_seqs            input sequence(s) directly and/or fasta files

optional arguments:
    -h, --help            show this help message and exit
    -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                            path to tokeniser.json file to load data from
    -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                            specify path for output (DEFAULT: ./interpret_out)
    -l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
                            provide label names matching order (DEFAULT: None).
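
For readers who want to see how this interface maps onto argparse, the following is a minimal sketch that reproduces the arguments listed in the help output above. It is not the parser from interpret.py. The None default for the tokeniser path is an assumption (the help text does not state one), and the example values (model_dir, seqs.fasta, out, neg, pos) are placeholders.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the documented interpret.py interface."""
    parser = argparse.ArgumentParser(
        prog="interpret.py",
        description="Take complete classifier and calculate feature attributions.",
    )
    parser.add_argument("model_path", type=str,
                        help="path to local model directory OR wandb run")
    parser.add_argument("input_seqs", type=str, nargs="+",
                        help="input sequence(s) directly and/or fasta files")
    # Assumption: no default is documented for the tokeniser path, so use None.
    parser.add_argument("-t", "--tokeniser_path", type=str, default=None,
                        help="path to tokeniser.json file to load data from")
    parser.add_argument("-o", "--output_dir", type=str, default="./interpret_out",
                        help="specify path for output (DEFAULT: ./interpret_out)")
    parser.add_argument("-l", "--label_names", type=str, nargs="+", default=None,
                        help="provide label names matching order (DEFAULT: None).")
    return parser


# Placeholder arguments: one raw sequence, one fasta file, two label names.
args = build_parser().parse_args(
    ["model_dir", "ACGTACGT", "seqs.fasta", "-o", "out", "-l", "neg", "pos"]
)
print(args)
```

Placing the positional arguments before the flags, as in this example, keeps the variable-length input_seqs and label_names lists from being confused with each other.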