Perform a hyperparameter sweep#

This section explains how to use sweep.py to run a hyperparameter sweep for machine learning and for deep learning through genomicBERT. If you already know which hyperparameters you need, use train_model.py directly. For conventional machine learning, the sweep, training and cross-validation steps are combined into a single operation.

Source data#

Source data is a HuggingFace dataset object stored as a csv, json or parquet file. Specify --format accordingly. Note that the non-deep learning pipelines accept csv input only.

Results#

Note

Entry points are available if the package was installed using the automated conda method. You can then run the command directly on the command line, for example: create_dataset_bio. If not, you will need to call the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
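
For example, assuming these scripts accept the usual --help flag, the two invocation styles look like this:

create_dataset_bio --help
python create_dataset_bio.py --help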

Run the code as shown below; results are written to the path given with --output_dir.

Deep learning#

python sweep.py <TRAIN_DATA> <FORMAT> <TOKENISER_PATH> --test TEST_DATA --valid VALIDATION_DATA --hyperparameter_sweep PARAMS.JSON --entity_name WANDB_ENTITY_NAME --project_name WANDB_PROJECT_NAME --group_name WANDB_GROUP_NAME --sweep_count N --metric_opt [ eval/accuracy | eval/validation | eval/loss | eval/precision | eval/recall ] --output_dir OUTPUT_DIR
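
A filled-in example might look like the following; all file paths and wandb names here are placeholders:

python sweep.py train.csv csv tokeniser.json \
  --test test.csv --valid valid.csv \
  --hyperparameter_sweep hyperparameter.json \
  --entity_name my_team --project_name my_project --group_name my_sweep \
  --sweep_count 8 --metric_opt eval/f1 --output_dir sweep_out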

Frequency-based approaches#

python freq_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
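
For example, with placeholder paths, a random forest sweep on TF-IDF features:

python freq_pipeline.py -i train.csv --format "csv" -t tokeniser.json \
  --freq_method tfidf --model rf --kfolds 8 --sweep_count 8 \
  --metric_opt f1 --output_dir results_out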

Embedding#

python embedding_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method embed --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
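
For example, with placeholder paths, an XGBoost sweep on embedded features:

python embedding_pipeline.py -i train.csv --format "csv" -t tokeniser.json \
  --freq_method embed --model xg --kfolds 8 --sweep_count 8 \
  --metric_opt f1 --output_dir results_out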

Notes#

The original documentation for specifying training arguments is available here.

Usage#

genomicBERT: Deep learning#

Sweep parameters and the search space should be passed in as a JSON file, following the wandb sweep configuration format.

Example hyperparameter.json file
{
  "name": "random",
  "method": "random",
  "metric": {
    "name": "eval/f1",
    "goal": "maximize"
  },
  "parameters": {
    "epochs": {
      "values": [1, 2, 3]
    },
    "batch_size": {
      "values": [8, 16, 32, 64]
    },
    "learning_rate": {
      "distribution": "log_uniform_values",
      "min": 0.0001,
      "max": 0.1
    },
    "weight_decay": {
      "values": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    }
  },
  "early_terminate": {
    "type": "hyperband",
    "s": 2,
    "eta": 3,
    "max_iter": 27
  }
}
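
This file is passed to sweep.py with -w / --hyperparameter_sweep (see the usage below). If a sweep is interrupted, the -r / --resume_sweep option takes the wandb sweep id reported at launch; the paths and SWEEP_ID below are placeholders:

python sweep.py train.csv csv tokeniser.json -w hyperparameter.json -o sweep_out
python sweep.py train.csv csv tokeniser.json -w hyperparameter.json -r SWEEP_ID -o sweep_out
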
usage: sweep.py [-h] [-t TEST] [-v VALID] [-m MODEL]
                [--model_features MODEL_FEATURES] [-o OUTPUT_DIR] [-d DEVICE]
                [-s VOCAB_SIZE] [-w HYPERPARAMETER_SWEEP]
                [-l LABEL_NAMES [LABEL_NAMES ...]] [-n SWEEP_COUNT]
                [-e ENTITY_NAME] [-p PROJECT_NAME] [-g GROUP_NAME]
                [-c METRIC_OPT] [-r RESUME_SWEEP] [--fp16_off] [--wandb_off]
                train format tokeniser_path

Take HuggingFace dataset and perform parameter sweeping.

positional arguments:
  train                 path to [ csv | csv.gz | json | parquet ] file
  format                specify input file type [ csv | json | parquet ]
  tokeniser_path        path to tokeniser.json file to load data from

options:
  -h, --help            show this help message and exit
  -t TEST, --test TEST  path to [ csv | csv.gz | json | parquet ] file
  -v VALID, --valid VALID
                        path to [ csv | csv.gz | json | parquet ] file
  -m MODEL, --model MODEL
                        choose model [ distilbert | longformer ] distilbert
                        handles shorter sequences up to 512 tokens longformer
                        handles longer sequences up to 4096 tokens (DEFAULT:
                        distilbert)
  --model_features MODEL_FEATURES
                        number of features in data to use (DEFAULT: ALL)
                        NOTE: this is separate from the vocab_size argument.
                        under normal circumstances (eg a tokeniser generated
                        by tokenise_bio), setting this is not necessary
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./sweep_out)
  -d DEVICE, --device DEVICE
                        choose device [ cpu | cuda:0 ] (DEFAULT: detect)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
  -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                        run a hyperparameter sweep with config from file
  -l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
                        provide column with label names (DEFAULT: "").
  -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                        run n hyperparameter sweeps (DEFAULT: 64)
  -e ENTITY_NAME, --entity_name ENTITY_NAME
                        provide wandb team name (if available).
  -p PROJECT_NAME, --project_name PROJECT_NAME
                        provide wandb project name (if available).
  -g GROUP_NAME, --group_name GROUP_NAME
                        provide wandb group name (if desired).
  -c METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ eval/accuracy | eval/validation |
                        eval/loss | eval/precision | eval/recall ] (DEFAULT:
                        eval/f1)
  -r RESUME_SWEEP, --resume_sweep RESUME_SWEEP
                        provide sweep id to resume sweep.
  --fp16_off            turn fp16 off for precision / cpu (DEFAULT: ON)
  --wandb_off           run hyperparameter tuning using the wandb api and log
                        training in real time online (DEFAULT: ON)

Frequency-based approach#

python freq_pipeline.py -h
usage: freq_pipeline.py [-h] [--infile_path INFILE_PATH [INFILE_PATH ...]]
                        [--format FORMAT] [--embeddings EMBEDDINGS]
                        [--chunk_size CHUNK_SIZE] [-t TOKENISER_PATH]
                        [-f FREQ_METHOD] [--column_names COLUMN_NAMES]
                        [--column_name COLUMN_NAME] [-m MODEL]
                        [-e MODEL_FEATURES] [-k KFOLDS]
                        [--ngram_from NGRAM_FROM] [--ngram_to NGRAM_TO]
                        [--split_train SPLIT_TRAIN] [--split_test SPLIT_TEST]
                        [--split_val SPLIT_VAL] [-o OUTPUT_DIR]
                        [-s VOCAB_SIZE]
                        [--special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                        [-w HYPERPARAMETER_SWEEP]
                        [--sweep_method SWEEP_METHOD] [-n SWEEP_COUNT]
                        [-c METRIC_OPT] [-j NJOBS] [-d PRE_DISPATCH]

Take HuggingFace dataset and perform parameter sweeping.

options:
  -h, --help            show this help message and exit
  --infile_path INFILE_PATH [INFILE_PATH ...]
                        path to [ csv | csv.gz | json | parquet ] file
  --format FORMAT       specify input file type [ csv | json | parquet ]
  --embeddings EMBEDDINGS
                        path to embeddings model file
  --chunk_size CHUNK_SIZE
                        iterate over input file for these many rows
  -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                        path to tokeniser.json file to load data from
  -f FREQ_METHOD, --freq_method FREQ_METHOD
                        choose dist [ cvec | tfidf ] (DEFAULT: tfidf)
  --column_names COLUMN_NAMES
                        column name for sp tokenised data (DEFAULT:
                        input_str)
  --column_name COLUMN_NAME
                        column name for extracting embeddings (DEFAULT:
                        input_str)
  -m MODEL, --model MODEL
                        choose model [ rf | xg ] (DEFAULT: rf)
  -e MODEL_FEATURES, --model_features MODEL_FEATURES
                        number of features in data to use (DEFAULT: ALL)
  -k KFOLDS, --kfolds KFOLDS
                        number of cross validation folds (DEFAULT: 8)
  --ngram_from NGRAM_FROM
                        ngram slice starting index (DEFAULT: 1)
  --ngram_to NGRAM_TO   ngram slice ending index (DEFAULT: 1)
  --split_train SPLIT_TRAIN
                        proportion of training data (DEFAULT: 0.90)
  --split_test SPLIT_TEST
                        proportion of testing data (DEFAULT: 0.05)
  --split_val SPLIT_VAL
                        proportion of validation data (DEFAULT: 0.05)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./results_out)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
  --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                        assign special tokens, eg space and pad tokens
                        (DEFAULT: ["<s>", "</s>", "<unk>", "<pad>",
                        "<mask>"])
  -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                        run a hyperparameter sweep with config from file
  --sweep_method SWEEP_METHOD
                        specify sweep search strategy [ bayes | grid | random
                        ] (DEFAULT: random)
  -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                        run n hyperparameter sweeps (DEFAULT: 8)
  -c METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ accuracy | f1 | precision |
                        recall ] (DEFAULT: f1)
  -j NJOBS, --njobs NJOBS
                        run on n threads (DEFAULT: -1)
  -d PRE_DISPATCH, --pre_dispatch PRE_DISPATCH
                        specify dispatched jobs (DEFAULT: 0.5*n_jobs)
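
To explore n-gram settings or custom data splits, the options above can be combined; for instance (placeholder paths), a count-vectoriser sweep over unigrams and bigrams with an 80/10/10 split:

python freq_pipeline.py --infile_path train.csv --format csv -t tokeniser.json \
  --freq_method cvec --ngram_from 1 --ngram_to 2 \
  --split_train 0.80 --split_test 0.10 --split_val 0.10 \
  --model xg --output_dir results_out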

Embedding-based approach#

python embedding_pipeline.py -h
usage: embedding_pipeline.py [-h]
                             [--infile_path INFILE_PATH [INFILE_PATH ...]]
                             [--format FORMAT] [--embeddings EMBEDDINGS]
                             [--chunk_size CHUNK_SIZE] [-t TOKENISER_PATH]
                             [-f FREQ_METHOD] [--column_names COLUMN_NAMES]
                             [--column_name COLUMN_NAME] [-m MODEL]
                             [-e MODEL_FEATURES] [-k KFOLDS]
                             [--ngram_from NGRAM_FROM] [--ngram_to NGRAM_TO]
                             [--split_train SPLIT_TRAIN]
                             [--split_test SPLIT_TEST]
                             [--split_val SPLIT_VAL] [-o OUTPUT_DIR]
                             [-s VOCAB_SIZE]
                             [--special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                             [-w HYPERPARAMETER_SWEEP]
                             [--sweep_method SWEEP_METHOD] [-n SWEEP_COUNT]
                             [-c METRIC_OPT] [-j NJOBS] [-d PRE_DISPATCH]

Take HuggingFace dataset and perform parameter sweeping.

options:
  -h, --help            show this help message and exit
  --infile_path INFILE_PATH [INFILE_PATH ...]
                        path to [ csv | csv.gz | json | parquet ] file
  --format FORMAT       specify input file type [ csv | json | parquet ]
  --embeddings EMBEDDINGS
                        path to embeddings model file
  --chunk_size CHUNK_SIZE
                        iterate over input file for these many rows
  -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                        path to tokeniser.json file to load data from
  -f FREQ_METHOD, --freq_method FREQ_METHOD
                        choose dist [ embed ] (DEFAULT: embed)
  --column_names COLUMN_NAMES
                        column name for sp tokenised data (DEFAULT:
                        input_str)
  --column_name COLUMN_NAME
                        column name for extracting embeddings (DEFAULT:
                        input_str)
  -m MODEL, --model MODEL
                        choose model [ rf | xg ] (DEFAULT: rf)
  -e MODEL_FEATURES, --model_features MODEL_FEATURES
                        number of features in data to use (DEFAULT: ALL)
  -k KFOLDS, --kfolds KFOLDS
                        number of cross validation folds (DEFAULT: 8)
  --ngram_from NGRAM_FROM
                        ngram slice starting index (DEFAULT: 1)
  --ngram_to NGRAM_TO   ngram slice ending index (DEFAULT: 1)
  --split_train SPLIT_TRAIN
                        proportion of training data (DEFAULT: 0.90)
  --split_test SPLIT_TEST
                        proportion of testing data (DEFAULT: 0.05)
  --split_val SPLIT_VAL
                        proportion of validation data (DEFAULT: 0.05)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./results_out)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
  --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                        assign special tokens, eg space and pad tokens
                        (DEFAULT: ["<s>", "</s>", "<unk>", "<pad>",
                        "<mask>"])
  -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                        run a hyperparameter sweep with config from file
  --sweep_method SWEEP_METHOD
                        specify sweep search strategy [ bayes | grid | random
                        ] (DEFAULT: random)
  -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                        run n hyperparameter sweeps (DEFAULT: 8)
  -c METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ accuracy | f1 | precision |
                        recall ] (DEFAULT: f1)
  -j NJOBS, --njobs NJOBS
                        run on n threads (DEFAULT: -1)
  -d PRE_DISPATCH, --pre_dispatch PRE_DISPATCH
                        specify dispatched jobs (DEFAULT: 0.5*n_jobs)
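
If a pre-trained embeddings model file is already available, the help above indicates it can be supplied with --embeddings; a sketch with placeholder paths:

python embedding_pipeline.py --infile_path train.csv --format csv -t tokeniser.json \
  --embeddings embeddings.model --model rf --kfolds 8 --output_dir results_out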