Perform a hyperparameter sweep#

This page explains how to run a hyperparameter sweep for machine learning and deep learning through genomicBERT. If you already know which hyperparameters you need, you can skip the sweep and train directly. For conventional machine learning, the sweep, training and cross-validation steps are combined in one operation.

Source data#

Source data is a HuggingFace dataset object stored as a csv, json or parquet file. Specify --format accordingly. Only csv is supported for the non-deep-learning approaches.
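As a small convenience, the value passed to --format can be derived from the file name. The sketch below is illustrative only: the helper name infer_format is hypothetical and not part of genomicBERT, and it assumes that a .gz suffix wraps one of the three listed formats.

```python
from pathlib import Path

# Values accepted by --format, as listed on this page.
SUPPORTED_FORMATS = ("csv", "json", "parquet")


def infer_format(path: str) -> str:
    """Guess the --format value from a file name, ignoring a trailing .gz."""
    suffixes = [s.lstrip(".").lower() for s in Path(path).suffixes]
    if suffixes and suffixes[-1] == "gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in SUPPORTED_FORMATS:
        raise ValueError(f"cannot infer format from {path!r}")
    return suffixes[-1]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the mapping.
    for name in ("train.csv", "train.csv.gz", "train.json", "train.parquet"):
        print(name, "->", infer_format(name))
```

An unrecognised extension raises an error so that a wrong --format value is not passed on silently.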



Entry points are available if the package is installed using the automated conda method. You can then call each command directly, for example: create_dataset_bio. Otherwise, run the corresponding script, which follows the same naming pattern, for example: python

Run the code as shown below:

Deep learning#

python <TRAIN_DATA> <FORMAT> <TOKENISER_PATH> --test TEST_DATA --valid VALIDATION_DATA --hyperparameter_sweep PARAMS.JSON --entity_name WANDB_ENTITY_NAME --project_name WANDB_PROJECT_NAME --group_name WANDB_GROUP_NAME --sweep_count N --metric_opt [ eval/accuracy | eval/validation | eval/loss | eval/precision | eval/recall ] --output_dir OUTPUT_DIR
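When sweeps are launched from a script or notebook, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of genomicBERT. SWEEP_SCRIPT is a placeholder because the script name is not given in this section, and the file names are hypothetical. Only flags that appear in the command above are used.

```python
import shlex

# Placeholder: replace with the real sweep script or entry point name.
SWEEP_SCRIPT = "SWEEP_SCRIPT.py"


def build_sweep_command(train, fmt, tokeniser, **options):
    """Return the argument list for one deep learning sweep invocation.

    Keyword names are turned into long flags (test -> --test).
    Options set to None are skipped.
    """
    cmd = ["python", SWEEP_SCRIPT, train, fmt, tokeniser]
    for flag, value in options.items():
        if value is None:
            continue
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_sweep_command(
        "train.csv", "csv", "tokeniser.json",   # hypothetical paths
        test="test.csv",
        valid="valid.csv",
        hyperparameter_sweep="hyperparameter.json",
        sweep_count=8,
        metric_opt="eval/accuracy",
        output_dir="sweep_out",
    )
    # Print a shell-safe version for inspection.
    print(shlex.join(cmd))
    # To launch the run, pass cmd to subprocess.run(cmd, check=True).
```

Building a list rather than a single string avoids quoting problems when paths contain spaces.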

Frequency-based approaches#

python -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR


The original documentation to specify training arguments is available here.


genomicBERT: Deep learning#

Sweep parameters and search space should be passed in as a json file.

Example hyperparameter.json file
{
  "name" : "random",
  "method" : "random",
  "metric": {
    "name": "eval/f1",
    "goal": "maximize"
  },
  "parameters" : {
    "epochs" : {
      "values" : [1, 2, 3]
    },
    "batch_size": {
      "values": [8, 16, 32, 64]
    },
    "learning_rate" :{
      "distribution": "log_uniform_values",
      "min": 0.0001,
      "max": 0.1
    },
    "weight_decay": {
      "values": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    }
  },
  "early_terminate": {
    "type": "hyperband",
    "s": 2,
    "eta": 3,
    "max_iter": 27
  }
}

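To make the search space above more concrete, the sketch below rebuilds the same configuration as a Python dictionary, serialises it to JSON, and draws one trial the way a random search might. It is an illustration under stated assumptions, not the sampler used by wandb or genomicBERT. It assumes that log_uniform_values means sampling uniformly in log space between min and max, and that the hyperband brackets sit at max_iter / eta**k for k = 1..s. The helper names sample_trial and hyperband_brackets are hypothetical.

```python
import json
import math
import random

sweep_config = {
    "name": "random",
    "method": "random",
    "metric": {"name": "eval/f1", "goal": "maximize"},
    "parameters": {
        "epochs": {"values": [1, 2, 3]},
        "batch_size": {"values": [8, 16, 32, 64]},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 0.0001,
            "max": 0.1,
        },
        "weight_decay": {"values": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]},
    },
    "early_terminate": {"type": "hyperband", "s": 2, "eta": 3, "max_iter": 27},
}


def sample_trial(parameters, rng):
    """Draw one configuration from the search space (illustrative only)."""
    trial = {}
    for name, spec in parameters.items():
        if "values" in spec:
            # Discrete choice among the listed values.
            trial[name] = rng.choice(spec["values"])
        elif spec.get("distribution") == "log_uniform_values":
            # Assumption: uniform in log space between min and max.
            low, high = math.log(spec["min"]), math.log(spec["max"])
            trial[name] = math.exp(rng.uniform(low, high))
        else:
            raise ValueError(f"unsupported parameter spec for {name!r}")
    return trial


def hyperband_brackets(max_iter, eta, s):
    """Assumed checkpoints for early termination: max_iter // eta**k, k = 1..s."""
    return sorted(max_iter // eta**k for k in range(1, s + 1))


if __name__ == "__main__":
    # Serialise and parse back to confirm the config is valid JSON.
    text = json.dumps(sweep_config, indent=2)
    assert json.loads(text) == sweep_config
    # Save `text` to a file such as hyperparameter.json to use it in a sweep.
    print(sample_trial(sweep_config["parameters"], random.Random(0)))
    early = sweep_config["early_terminate"]
    print(hyperband_brackets(early["max_iter"], early["eta"], early["s"]))
```

Sampling the learning rate in log space spreads trials evenly across orders of magnitude (0.0001 to 0.1) instead of concentrating them near the upper bound.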
usage: [-h] [-t TEST] [-v VALID] [-m MODEL]
                [--model_features MODEL_FEATURES] [-o OUTPUT_DIR] [-d DEVICE]
                [-s VOCAB_SIZE] [-w HYPERPARAMETER_SWEEP]
                [-l LABEL_NAMES [LABEL_NAMES ...]] [-n SWEEP_COUNT]
                [-e ENTITY_NAME] [-p PROJECT_NAME] [-g GROUP_NAME]
                [-c METRIC_OPT] [-r RESUME_SWEEP] [--fp16_off] [--wandb_off]
                train format tokeniser_path

Take HuggingFace dataset and perform parameter sweeping.

positional arguments:
  train                 path to [ csv | csv.gz | json | parquet ] file
  format                specify input file type [ csv | json | parquet ]
  tokeniser_path        path to tokeniser.json file to load data from

  -h, --help            show this help message and exit
  -t TEST, --test TEST  path to [ csv | csv.gz | json | parquet ] file
  -v VALID, --valid VALID
                        path to [ csv | csv.gz | json | parquet ] file
  -m MODEL, --model MODEL
                        choose model [ distilbert | longformer ] distilbert
                        handles shorter sequences up to 512 tokens longformer
                        handles longer sequences up to 4096 tokens (DEFAULT:
  --model_features MODEL_FEATURES
                        number of features in data to use (DEFAULT: ALL)
                        NOTE: this is separate from the vocab_size argument.
                        under normal circumstances (eg a tokeniser generated
                        by tokenise_bio), setting this is not necessary
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./sweep_out)
  -d DEVICE, --device DEVICE
                        choose device [ cpu | cuda:0 ] (DEFAULT: detect)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
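  -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                        run a hyperparameter sweep with config from file
  -l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
                        provide column with label names (DEFAULT: "").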
  -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                        run n hyperparameter sweeps (DEFAULT: 64)
  -e ENTITY_NAME, --entity_name ENTITY_NAME
                        provide wandb team name (if available).
  -p PROJECT_NAME, --project_name PROJECT_NAME
                        provide wandb project name (if available).
  -g GROUP_NAME, --group_name GROUP_NAME
                        provide wandb group name (if desired).
  -c METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ eval/accuracy | eval/validation |
                        eval/loss | eval/precision | eval/recall ] (DEFAULT:
  -r RESUME_SWEEP, --resume_sweep RESUME_SWEEP
                        provide sweep id to resume sweep.
  --fp16_off            turn fp16 off for precision / cpu (DEFAULT: ON)
  --wandb_off           run hyperparameter tuning using the wandb api and log
                        training in real time online (DEFAULT: ON)

Frequency based approach#

python -h
usage: [-h] [--infile_path INFILE_PATH [INFILE_PATH ...]]
                        [--format FORMAT] [--embeddings EMBEDDINGS]
                        [--chunk_size CHUNK_SIZE] [-t TOKENISER_PATH]
                        [-f FREQ_METHOD] [--column_names COLUMN_NAMES]
                        [--column_name COLUMN_NAME] [-m MODEL]
                        [-e MODEL_FEATURES] [-k KFOLDS]
                        [--ngram_from NGRAM_FROM] [--ngram_to NGRAM_TO]
                        [--split_train SPLIT_TRAIN] [--split_test SPLIT_TEST]
                        [--split_val SPLIT_VAL] [-o OUTPUT_DIR]
                        [-s VOCAB_SIZE]
                        [--special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                        [-w HYPERPARAMETER_SWEEP]
                        [--sweep_method SWEEP_METHOD] [-n SWEEP_COUNT]
                        [-c METRIC_OPT] [-j NJOBS] [-d PRE_DISPATCH]

Take HuggingFace dataset and perform parameter sweeping.

  -h, --help            show this help message and exit
  --infile_path INFILE_PATH [INFILE_PATH ...]
                        path to [ csv | csv.gz | json | parquet ] file
  --format FORMAT       specify input file type [ csv | json | parquet ]
  --embeddings EMBEDDINGS
                        path to embeddings model file
  --chunk_size CHUNK_SIZE
                        iterate over input file for these many rows
  -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                        path to tokeniser.json file to load data from
  -f FREQ_METHOD, --freq_method FREQ_METHOD
                        choose dist [ cvec | tfidf ] (DEFAULT: tfidf)
  --column_names COLUMN_NAMES
                        column name for sp tokenised data (DEFAULT:
  --column_name COLUMN_NAME
                        column name for extracting embeddings (DEFAULT:
  -m MODEL, --model MODEL
                        choose model [ rf | xg ] (DEFAULT: rf)
  -e MODEL_FEATURES, --model_features MODEL_FEATURES
                        number of features in data to use (DEFAULT: ALL)
  -k KFOLDS, --kfolds KFOLDS
                        number of cross validation folds (DEFAULT: 8)
  --ngram_from NGRAM_FROM
                        ngram slice starting index (DEFAULT: 1)
  --ngram_to NGRAM_TO   ngram slice ending index (DEFAULT: 1)
  --split_train SPLIT_TRAIN
                        proportion of training data (DEFAULT: 0.90)
  --split_test SPLIT_TEST
                        proportion of testing data (DEFAULT: 0.05)
  --split_val SPLIT_VAL
                        proportion of validation data (DEFAULT: 0.05)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./results_out)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
  --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                        assign special tokens, eg space and pad tokens
                        (DEFAULT: ["<s>", "</s>", "<unk>", "<pad>",
  -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                        run a hyperparameter sweep with config from file
  --sweep_method SWEEP_METHOD
                        specify sweep search strategy [ bayes | grid | random
                        ] (DEFAULT: random)
  -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                        run n hyperparameter sweeps (DEFAULT: 8)
  -c METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ accuracy | f1 | precision |
                        recall ] (DEFAULT: f1)
  -j NJOBS, --njobs NJOBS
                        run on n threads (DEFAULT: -1)
  -d PRE_DISPATCH, --pre_dispatch PRE_DISPATCH
                        specify dispatched jobs (DEFAULT: 0.5*n_jobs)
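
The two --freq_method choices differ in how tokens are turned into features. The sketch below shows the general idea in plain Python: cvec corresponds to raw n-gram counts, and tfidf reweights those counts by inverse document frequency. It uses the textbook formula idf = log(N / df) and is not the pipeline's actual implementation, which may apply smoothing and normalisation. It also assumes that --ngram_from and --ngram_to define an inclusive n-gram range. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_from=1, n_to=1):
    """All contiguous n-grams with n_from <= n <= n_to, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_from, n_to + 1)
        for i in range(len(tokens) - n + 1)
    ]


def count_vectors(docs, n_from=1, n_to=1):
    """Raw term counts per document (the idea behind cvec)."""
    return [Counter(ngrams(doc, n_from, n_to)) for doc in docs]


def tfidf_vectors(docs, n_from=1, n_to=1):
    """Counts reweighted by inverse document frequency (the idea behind tfidf)."""
    counts = count_vectors(docs, n_from, n_to)
    n_docs = len(counts)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]


if __name__ == "__main__":
    # Hypothetical tokenised sequences.
    docs = [["ACG", "TTA", "ACG"], ["ACG", "GGC"]]
    print(count_vectors(docs))
    print(tfidf_vectors(docs))
```

A token that occurs in every sequence receives a tfidf weight of zero in this variant, which is why tfidf tends to emphasise tokens that discriminate between sequences.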

Embedding based approach#

python -h
usage: [-h]
                             [--infile_path INFILE_PATH [INFILE_PATH ...]]
                             [--format FORMAT] [--embeddings EMBEDDINGS]
                             [--chunk_size CHUNK_SIZE] [-t TOKENISER_PATH]
                             [-f FREQ_METHOD] [--column_names COLUMN_NAMES]
                             [--column_name COLUMN_NAME] [-m MODEL]
                             [-e MODEL_FEATURES] [-k KFOLDS]
                             [--ngram_from NGRAM_FROM] [--ngram_to NGRAM_TO]
                             [--split_train SPLIT_TRAIN]
                             [--split_test SPLIT_TEST]
                             [--split_val SPLIT_VAL] [-o OUTPUT_DIR]
                             [-s VOCAB_SIZE]
                             [--special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                             [-w HYPERPARAMETER_SWEEP]
                             [--sweep_method SWEEP_METHOD] [-n SWEEP_COUNT]
                             [-c METRIC_OPT] [-j NJOBS] [-d PRE_DISPATCH]

Take HuggingFace dataset and perform parameter sweeping.

  -h, --help            show this help message and exit
  --infile_path INFILE_PATH [INFILE_PATH ...]
                        path to [ csv | csv.gz | json | parquet ] file
  --format FORMAT       specify input file type [ csv | json | parquet ]
  --embeddings EMBEDDINGS
                        path to embeddings model file
  --chunk_size CHUNK_SIZE
                        iterate over input file for these many rows
  -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                        path to tokeniser.json file to load data from
  -f FREQ_METHOD, --freq_method FREQ_METHOD
                        choose dist [ embed ] (DEFAULT: embed)
  --column_names COLUMN_NAMES
                        column name for sp tokenised data (DEFAULT:
  --column_name COLUMN_NAME
                        column name for extracting embeddings (DEFAULT:
  -m MODEL, --model MODEL
                        choose model [ rf | xg ] (DEFAULT: rf)
  -e MODEL_FEATURES, --model_features MODEL_FEATURES
                        number of features in data to use (DEFAULT: ALL)
  -k KFOLDS, --kfolds KFOLDS
                        number of cross validation folds (DEFAULT: 8)
  --ngram_from NGRAM_FROM
                        ngram slice starting index (DEFAULT: 1)
  --ngram_to NGRAM_TO   ngram slice ending index (DEFAULT: 1)
  --split_train SPLIT_TRAIN
                        proportion of training data (DEFAULT: 0.90)
  --split_test SPLIT_TEST
                        proportion of testing data (DEFAULT: 0.05)
  --split_val SPLIT_VAL
                        proportion of validation data (DEFAULT: 0.05)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./results_out)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
  --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                        assign special tokens, eg space and pad tokens
                        (DEFAULT: ["<s>", "</s>", "<unk>", "<pad>",
  -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                        run a hyperparameter sweep with config from file
  --sweep_method SWEEP_METHOD
                        specify sweep search strategy [ bayes | grid | random
                        ] (DEFAULT: random)
  -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                        run n hyperparameter sweeps (DEFAULT: 8)
  -c METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ accuracy | f1 | precision |
                        recall ] (DEFAULT: f1)
  -j NJOBS, --njobs NJOBS
                        run on n threads (DEFAULT: -1)
  -d PRE_DISPATCH, --pre_dispatch PRE_DISPATCH
                        specify dispatched jobs (DEFAULT: 0.5*n_jobs)
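
The --split_train, --split_test and --split_val proportions and the --kfolds count listed above can be pictured as simple index arithmetic. The sketch below is a plain-Python illustration using the documented defaults (0.90, 0.05, 0.05 and 8 folds). How the pipeline itself partitions and shuffles rows is not described here, so the function names and the rounding rule are assumptions.

```python
def split_sizes(n_rows, train=0.90, test=0.05, val=0.05):
    """Row counts per partition; rows left over from rounding go to train."""
    if abs(train + test + val - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")
    n_test = int(n_rows * test)
    n_val = int(n_rows * val)
    return {"train": n_rows - n_test - n_val, "test": n_test, "val": n_val}


def kfold_indices(n_rows, k=8):
    """Yield (train_idx, held_out_idx) pairs for k contiguous folds."""
    # The first n_rows % k folds get one extra row so every row is used once.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, held_out
        start += size


if __name__ == "__main__":
    print(split_sizes(1000))
    for train_idx, held_out_idx in kfold_indices(20, k=8):
        print(len(train_idx), held_out_idx)
```

Each row is held out exactly once across the k folds, so every sweep configuration is scored on all of the training data.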