Perform cross-validation

This page explains how to use cross_validate.py for deep learning with the genomicBERT pipeline. For conventional machine learning, the sweep, train and cross-validation steps are combined into a single operation.

Source data

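Source data is a HuggingFace dataset object stored as a csv, json or parquet file. Specify the format accordingly: as the positional format argument for cross_validate.py, or with --format for the frequency-based and embedding pipelines. Only csv is supported for the non-deep-learning pipelines.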
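As a minimal sketch of choosing the format value from a file name, the helper below maps an extension to one of the three format strings named above. The function infer_format and the SUPPORTED_FORMATS table are hypothetical and not part of genomicBERT; the handling of csv.gz follows the input types listed in the help text under Usage.

```python
from pathlib import Path

# Formats named in this section; the mapping itself is an illustrative assumption.
SUPPORTED_FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def infer_format(path: str) -> str:
    """Return the format string for a data file, based on its extension."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]  # look at the extension underneath .gz
    if not suffixes or suffixes[-1] not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported input file: {path}")
    fmt = SUPPORTED_FORMATS[suffixes[-1]]
    if gzipped and fmt != "csv":
        # Only csv.gz is listed among the accepted compressed inputs.
        raise ValueError(f"unsupported input file: {path}")
    return fmt


print(infer_format("train.csv.gz"))
```

The returned string can then be passed as the format argument on the command line.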

Results

Note

Entry points are available if the package is installed using the automated conda method. You can then call each command directly, for example: create_dataset_bio. Otherwise, run the corresponding script, which follows the same naming pattern, for example: python create_dataset_bio.py.

Run the code as shown below.

Deep learning

Specify the same data, wandb project and entity names as used for sweeping or training, and choose a group name that identifies the cross-validation runs. Once you have identified the best run, pass its run id to --config_from_run to load that run's config from wandb automatically.

# use the WANDB_ENTITY_NAME, WANDB_PROJECT_NAME and the best run id corresponding to the sweep
# WANDB_GROUP_NAME should be changed to reflect the new category of runs (e.g. "cval")
python cross_validate.py TRAIN_DATA FORMAT --test TEST_DATA --valid VALIDATION_DATA --entity_name WANDB_ENTITY_NAME --project_name WANDB_PROJECT_NAME --group_name WANDB_GROUP_NAME --kfolds N --config_from_run WANDB_RUN_ID --output_dir OUTPUT_DIR
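When the command is launched from a script rather than typed by hand, it can help to assemble the argument list programmatically. The sketch below does that using only the flags shown in the command above; the function build_cross_validate_cmd and every value passed to it (file names, team, project, group and run id) are hypothetical placeholders.

```python
import shlex


def build_cross_validate_cmd(train, fmt, test, valid, entity, project,
                             group, kfolds, run_id, output_dir):
    """Assemble the cross_validate.py call as an argument list."""
    return [
        "python", "cross_validate.py", train, fmt,
        "--test", test,
        "--valid", valid,
        "--entity_name", entity,
        "--project_name", project,
        "--group_name", group,
        "--kfolds", str(kfolds),
        "--config_from_run", run_id,
        "--output_dir", output_dir,
    ]


# All values below are made-up placeholders for illustration.
cmd = build_cross_validate_cmd(
    train="train.csv", fmt="csv", test="test.csv", valid="valid.csv",
    entity="my-team", project="my-project", group="cval",
    kfolds=8, run_id="abc123", output_dir="./cval_out",
)
print(shlex.join(cmd))  # shell-quoted form, convenient for logging
```

The list can be handed to subprocess.run(cmd, check=True) to execute it; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.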

Frequency-based approaches

Cross-validation is carried out within the main pipeline:

python freq_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR

Embedding

Cross-validation is carried out within the main pipeline:

python embedding_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
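Both pipelines above take --kfolds N. For readers unfamiliar with the idea, the following is a minimal, library-free sketch of how k-fold splitting partitions sample indices so that every sample is held out exactly once. It illustrates the general technique only and is not the pipelines' actual implementation; the function kfold_indices and its shuffle and seed parameters are hypothetical.

```python
import random


def kfold_indices(n_samples, k, shuffle=True, seed=0):
    """Yield (train_idx, test_idx) index lists for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    if shuffle:
        # A fixed seed keeps the folds reproducible between runs.
        random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


for fold, (train_idx, test_idx) in enumerate(kfold_indices(10, 3, shuffle=False)):
    print(fold, len(train_idx), test_idx)
```

A model is trained on each train_idx subset and scored on the matching test_idx subset, and the k scores are then summarised, typically by their mean and spread.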

Notes

The original documentation to specify training arguments is available here.

Usage

Deep learning

Sweep parameters and search space should be passed in as a json file.

python ../src/cross_validate.py -h
usage: cross_validate.py [-h] [--tokeniser_path TOKENISER_PATH] [-t TEST] [-v VALID] [-m MODEL_PATH] [-o OUTPUT_DIR]
                        [-d DEVICE] [-s VOCAB_SIZE] [-f HYPERPARAMETER_FILE] [-l LABEL_NAMES [LABEL_NAMES ...]]
                        [-k KFOLDS] [-e ENTITY_NAME] [-g GROUP_NAME] [-p PROJECT_NAME] [-c CONFIG_FROM_RUN]
                        [-o METRIC_OPT] [--overwrite_output_dir] [--no_shuffle] [--wandb_off]
                        train format

Take HuggingFace dataset and perform cross validation.

positional arguments:
  train                 path to [ csv | csv.gz | json | parquet ] file
  format                specify input file type [ csv | json | parquet ]

optional arguments:
  -h, --help            show this help message and exit
  --tokeniser_path TOKENISER_PATH
                        path to tokeniser.json file to load data from
  -t TEST, --test TEST  path to [ csv | csv.gz | json | parquet ] file
  -v VALID, --valid VALID
                        path to [ csv | csv.gz | json | parquet ] file
  -m MODEL_PATH, --model_path MODEL_PATH
                        path to pretrained model dir. this should contain files such as [ pytorch_model.bin,
                        config.yaml, tokeniser.json, etc ]
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify path for output (DEFAULT: ./cval_out)
  -d DEVICE, --device DEVICE
                        choose device [ cpu | cuda:0 ] (DEFAULT: detect)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration
  -f HYPERPARAMETER_FILE, --hyperparameter_file HYPERPARAMETER_FILE
                        provide training_args.bin or json file of hyperparameters. NOTE: if given, this overrides all
                        HfTrainingArguments! This is overridden by --config_from_run!
  -l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
                        provide column with label names (DEFAULT: "").
  -k KFOLDS, --kfolds KFOLDS
                        run n number of kfolds (DEFAULT: 8)
  -e ENTITY_NAME, --entity_name ENTITY_NAME
                        provide wandb team name (if available).
  -g GROUP_NAME, --group_name GROUP_NAME
                        provide wandb group name (if desired).
  -p PROJECT_NAME, --project_name PROJECT_NAME
                        provide wandb project name (if available).
  -c CONFIG_FROM_RUN, --config_from_run CONFIG_FROM_RUN
                        load arguments from existing wandb run. NOTE: if given, this overrides --hyperparameter_file!
  METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ eval/accuracy | eval/validation | eval/loss | eval/precision |
                        eval/recall ] (DEFAULT: eval/f1)
  --overwrite_output_dir
                        override output directory (DEFAULT: OFF)
  --no_shuffle          turn off random shuffling (DEFAULT: SHUFFLE)
  --wandb_off           run hyperparameter tuning using the wandb api and log training in real time online (DEFAULT:
                        ON)

Note

The --config_from_run option inherits the output directory paths of the original run, so make sure you specify a new --output_dir and set the --overwrite_output_dir flag. It also inherits the device specification (gpu or cpu) of that run.
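The precedence described in the help text and in this note can be pictured as layered dictionaries. The sketch below is only a mental model built on that description, not the script's actual code: resolve_settings and the example values are hypothetical, and it assumes that an explicitly supplied --output_dir replaces the inherited one.

```python
def resolve_settings(training_args, hyperparameter_file=None,
                     run_config=None, cli_output_dir=None):
    """Layer settings in the documented order of precedence."""
    settings = dict(training_args)
    if hyperparameter_file is not None:
        # --hyperparameter_file overrides the training arguments.
        settings.update(hyperparameter_file)
    if run_config is not None:
        # --config_from_run overrides --hyperparameter_file.
        settings.update(run_config)
    if cli_output_dir is not None:
        # Assumption: a new --output_dir replaces the inherited path.
        settings["output_dir"] = cli_output_dir
    return settings


# Made-up values for illustration only.
defaults = {"learning_rate": 5e-5, "output_dir": "./cval_out"}
from_file = {"learning_rate": 3e-5}
from_run = {"learning_rate": 1e-5, "output_dir": "./sweep_out", "device": "cuda:0"}

resolved = resolve_settings(defaults, from_file, from_run, cli_output_dir="./cval_out_new")
print(resolved)
```

Without cli_output_dir, the result keeps the sweep's output path, which is the situation the note warns about.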