Perform cross-validation¶
This explains the use of cross_validate.py for deep learning through the genomicBERT pipeline. For conventional machine learning, the sweep, train and cross validation steps are combined in one operation.
Source data¶
Source data is a HuggingFace dataset object as a csv, json or parquet file. Specify --format accordingly. csv only for non-deep learning.
Results¶
Note
Entry points are available if this is installed using the automated conda method. You can then use the command line argument directly, for example: create_dataset_bio. If not, you will need to use the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
Running the code as below:
Deep learning¶
Specify the same data, wandb project, entity and group names as used for sweeping or training. Once the best run is identified by the user, passing the run id into --config_from_run will automatically load config of the best run from wandb.
# use the WANDB_ENTITY_NAME, WANDB_PROJECT_NAME and the best run id corresponding to the sweep
# WANDB_GROUP_NAME should be changed to reflect the new category of runs (eg "cval")
python cross_validate.py <TRAIN_DATA> <FORMAT> --test TEST_DATA --valid VALIDATION_DATA --entity_name WANDB_ENTITY_NAME --project_name WANDB_PROJECT_NAME --group_name WANDB_GROUP_NAME --kfolds N --config_from_run WANDB_RUN_ID --output_dir OUTPUT_DIR
Frequency-based approaches¶
Cross-validation is carried out within the main pipeline:
python freq_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
Embedding¶
Cross-validation is carried out within the main pipeline:
python embedding_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
Notes¶
The original documentation to specify training arguments is available here.
Usage¶
Deep learning¶
Sweep parameters and search space should be passed in as a json file.
python ../src/cross_validate.py -h
usage: cross_validate.py [-h] [--tokeniser_path TOKENISER_PATH] [-t TEST] [-v VALID] [-m MODEL_PATH] [-o OUTPUT_DIR]
[-d DEVICE] [-s VOCAB_SIZE] [-f HYPERPARAMETER_FILE] [-l LABEL_NAMES [LABEL_NAMES ...]]
[-k KFOLDS] [-e ENTITY_NAME] [-g GROUP_NAME] [-p PROJECT_NAME] [-c CONFIG_FROM_RUN]
[-o METRIC_OPT] [--overwrite_output_dir] [--no_shuffle] [--wandb_off]
train format
Take HuggingFace dataset and perform cross validation.
positional arguments:
train path to [ csv | csv.gz | json | parquet ] file
format specify input file type [ csv | json | parquet ]
optional arguments:
-h, --help show this help message and exit
--tokeniser_path TOKENISER_PATH
path to tokeniser.json file to load data from
-t TEST, --test TEST path to [ csv | csv.gz | json | parquet ] file
-v VALID, --valid VALID
path to [ csv | csv.gz | json | parquet ] file
-m MODEL_PATH, --model_path MODEL_PATH
path to pretrained model dir. this should contain files such as [ pytorch_model.bin,
config.yaml, tokeniser.json, etc ]
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
specify path for output (DEFAULT: ./cval_out)
-d DEVICE, --device DEVICE
choose device [ cpu | cuda:0 ] (DEFAULT: detect)
-s VOCAB_SIZE, --vocab_size VOCAB_SIZE
vocabulary size for model configuration
-f HYPERPARAMETER_FILE, --hyperparameter_file HYPERPARAMETER_FILE
provide training_args.bin or json file of hyperparameters. NOTE: if given, this overrides all
HfTrainingArguments! This is overridden by --config_from_run!
-l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
provide column with label names (DEFAULT: "").
-k KFOLDS, --kfolds KFOLDS
run n number of kfolds (DEFAULT: 8)
-e ENTITY_NAME, --entity_name ENTITY_NAME
provide wandb team name (if available).
-g GROUP_NAME, --group_name GROUP_NAME
provide wandb group name (if desired).
-p PROJECT_NAME, --project_name PROJECT_NAME
provide wandb project name (if available).
-c CONFIG_FROM_RUN, --config_from_run CONFIG_FROM_RUN
load arguments from existing wandb run. NOTE: if given, this overrides --hyperparameter_file!
METRIC_OPT, --metric_opt METRIC_OPT
score to maximise [ eval/accuracy | eval/validation | eval/loss | eval/precision |
eval/recall ] (DEFAULT: eval/f1)
--overwrite_output_dir
override output directory (DEFAULT: OFF)
--no_shuffle turn off random shuffling (DEFAULT: SHUFFLE)
--wandb_off run hyperparameter tuning using the wandb api and log training in real time online (DEFAULT:
ON)
Note
If using the --config_from_run option, note that this inherits the original output directory paths. Make sure you specify a new --output_dir and enable the --overwrite_output_dir flag. This also inherits the device specifications (gpu or cpu).