Perform cross-validation#
This section explains the use of cross_validate.py for deep learning through the genomicBERT pipeline. For conventional machine learning, the sweep, train and cross-validation steps are combined into one operation.
Source data#
Source data is a HuggingFace dataset object provided as a csv, json or parquet file. Specify --format accordingly. Note that only csv is accepted by the non-deep learning pipelines.
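As an illustration, a minimal input csv might look like the sketch below. The column names (feature, labels) and file name are placeholders for illustration only and should match whatever your dataset creation step actually produced.

head -n 3 train.csv    # hypothetical file; column names are placeholders
idx,feature,labels
0,AACGTGGCCATGCA,1
1,TTGCATCGAATTGC,0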
Results#
Note

Entry points are available if this is installed using the automated conda method. You can then use the command directly, for example: create_dataset_bio. If not, you will need to call the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
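For the script documented on this page, the two invocation styles are therefore equivalent (assuming the naming pattern above holds):

# with the conda entry point installed
cross_validate -h
# calling the script directly
python cross_validate.py -h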
Run the code as shown below:
Deep learning#
Specify the same data, wandb project, entity and group names as used for sweeping or training. Once you have identified the best run, passing its run id to --config_from_run will automatically load the config of that run from wandb.
# use the WANDB_ENTITY_NAME, WANDB_PROJECT_NAME and the best run id corresponding to the sweep
# WANDB_GROUP_NAME should be changed to reflect the new category of runs (eg "cval")
python cross_validate.py <TRAIN_DATA> <FORMAT> --test TEST_DATA --valid VALIDATION_DATA --entity_name WANDB_ENTITY_NAME --project_name WANDB_PROJECT_NAME --group_name WANDB_GROUP_NAME --kfolds N --config_from_run WANDB_RUN_ID --output_dir OUTPUT_DIR
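As a concrete sketch, with hypothetical file names, wandb identifiers and run id (replace every value with your own):

# hypothetical values for illustration only
python cross_validate.py train.csv csv --test test.csv --valid valid.csv \
  --entity_name my_team --project_name my_project --group_name cval \
  --kfolds 8 --config_from_run 1a2b3c4d --output_dir cval_out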
Frequency-based approaches#
Cross-validation is carried out within the main pipeline:
python freq_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
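For example, a hypothetical run using count vectorisation (cvec) with a random forest (rf); all paths and values are placeholders:

# hypothetical values for illustration only
python freq_pipeline.py -i data.csv --format "csv" -t tokeniser.json \
  --freq_method cvec --model rf --kfolds 8 --sweep_count 8 \
  --metric_opt f1 --output_dir freq_out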
Embedding#
Cross-validation is carried out within the main pipeline:
python embedding_pipeline.py -i [INFILE_PATH ... ] --format "csv" -t TOKENISER_PATH --freq_method [ cvec | tfidf ] --model [ rf | xg ] --kfolds N --sweep_count N --metric_opt [ accuracy | f1 | precision | recall | roc_auc ] --output_dir OUTPUT_DIR
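The same pattern applies here; a hypothetical run using tfidf weighting with xgboost (xg), again with placeholder values:

# hypothetical values for illustration only
python embedding_pipeline.py -i data.csv --format "csv" -t tokeniser.json \
  --freq_method tfidf --model xg --kfolds 8 --sweep_count 8 \
  --metric_opt roc_auc --output_dir embed_out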
Notes#
The original documentation for specifying training arguments is available here.
Usage#
Deep learning#
Sweep parameters and search space should be passed in as a json file.
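For example, a hyperparameter json file passed via -f might look like the minimal sketch below. The keys follow standard HuggingFace TrainingArguments naming; which keys your run actually needs is an assumption here, and all values are placeholders.

{
  "learning_rate": 2e-05,
  "per_device_train_batch_size": 16,
  "num_train_epochs": 3,
  "weight_decay": 0.01
}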
python ../src/cross_validate.py -h
usage: cross_validate.py [-h] [--tokeniser_path TOKENISER_PATH] [-t TEST] [-v VALID] [-m MODEL_PATH] [-o OUTPUT_DIR]
[-d DEVICE] [-s VOCAB_SIZE] [-f HYPERPARAMETER_FILE] [-l LABEL_NAMES [LABEL_NAMES ...]]
[-k KFOLDS] [-e ENTITY_NAME] [-g GROUP_NAME] [-p PROJECT_NAME] [-c CONFIG_FROM_RUN]
[--metric_opt METRIC_OPT] [--overwrite_output_dir] [--no_shuffle] [--wandb_off]
train format
Take HuggingFace dataset and perform cross validation.
positional arguments:
train path to [ csv | csv.gz | json | parquet ] file
format specify input file type [ csv | json | parquet ]
optional arguments:
-h, --help show this help message and exit
--tokeniser_path TOKENISER_PATH
path to tokeniser.json file to load data from
-t TEST, --test TEST path to [ csv | csv.gz | json | parquet ] file
-v VALID, --valid VALID
path to [ csv | csv.gz | json | parquet ] file
-m MODEL_PATH, --model_path MODEL_PATH
path to pretrained model dir. this should contain files such as [ pytorch_model.bin,
config.yaml, tokeniser.json, etc ]
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
specify path for output (DEFAULT: ./cval_out)
-d DEVICE, --device DEVICE
choose device [ cpu | cuda:0 ] (DEFAULT: detect)
-s VOCAB_SIZE, --vocab_size VOCAB_SIZE
vocabulary size for model configuration
-f HYPERPARAMETER_FILE, --hyperparameter_file HYPERPARAMETER_FILE
provide torch.bin or json file of hyperparameters. NOTE: if given, this overrides all
HfTrainingArguments! This is overridden by --config_from_run!
-l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
provide column with label names (DEFAULT: "").
-k KFOLDS, --kfolds KFOLDS
run n number of kfolds (DEFAULT: 8)
-e ENTITY_NAME, --entity_name ENTITY_NAME
provide wandb team name (if available).
-g GROUP_NAME, --group_name GROUP_NAME
provide wandb group name (if desired).
-p PROJECT_NAME, --project_name PROJECT_NAME
provide wandb project name (if available).
-c CONFIG_FROM_RUN, --config_from_run CONFIG_FROM_RUN
load arguments from existing wandb run. NOTE: if given, this overrides --hyperparameter_file!
--metric_opt METRIC_OPT
score to maximise [ eval/accuracy | eval/validation | eval/loss | eval/precision |
eval/recall ] (DEFAULT: eval/f1)
--overwrite_output_dir
overwrite output directory (DEFAULT: OFF)
--no_shuffle turn off random shuffling (DEFAULT: SHUFFLE)
--wandb_off disable wandb logging; by default, hyperparameter tuning and training are logged in real
time online via the wandb api (DEFAULT: ON)
Note

If using the --config_from_run option, note that this inherits the original output directory paths. Make sure you specify a new --output_dir and enable the --overwrite_output_dir flag. This also inherits the device specification (gpu or cpu).
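A sketch of applying this note, with a hypothetical run id, a fresh output directory and an explicit device override:

# hypothetical values; note the new --output_dir, the overwrite flag
# and the explicit --device to avoid inheriting the old device setting
python cross_validate.py train.csv csv --config_from_run 1a2b3c4d \
  --output_dir cval_out_new --overwrite_output_dir --device cuda:0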