Perform a hyperparameter sweep
==============================

This explains the use of ``sweep.py`` for machine and deep learning through ``genomicBERT``. If you already know which hyperparameters are needed, you can use ``train_model.py`` instead. For conventional machine learning, the sweep, train and cross-validation steps are combined into one operation.

Source data
-----------

Source data is a HuggingFace ``dataset`` object stored as a ``csv``, ``json`` or ``parquet`` file. Specify ``--format`` accordingly. Only ``csv`` is supported for the non-deep learning pipelines.

Results
-------

.. NOTE::

  Entry points are available if this is installed using the automated conda method. You can then use the command line argument directly, for example: ``create_dataset_bio``. If not, you will need to use the script directly, which follows the same naming pattern, for example: ``python create_dataset_bio.py``.

Run the code as shown below:

Deep learning
+++++++++++++

::

  python sweep.py
    TRAIN_DATA
    FORMAT
    TOKENISER_PATH
    --test TEST_DATA
    --valid VALIDATION_DATA
    --hyperparameter_sweep PARAMS.JSON
    --entity_name WANDB_ENTITY_NAME
    --project_name WANDB_PROJECT_NAME
    --group_name WANDB_GROUP_NAME
    --sweep_count N
    --metric_opt [ eval/accuracy | eval/validation | eval/loss | eval/precision | eval/recall ]
    --output_dir OUTPUT_DIR

Frequency-based approaches
++++++++++++++++++++++++++

::

  python freq_pipeline.py
    -i [INFILE_PATH ... ]
    --format "csv"
    -t TOKENISER_PATH
    --freq_method [ cvec | tfidf ]
    --model [ rf | xg ]
    --kfolds N
    --sweep_count N
    --metric_opt [ accuracy | f1 | precision | recall | roc_auc ]
    --output_dir OUTPUT_DIR

Embedding
+++++++++

::

  python embedding_pipeline.py
    -i [INFILE_PATH ... ]
    --format "csv"
    -t TOKENISER_PATH
    --freq_method [ cvec | tfidf ]
    --model [ rf | xg ]
    --kfolds N
    --sweep_count N
    --metric_opt [ accuracy | f1 | precision | recall | roc_auc ]
    --output_dir OUTPUT_DIR

Notes
-----

The `original documentation to specify training arguments is available here`_.

.. _original documentation to specify training arguments is available here: https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.TrainingArguments

Usage
-----

genomicBERT: Deep learning
++++++++++++++++++++++++++

Sweep parameters and the search space should be passed in as a ``json`` file.

Example ``hyperparameter.json`` file:

.. code-block:: json

  {
    "name": "random",
    "method": "random",
    "metric": {
      "name": "eval/f1",
      "goal": "maximize"
    },
    "parameters": {
      "epochs": {
        "values": [1, 2, 3]
      },
      "batch_size": {
        "values": [8, 16, 32, 64]
      },
      "learning_rate": {
        "distribution": "log_uniform_values",
        "min": 0.0001,
        "max": 0.1
      },
      "weight_decay": {
        "values": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      }
    },
    "early_terminate": {
      "type": "hyperband",
      "s": 2,
      "eta": 3,
      "max_iter": 27
    }
  }

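
Before launching a long sweep it can help to sanity-check the ``json`` file. The following is a minimal sketch, not part of ``genomicBERT``: the function name ``check_sweep_config`` and the set of required keys are assumptions made for illustration, and it only understands the two parameter shapes used in the example above (a ``values`` list or a ``min``/``max`` range).

.. code-block:: python

  import json

  # Assumed for illustration: the top-level keys used by the example config.
  REQUIRED_KEYS = ("method", "metric", "parameters")


  def check_sweep_config(config):
      """Return a list of problems found in a sweep configuration dict."""
      problems = []
      for key in REQUIRED_KEYS:
          if key not in config:
              problems.append(f"missing top-level key: {key}")

      # The metric needs an optimisation direction.
      metric = config.get("metric", {})
      if metric.get("goal") not in ("maximize", "minimize"):
          problems.append("metric.goal should be 'maximize' or 'minimize'")

      # Each parameter is either a discrete list or a numeric range.
      for name, spec in config.get("parameters", {}).items():
          if "values" in spec:
              if not spec["values"]:
                  problems.append(f"{name}: empty 'values' list")
          elif "min" in spec and "max" in spec:
              if spec["min"] >= spec["max"]:
                  problems.append(f"{name}: 'min' must be below 'max'")
          else:
              problems.append(f"{name}: needs 'values' or 'min'/'max'")
      return problems


  # The example configuration from this page, inlined so the sketch runs as is.
  example = json.loads("""
  {
    "name": "random",
    "method": "random",
    "metric": {"name": "eval/f1", "goal": "maximize"},
    "parameters": {
      "epochs": {"values": [1, 2, 3]},
      "batch_size": {"values": [8, 16, 32, 64]},
      "learning_rate": {
        "distribution": "log_uniform_values",
        "min": 0.0001,
        "max": 0.1
      },
      "weight_decay": {"values": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]}
    },
    "early_terminate": {"type": "hyperband", "s": 2, "eta": 3, "max_iter": 27}
  }
  """)

  problems = check_sweep_config(example)
  print("config looks consistent" if not problems else problems)

To check a file on disk, replace the inlined string with ``json.load(open(path))``. A clean result only means the file has the expected shape; it does not guarantee that ``sweep.py`` or ``wandb`` will accept it.
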
::

  usage: sweep.py [-h] [-t TEST] [-v VALID] [-m MODEL]
                  [--model_features MODEL_FEATURES] [-o OUTPUT_DIR]
                  [-d DEVICE] [-s VOCAB_SIZE] [-w HYPERPARAMETER_SWEEP]
                  [-l LABEL_NAMES [LABEL_NAMES ...]] [-n SWEEP_COUNT]
                  [-e ENTITY_NAME] [-p PROJECT_NAME] [-g GROUP_NAME]
                  [-c METRIC_OPT] [-r RESUME_SWEEP] [--fp16_off] [--wandb_off]
                  train format tokeniser_path

  Take HuggingFace dataset and perform parameter sweeping.

  positional arguments:
    train                 path to [ csv | csv.gz | json | parquet ] file
    format                specify input file type [ csv | json | parquet ]
    tokeniser_path        path to tokeniser.json file to load data from

  options:
    -h, --help            show this help message and exit
    -t TEST, --test TEST  path to [ csv | csv.gz | json | parquet ] file
    -v VALID, --valid VALID
                          path to [ csv | csv.gz | json | parquet ] file
    -m MODEL, --model MODEL
                          choose model [ distilbert | longformer ]
                          distilbert handles shorter sequences up to 512 tokens
                          longformer handles longer sequences up to 4096 tokens
                          (DEFAULT: distilbert)
    --model_features MODEL_FEATURES
                          number of features in data to use (DEFAULT: ALL)
                          NOTE: this is separate from the vocab_size argument.
                          under normal circumstances (eg a tokeniser generated
                          by tokenise_bio), setting this is not necessary
    -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                          specify path for output (DEFAULT: ./sweep_out)
    -d DEVICE, --device DEVICE
                          choose device [ cpu | cuda:0 ] (DEFAULT: detect)
    -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                          vocabulary size for model configuration
    -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                          run a hyperparameter sweep with config from file
    -l LABEL_NAMES [LABEL_NAMES ...], --label_names LABEL_NAMES [LABEL_NAMES ...]
                          provide column with label names (DEFAULT: "").
    -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                          run n hyperparameter sweeps (DEFAULT: 64)
    -e ENTITY_NAME, --entity_name ENTITY_NAME
                          provide wandb team name (if available).
    -p PROJECT_NAME, --project_name PROJECT_NAME
                          provide wandb project name (if available).
    -g GROUP_NAME, --group_name GROUP_NAME
                          provide wandb group name (if desired).
    -c METRIC_OPT, --metric_opt METRIC_OPT
                          score to maximise [ eval/accuracy | eval/validation |
                          eval/loss | eval/precision | eval/recall ]
                          (DEFAULT: eval/f1)
    -r RESUME_SWEEP, --resume_sweep RESUME_SWEEP
                          provide sweep id to resume sweep.
    --fp16_off            turn fp16 off for precision / cpu (DEFAULT: ON)
    --wandb_off           run hyperparameter tuning using the wandb api and log
                          training in real time online (DEFAULT: ON)

Frequency based approach
++++++++++++++++++++++++

::

  python freq_pipeline.py -h
  usage: freq_pipeline.py [-h] [--infile_path INFILE_PATH [INFILE_PATH ...]]
                          [--format FORMAT] [--embeddings EMBEDDINGS]
                          [--chunk_size CHUNK_SIZE] [-t TOKENISER_PATH]
                          [-f FREQ_METHOD] [--column_names COLUMN_NAMES]
                          [--column_name COLUMN_NAME] [-m MODEL]
                          [-e MODEL_FEATURES] [-k KFOLDS]
                          [--ngram_from NGRAM_FROM] [--ngram_to NGRAM_TO]
                          [--split_train SPLIT_TRAIN] [--split_test SPLIT_TEST]
                          [--split_val SPLIT_VAL] [-o OUTPUT_DIR]
                          [-s VOCAB_SIZE]
                          [--special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                          [-w HYPERPARAMETER_SWEEP]
                          [--sweep_method SWEEP_METHOD] [-n SWEEP_COUNT]
                          [-c METRIC_OPT] [-j NJOBS] [-d PRE_DISPATCH]

  Take HuggingFace dataset and perform parameter sweeping.

  options:
    -h, --help            show this help message and exit
    --infile_path INFILE_PATH [INFILE_PATH ...]
                          path to [ csv | csv.gz | json | parquet ] file
    --format FORMAT       specify input file type [ csv | json | parquet ]
    --embeddings EMBEDDINGS
                          path to embeddings model file
    --chunk_size CHUNK_SIZE
                          iterate over input file for these many rows
    -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                          path to tokeniser.json file to load data from
    -f FREQ_METHOD, --freq_method FREQ_METHOD
                          choose dist [ cvec | tfidf ] (DEFAULT: tfidf)
    --column_names COLUMN_NAMES
                          column name for sp tokenised data (DEFAULT: input_str)
    --column_name COLUMN_NAME
                          column name for extracting embeddings
                          (DEFAULT: input_str)
    -m MODEL, --model MODEL
                          choose model [ rf | xg ] (DEFAULT: rf)
    -e MODEL_FEATURES, --model_features MODEL_FEATURES
                          number of features in data to use (DEFAULT: ALL)
    -k KFOLDS, --kfolds KFOLDS
                          number of cross validation folds (DEFAULT: 8)
    --ngram_from NGRAM_FROM
                          ngram slice starting index (DEFAULT: 1)
    --ngram_to NGRAM_TO   ngram slice ending index (DEFAULT: 1)
    --split_train SPLIT_TRAIN
                          proportion of training data (DEFAULT: 0.90)
    --split_test SPLIT_TEST
                          proportion of testing data (DEFAULT: 0.05)
    --split_val SPLIT_VAL
                          proportion of validation data (DEFAULT: 0.05)
    -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                          specify path for output (DEFAULT: ./results_out)
    -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                          vocabulary size for model configuration
    --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                          assign special tokens, eg space and pad tokens
                          (DEFAULT: ["", "", "", "", ""])
    -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                          run a hyperparameter sweep with config from file
    --sweep_method SWEEP_METHOD
                          specify sweep search strategy [ bayes | grid | random ]
                          (DEFAULT: random)
    -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                          run n hyperparameter sweeps (DEFAULT: 8)
    -c METRIC_OPT, --metric_opt METRIC_OPT
                          score to maximise [ accuracy | f1 | precision | recall ]
                          (DEFAULT: f1)
    -j NJOBS, --njobs NJOBS
                          run on n threads (DEFAULT: -1)
    -d PRE_DISPATCH, --pre_dispatch PRE_DISPATCH
                          specify dispatched jobs (DEFAULT: 0.5*n_jobs)

Embedding based approach
++++++++++++++++++++++++

::

  python embedding_pipeline.py -h
  usage: embedding_pipeline.py [-h] [--infile_path INFILE_PATH [INFILE_PATH ...]]
                               [--format FORMAT] [--embeddings EMBEDDINGS]
                               [--chunk_size CHUNK_SIZE] [-t TOKENISER_PATH]
                               [-f FREQ_METHOD] [--column_names COLUMN_NAMES]
                               [--column_name COLUMN_NAME] [-m MODEL]
                               [-e MODEL_FEATURES] [-k KFOLDS]
                               [--ngram_from NGRAM_FROM] [--ngram_to NGRAM_TO]
                               [--split_train SPLIT_TRAIN]
                               [--split_test SPLIT_TEST] [--split_val SPLIT_VAL]
                               [-o OUTPUT_DIR] [-s VOCAB_SIZE]
                               [--special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]]
                               [-w HYPERPARAMETER_SWEEP]
                               [--sweep_method SWEEP_METHOD] [-n SWEEP_COUNT]
                               [-c METRIC_OPT] [-j NJOBS] [-d PRE_DISPATCH]

  Take HuggingFace dataset and perform parameter sweeping.

  options:
    -h, --help            show this help message and exit
    --infile_path INFILE_PATH [INFILE_PATH ...]
                          path to [ csv | csv.gz | json | parquet ] file
    --format FORMAT       specify input file type [ csv | json | parquet ]
    --embeddings EMBEDDINGS
                          path to embeddings model file
    --chunk_size CHUNK_SIZE
                          iterate over input file for these many rows
    -t TOKENISER_PATH, --tokeniser_path TOKENISER_PATH
                          path to tokeniser.json file to load data from
    -f FREQ_METHOD, --freq_method FREQ_METHOD
                          choose dist [ embed ] (DEFAULT: embed)
    --column_names COLUMN_NAMES
                          column name for sp tokenised data (DEFAULT: input_str)
    --column_name COLUMN_NAME
                          column name for extracting embeddings
                          (DEFAULT: input_str)
    -m MODEL, --model MODEL
                          choose model [ rf | xg ] (DEFAULT: rf)
    -e MODEL_FEATURES, --model_features MODEL_FEATURES
                          number of features in data to use (DEFAULT: ALL)
    -k KFOLDS, --kfolds KFOLDS
                          number of cross validation folds (DEFAULT: 8)
    --ngram_from NGRAM_FROM
                          ngram slice starting index (DEFAULT: 1)
    --ngram_to NGRAM_TO   ngram slice ending index (DEFAULT: 1)
    --split_train SPLIT_TRAIN
                          proportion of training data (DEFAULT: 0.90)
    --split_test SPLIT_TEST
                          proportion of testing data (DEFAULT: 0.05)
    --split_val SPLIT_VAL
                          proportion of validation data (DEFAULT: 0.05)
    -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                          specify path for output (DEFAULT: ./results_out)
    -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                          vocabulary size for model configuration
    --special_tokens SPECIAL_TOKENS [SPECIAL_TOKENS ...]
                          assign special tokens, eg space and pad tokens
                          (DEFAULT: ["", "", "", "", ""])
    -w HYPERPARAMETER_SWEEP, --hyperparameter_sweep HYPERPARAMETER_SWEEP
                          run a hyperparameter sweep with config from file
    --sweep_method SWEEP_METHOD
                          specify sweep search strategy [ bayes | grid | random ]
                          (DEFAULT: random)
    -n SWEEP_COUNT, --sweep_count SWEEP_COUNT
                          run n hyperparameter sweeps (DEFAULT: 8)
    -c METRIC_OPT, --metric_opt METRIC_OPT
                          score to maximise [ accuracy | f1 | precision | recall ]
                          (DEFAULT: f1)
    -j NJOBS, --njobs NJOBS
                          run on n threads (DEFAULT: -1)
    -d PRE_DISPATCH, --pre_dispatch PRE_DISPATCH
                          specify dispatched jobs (DEFAULT: 0.5*n_jobs)

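
When several sweeps are launched from a script, it can be convenient to assemble the command line programmatically. The sketch below is an illustration only: the helper ``build_freq_command`` and the file names are hypothetical, while the flags and their defaults are taken from the ``freq_pipeline.py`` help text above. It builds and prints the command without running it.

.. code-block:: python

  import shlex


  def build_freq_command(infiles, tokeniser_path, output_dir,
                         freq_method="tfidf", model="rf",
                         kfolds=8, sweep_count=8, metric_opt="f1"):
      """Build an argument list for freq_pipeline.py (hypothetical helper)."""
      # Choices mirror the help text; anything else is rejected early.
      if freq_method not in ("cvec", "tfidf"):
          raise ValueError(f"unknown freq_method: {freq_method}")
      if model not in ("rf", "xg"):
          raise ValueError(f"unknown model: {model}")
      return [
          "python", "freq_pipeline.py",
          "--infile_path", *infiles,
          "--format", "csv",
          "--tokeniser_path", tokeniser_path,
          "--freq_method", freq_method,
          "--model", model,
          "--kfolds", str(kfolds),
          "--sweep_count", str(sweep_count),
          "--metric_opt", metric_opt,
          "--output_dir", output_dir,
      ]


  # Placeholder file names, chosen only for this example.
  cmd = build_freq_command(
      ["train.csv"], "tokeniser.json", "results_out", model="xg"
  )
  print(shlex.join(cmd))

The resulting list can be passed to ``subprocess.run(cmd, check=True)`` once the paths point at real files. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.
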