genomicBERT: Train a deep learning classifier#
This explains the use of train.py. Use this if you already know what hyperparameters are needed. Otherwise use sweep.py. For conventional machine learning, the sweep, train and cross validation steps are combined in one operation.
Source data#
Source data is a HuggingFace dataset object stored as a csv, json or parquet file. Specify the format positional argument accordingly.
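As a minimal sketch, the format value can be derived from the file name. The helper `infer_format` below is hypothetical and not part of genomicBERT. It assumes the file types listed in the train.py help text (csv, csv.gz, json, parquet) and further assumes that a gzipped csv is passed with the format value csv.

```python
from pathlib import Path

# Assumed mapping from file suffix to format value; not taken from genomicBERT.
SUFFIX_TO_FORMAT = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def infer_format(path: str) -> str:
    """Return the format string for a data file (hypothetical helper)."""
    suffixes = Path(path).suffixes
    # Drop a trailing .gz so that train.csv.gz is treated as csv (assumption).
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in SUFFIX_TO_FORMAT:
        raise ValueError(f"unsupported file type: {path}")
    return SUFFIX_TO_FORMAT[suffixes[-1]]
```

Calling `infer_format("train.csv.gz")` under these assumptions yields the string to pass as the second positional argument.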
Results#
Note
Entry points are available if this is installed using the automated conda method. You can then use the command line argument directly, for example: create_dataset_bio. If not, you will need to use the script directly, which follows the same naming pattern, for example: python create_dataset_bio.py.
Running the code as below:
python train.py <TRAIN_DATA> <FORMAT> <TOKENISER_PATH> --test TEST_DATA --valid VALIDATION_DATA --hyperparameter_file PARAMS.JSON --entity_name WANDB_ENTITY_NAME --project_name WANDB_PROJECT_NAME --group_name WANDB_GROUP_NAME --metric_opt [ eval/accuracy | eval/validation | eval/loss | eval/precision | eval/recall ] --output_dir OUTPUT_DIR --label_names labels
Note
Remember to provide the --label_names argument! This is labels by default (if this wasn’t changed in any previous part of the pipeline).
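When the command is launched from a script rather than typed by hand, it can help to assemble it as an argument list. The sketch below is an illustration only: `build_train_command` is a hypothetical helper, the file names are placeholders, and the script name defaults to train.py on the assumption that the script is called directly instead of through an entry point. Only options documented for this step are used.

```python
import shlex


def build_train_command(train, fmt, tokeniser_path, script="train.py", **options):
    """Assemble an argument list for the training script (hypothetical helper)."""
    cmd = ["python", script, train, fmt, tokeniser_path]
    for name, value in options.items():
        if value is None:
            continue  # skip options that were not set
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_train_command(
    "train.csv", "csv", "tokeniser.json",  # placeholder paths
    test="test.csv",
    valid="valid.csv",
    output_dir="results",
    label_names="labels",  # the documented default label column
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start training.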
You will obtain a json file with weights for each token. Any special tokens you add will also be present. This will be used in the next step of creating a HuggingFace compatible dataset object.
Notes#
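The original documentation for specifying training arguments is the HuggingFace transformers.TrainingArguments reference: https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.TrainingArguments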
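A minimal sketch of preparing a file for --hyperparameter_file follows. The keys are field names from transformers.TrainingArguments, and the values are placeholders rather than recommendations. That train.py expects a flat JSON object of this shape is an assumption, as is the file name params.json; check the package before relying on it.

```python
import json
from pathlib import Path

# Placeholder values keyed by transformers.TrainingArguments field names.
# Assumption: the file is a flat JSON object of argument names and values.
params = {
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.01,
}

path = Path("params.json")  # hypothetical file name
path.write_text(json.dumps(params, indent=2))

# Read the file back to confirm that it round-trips unchanged.
loaded = json.loads(path.read_text())
```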
Usage#
The help output below is truncated: only the arguments added by this package are shown. The remaining arguments are described in the corresponding HuggingFace transformers.TrainingArguments documentation referenced above.
python train.py -h
Take HuggingFace dataset and train. Arguments match that of
TrainingArguments, with the addition of [ train, test, valid, tokeniser_path,
vocab_size, model, device, entity_name, project_name, group_name,
config_from_run, metric_opt, hyperparameter_file, no_shuffle, wandb_off,
override_output_dir ]. See: https://huggingface.co/docs/transformers/v4.19.4/
en/main_classes/trainer#transformers.TrainingArguments
positional arguments:
  train                 path to [ csv | csv.gz | json | parquet ] file
  format                specify input file type [ csv | json | parquet ]
  tokeniser_path        path to tokeniser.json file to load data from

options:
  -h, --help            show this help message and exit
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and
                        checkpoints will be written. (default: None)
  --overwrite_output_dir [OVERWRITE_OUTPUT_DIR]
                        Overwrite the content of the output directory. Use
                        this to continue training if output_dir points to a
                        checkpoint directory. (default: False)
  -t TEST, --test TEST  path to [ csv | csv.gz | json | parquet ] file
                        (default: None)
  -v VALID, --valid VALID
                        path to [ csv | csv.gz | json | parquet ] file
                        (default: None)
  -m MODEL, --model MODEL
                        choose model [ distilbert | longformer ] distilbert
                        handles shorter sequences up to 512 tokens longformer
                        handles longer sequences up to 4096 tokens (DEFAULT:
                        distilbert) (default: distilbert)
  -d DEVICE, --device DEVICE
                        choose device [ cpu | cuda:0 ] (DEFAULT: detect)
                        (default: None)
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        vocabulary size for model configuration (default:
                        32000)
  -f HYPERPARAMETER_FILE, --hyperparameter_file HYPERPARAMETER_FILE
                        provide torch.bin or json file of hyperparameters.
                        NOTE: if given, this overrides all
                        HfTrainingArguments! This is overridden by
                        --config_from_run! (default: )
  -e ENTITY_NAME, --entity_name ENTITY_NAME
                        provide wandb team name (if available). NOTE: has no
                        effect if wandb is disabled. (default: )
  -p PROJECT_NAME, --project_name PROJECT_NAME
                        provide wandb project name (if available). NOTE: has
                        no effect if wandb is disabled. (default: )
  -g GROUP_NAME, --group_name GROUP_NAME
                        provide wandb group name (if desired). (default:
                        train)
  -c CONFIG_FROM_RUN, --config_from_run CONFIG_FROM_RUN
                        load arguments from existing wandb run. NOTE: if
                        given, this overrides --hyperparameter_file!
                        (default: None)
  METRIC_OPT, --metric_opt METRIC_OPT
                        score to maximise [ eval/accuracy | eval/validation |
                        eval/loss | eval/precision | eval/recall ] (DEFAULT:
                        eval/f1) (default: eval/f1)
  --override_output_dir
                        override output directory (DEFAULT: OFF) (default:
                        False)
  --no_shuffle          turn off random shuffling (DEFAULT: SHUFFLE)
                        (default: True)
  --wandb_off           log training in real time online (DEFAULT: ON)
                        (default: True)
[ADDITIONAL ARGUMENTS TRUNCATED]
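The help text states an order of precedence: --config_from_run overrides --hyperparameter_file, which in turn overrides the training arguments given on the command line. The sketch below illustrates that rule only. `select_hyperparameters` is a hypothetical function, not the package's implementation; it assumes each source has already been loaded as a dictionary, and that a higher-priority source replaces the lower ones wholesale rather than being merged key by key.

```python
from typing import Optional


def select_hyperparameters(
    cli_args: dict,
    file_params: Optional[dict] = None,
    run_config: Optional[dict] = None,
) -> dict:
    """Pick the highest-priority hyperparameter source that was supplied."""
    if run_config is not None:   # --config_from_run wins over everything
        return dict(run_config)
    if file_params is not None:  # --hyperparameter_file beats CLI arguments
        return dict(file_params)
    return dict(cli_args)        # fall back to the command-line arguments
```

For example, if both a wandb run configuration and a hyperparameter file are supplied, only the run configuration is used under these assumptions.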