# Running ML experiments with hi-ml

The hi-ml toolbox can train any PyTorch Lightning (PL) model inside of AzureML, making use of these features:

- Training on a local GPU machine or inside of AzureML without code changes
- Working with different models in the same codebase, and selecting one by name
- Distributed training in AzureML
- Logging via AzureML's native capabilities
- Evaluation of the trained model on new datasets

To use this, invoke the hi-ml runner and provide the name of the container class, like this:

```
himl-runner --model=MyContainer
```
There is a fully working example, HelloContainer, that implements a simple 1-dimensional regression model from data stored in a CSV file. You can run that from the command line via:

```
himl-runner --model=HelloWorld
```
## Specifying the model to run

The `--model` argument specifies the name of a class that should be used for model training. The class needs to be a subclass of `LightningContainer`, see below. There are different ways of telling the runner where to find that class:

- If providing just a single class name, like `--model=HelloWorld`, the class is expected somewhere in the `health_ml.configs` namespace. It can be in any module/folder inside of that namespace.
- If the class is outside of `health_ml.configs` (as would be normal if using the `himl-runner` from a package), you need to provide some "hints" about where to start searching. It is enough to provide the start of the namespace string: for example, `--model health_cpath.PandaImageNetMIL` effectively tells the runner to search for the `PandaImageNetMIL` class anywhere in the `health_cpath` namespace. You can think of this as `health_cpath.*.PandaImageNetMIL`.
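This wildcard-style lookup can be pictured with a small sketch. Note that this is illustrative only, not the runner's actual implementation; `find_class` is a hypothetical helper that walks every module under a namespace and returns the first class whose name matches:

```python
import importlib
import pkgutil
from typing import Optional


def find_class(class_name: str, namespace: str) -> Optional[type]:
    """Search the given namespace, and all modules below it, for a class
    with the given name. Returns None if nothing matches."""
    root = importlib.import_module(namespace)
    modules = [root]
    # Recurse into submodules if the namespace is a package
    for info in pkgutil.walk_packages(getattr(root, "__path__", []),
                                      prefix=namespace + "."):
        try:
            modules.append(importlib.import_module(info.name))
        except ImportError:
            continue  # Skip modules that fail to import
    for module in modules:
        candidate = getattr(module, class_name, None)
        if isinstance(candidate, type):
            return candidate
    return None
```

For example, `find_class("JSONDecoder", "json")` locates `JSONDecoder` by walking the standard library's `json` package, without needing its full dotted path.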
## Running ML experiments in Azure ML

To train in AzureML, use the flag `--cluster` to specify the name of the cluster in your Workspace that you want to submit the job to. The whole command would then look like:

```
himl-runner --model=HelloWorld --cluster=my_cluster_name
```

You can also specify `--num_nodes` if you wish to distribute the model training.

When starting the runner, you need to do so from a directory that contains all the code that your experiment needs: the current working directory will be used as the root of all data that will be copied to AzureML to run your experiment. (The only exception to this rule is if you start the runner from within an enlistment of the hi-ml GitHub repository.)

AzureML needs to know which Python/Conda environment it should use. For that, the runner needs a file `environment.yml` that contains a Conda environment definition. This file needs to be present either in the current working directory or one of its parents. To specify a Conda environment that is located elsewhere, you can use:

```
himl-runner --model=HelloWorld --cluster=my_cluster_name --conda_env=/my/folder/to/special_environment.yml
```
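As a reference, a minimal `environment.yml` could look like the sketch below. The environment name and package list are illustrative placeholders; your experiment will need its own dependencies:

```yaml
name: my-experiment-env
channels:
  - defaults
dependencies:
  - python=3.9
  - pip
  - pip:
      - pytorch-lightning
      - hi-ml
```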
## Setup - creating your model config file

In order to use these capabilities, you need to implement a class deriving from `health_ml.lightning_container.LightningContainer`. This class encapsulates everything that is needed for training with PyTorch Lightning. For example:

```python
class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.azure_datasets = ["folder_name_in_azure_blob_storage"]
        self.local_datasets = [Path("/some/local/path")]
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset)
```
The `create_model` method needs to return a subclass of PyTorch Lightning's `LightningModule` that has all the usual PyTorch Lightning methods required for training, like the `training_step` and `forward` methods. For example:

```python
class MyLightningModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = ...

    def training_step(self, *args, **kwargs):
        ...

    def forward(self, *args, **kwargs):
        ...

    def configure_optimizers(self):
        ...

    def test_step(self, *args, **kwargs):
        ...
```
The `get_data_module` method of the container needs to return a data module (inheriting from PyTorch Lightning's `LightningDataModule`) that contains all of the logic for downloading, preparing and splitting your dataset, as well as methods for wrapping the train, validation and test datasets with `DataLoader`s. For example:

```python
class MyDataModule(LightningDataModule):
    def __init__(self, root_path: Path):
        super().__init__()
        # All data should be read from the folder given in self.root_path
        self.root_path = root_path

    def train_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        train_dataset = ...
        return DataLoader(train_dataset, batch_size=5, num_workers=5)

    def val_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        val_dataset = ...
        return DataLoader(val_dataset, batch_size=5, num_workers=5)

    def test_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        test_dataset = ...
        return DataLoader(test_dataset, batch_size=5, num_workers=5)
```
So, the full file would look like:

```python
from pathlib import Path

from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, LightningDataModule

from health_ml.lightning_container import LightningContainer


class MyLightningModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = ...

    def training_step(self, *args, **kwargs):
        ...

    def forward(self, *args, **kwargs):
        ...

    def configure_optimizers(self):
        ...

    def test_step(self, *args, **kwargs):
        ...


class MyDataModule(LightningDataModule):
    def __init__(self, root_path: Path):
        super().__init__()
        # All data should be read from the folder given in self.root_path
        self.root_path = root_path

    def train_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        train_dataset = ...
        return DataLoader(train_dataset, batch_size=5, num_workers=5)

    def val_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        val_dataset = ...
        return DataLoader(val_dataset, batch_size=5, num_workers=5)

    def test_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        test_dataset = ...
        return DataLoader(test_dataset, batch_size=5, num_workers=5)


class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.azure_datasets = ["folder_name_in_azure_blob_storage"]
        self.local_datasets = [Path("/some/local/path")]
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset)
```
By default, config files are looked for in the folder `health_ml.configs`. To specify config files that live elsewhere, use a fully qualified name for the `--model` parameter, e.g. `MyModule.Configs.my_config.py`.
## Outputting files during training

The Lightning model returned by `create_model` needs to write its output files to the current working directory. When running inside of AzureML, the output folders will be directly under the project root. If not running inside AzureML, a folder with a timestamp will be created for all outputs and logs.

When running in AzureML, the folder structure is set up such that all files written to the current working directory are uploaded to Azure blob storage at the end of the AzureML job. The files will then also be available via the AzureML UI.
## Trainer arguments

All arguments that control the PyTorch Lightning `Trainer` object are defined in the class `TrainerParams`. A `LightningContainer` object inherits from this class. The most essential one is the `max_epochs` field, which controls the `max_epochs` argument of the `Trainer`. For example:

```python
from pytorch_lightning import LightningModule, LightningDataModule
from health_ml.lightning_container import LightningContainer


class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset)
```
## Optimizer and LR scheduler arguments

To control the optimizer and LR scheduler, the Lightning model returned by `create_model` should define its own `configure_optimizers` method, with the same signature as `LightningModule.configure_optimizers`, and return a tuple containing the optimizer and LR scheduler objects.
## Run inference with a pretrained model

You can use the hi-ml runner in inference-only mode by switching on the `--run_inference_only` flag and specifying the model weights via the `--src_checkpoint` argument. With this flag, the model will be evaluated on the test set only. There is also an option for evaluating the model on a full dataset, described further below.
### Specifying the checkpoint to use

When running inference on a trained model, you need to provide a model checkpoint that should be used. This is done via the `--src_checkpoint` argument, which supports three types of checkpoints:

- A local path where the checkpoint is stored: `--src_checkpoint=local/path/to/my_checkpoint/model.ckpt`
- A remote URL from where to download the weights: `--src_checkpoint=https://my_checkpoint_url.com/model.ckpt`
- An AzureML run id where checkpoints are saved in `outputs/checkpoints`. For this use case, you can experiment with different checkpoints by setting `--src_checkpoint` according to the format `<azureml_run_id>:<optional/custom/path/to/checkpoints/><filename.ckpt>`. If no custom path is provided (e.g., `--src_checkpoint=AzureML_run_id:best.ckpt`), the checkpoints are assumed to be saved in the default checkpoints folder `outputs/checkpoints`. If no filename is provided (e.g., `--src_checkpoint=AzureML_run_id`), the last epoch checkpoint `outputs/checkpoints/last.ckpt` will be loaded.

Refer to Checkpoints Utils for more details on how checkpoints are parsed.
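To make the three formats concrete, here is an illustrative sketch of how such a value could be classified. This is not hi-ml's actual parsing code (that lives in its checkpoint utilities); `parse_src_checkpoint` is a hypothetical helper:

```python
DEFAULT_CHECKPOINT_FOLDER = "outputs/checkpoints"


def parse_src_checkpoint(src: str):
    """Classify a --src_checkpoint value as a URL, a local path, or an
    AzureML run id with an optional checkpoint path and filename."""
    if src.startswith(("http://", "https://")):
        return ("url", None, src)
    if ":" not in src and src.endswith(".ckpt"):
        return ("local", None, src)
    # Otherwise: "<run_id>" or "<run_id>:<path/to/><filename.ckpt>"
    run_id, _, rest = src.partition(":")
    if not rest:
        # No filename given: fall back to the last-epoch checkpoint
        rest = "last.ckpt"
    if "/" not in rest:
        # No custom path given: assume the default checkpoints folder
        rest = f"{DEFAULT_CHECKPOINT_FOLDER}/{rest}"
    return ("run", run_id, rest)
```

With this sketch, `AzureML_run_id:best.ckpt` resolves to `outputs/checkpoints/best.ckpt` inside that run, and a bare `AzureML_run_id` resolves to `outputs/checkpoints/last.ckpt`.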
### Running inference on the test set

When supplying the flag `--run_inference_only` on the command line, no model training will be run, and only inference on the test set will be done:

- The model weights will be loaded from the location specified by `--src_checkpoint`.
- A PyTorch Lightning `Trainer` object will be instantiated.
- The test set will be read out from the data module specified by the `get_data_module` method of the `LightningContainer` object.
- The model will be evaluated on the test set, by running `trainer.test`. Any special logic to use during the test step will need to be added to the model's `test_step` method.
Running the following command line will run inference using the `MyContainer` model, with weights from the checkpoint saved in the AzureML run `MyContainer_XXXX_yyyy` at the best validation loss epoch (`outputs/checkpoints/best_val_loss.ckpt`):

```
himl-runner --model=MyContainer --run_inference_only --src_checkpoint=MyContainer_XXXX_yyyy:best_val_loss.ckpt
```
### Running inference on a full dataset

When supplying the flag `--mode=eval_full` on the command line, no model training will be run, and the model will be evaluated on a dataset different from the training/validation/test dataset. This dataset is loaded via the `get_eval_data_module` method of the container.

- The model weights will be loaded from the location specified by `--src_checkpoint`.
- A PyTorch Lightning `Trainer` object will be instantiated.
- The test set will be read out from the data module specified by the `get_eval_data_module` method of the `LightningContainer` object. The data module itself can read data from a mounted Azure dataset, which will be made available for the container at the path `self.local_datasets`. In a typical use case, all the data in that dataset will be put into the `test_dataloader` field of the data module.
- The model will be evaluated on the test set, by running `trainer.test`. Any special logic to use during the test step will need to be added to the model's `test_step` method.
Running the following command line will run inference using the `MyContainer` model, with weights from the checkpoint saved in the AzureML run `MyContainer_XXXX_yyyy` at the best validation loss epoch (`outputs/checkpoints/best_val_loss.ckpt`):

```
himl-runner --model=MyContainer --src_checkpoint=MyContainer_XXXX_yyyy:best_val_loss.ckpt --mode=eval_full --azure_datasets=my_new_dataset
```
The example code snippet here shows how to add a method that reads the inference dataset. In this example, we assume that the `MyDataModule` class has an argument `splits` that specifies the fraction of data to go into the training, validation, and test data loaders.

```python
class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.azure_datasets = ["folder_name_in_azure_blob_storage"]
        self.local_datasets = [Path("/some/local/path")]
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset, splits=(0.7, 0.2, 0.1))

    def get_eval_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset, splits=(0.0, 0.0, 1.0))
```
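The effect of such fractional splits can be illustrated with a small standard-library sketch. `split_by_fractions` is a hypothetical helper, independent of hi-ml and of whatever splitting logic `MyDataModule` actually uses:

```python
from typing import Sequence, Tuple


def split_by_fractions(items: Sequence, splits: Tuple[float, float, float]):
    """Split a sequence into train/val/test portions by fractions.
    With splits=(0.0, 0.0, 1.0), everything lands in the test portion."""
    n = len(items)
    train_end = int(n * splits[0])
    val_end = train_end + int(n * splits[1])
    return items[:train_end], items[train_end:val_end], items[val_end:]
```

With ten items and `splits=(0.7, 0.2, 0.1)`, this yields portions of size 7, 2 and 1; with `splits=(0.0, 0.0, 1.0)`, as in `get_eval_data_module` above, all ten items end up in the test portion.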
## Resume training from a given checkpoint

Analogously, you can resume training by setting `--src_checkpoint` and `--resume_training` to train a model for longer. The PyTorch Lightning trainer will initialize the Lightning module from the given checkpoint, here corresponding to the best validation loss epoch, as set in the following command line:

```
himl-runner --model=MyContainer --cluster=my_cluster_name --src_checkpoint=MyContainer_XXXX_yyyy:best_val_loss.ckpt --resume_training
```

Warning: When resuming training, make sure to set `container.max_epochs` to a value greater than the last epoch of the specified checkpoint. A misconfiguration exception will be raised otherwise:

```
pytorch_lightning.utilities.exceptions.MisconfigurationException: You restored a checkpoint with current_epoch=19, but you have set Trainer(max_epochs=4).
```
## Logging to AzureML when running outside AzureML

The runner offers the ability to log metrics to AzureML, even if the present training is not running inside of AzureML. This adds an additional level of traceability for runs on GPU VMs, where there is otherwise no record of any past training.

You can trigger this behaviour by specifying the `--log_from_vm` flag. For the `HelloWorld` model, this looks like:

```
himl-runner --model=HelloWorld --log_from_vm
```

For logging to work, you need to have a `config.json` file in the current working directory (or one of its parent folders) that specifies the AzureML workspace itself. When starting the runner, you will be asked to authenticate to AzureML.
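The `config.json` file follows the standard AzureML workspace configuration format; the values below are placeholders for your own workspace details:

```json
{
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my_resource_group",
    "workspace_name": "my_workspace"
}
```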
There are two additional flags that can be used to control the logging behaviour:

- The `--experiment` flag sets which AzureML experiment to log to. By default, the experiment name will be the name of the model class (`HelloWorld` in the above example).
- The `--tag` flag sets the display name for the AzureML run. You can use that to give your run a memorable name, and later easily find it in the AzureML UI.

The following command will log to the experiment `my_experiment`, in a run that is labelled `my_first_run` in the UI:

```
himl-runner --model=HelloWorld --log_from_vm --experiment=my_experiment --tag=my_first_run
```
## Starting experiments with different seeds

To assess the variability of metrics, it is often useful to run the same experiment multiple times with different seeds. The runner has built-in functionality to do this. When adding the command-line flag `--different_seeds=3`, your experiment will be run 3 times with seeds 0, 1 and 2. This is equivalent to starting the runner with the arguments `--random_seed=0`, `--random_seed=1` and `--random_seed=2`.

These runs will be started in parallel in AzureML via the HyperDrive framework. It is not possible to run with different seeds on a local machine, other than by manually starting runs with `--random_seed=0` etc.
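The expansion described above can be pictured with a one-line sketch. `seed_arguments` is a hypothetical helper, not part of the runner's API:

```python
def seed_arguments(num_seeds: int) -> list:
    """Expand --different_seeds=N into the equivalent list of
    --random_seed arguments, one per run."""
    return [f"--random_seed={seed}" for seed in range(num_seeds)]
```

For instance, `seed_arguments(3)` yields `["--random_seed=0", "--random_seed=1", "--random_seed=2"]`, matching the three HyperDrive child runs that `--different_seeds=3` starts.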
## Common problems with running in AML

- `"Your total snapshot size exceeds the limit <SNAPSHOT_LIMIT>"`. Cause: The size of your source directory is larger than the limit that AML sets for snapshots. Solution: Check for cache files, log files or other files that are not necessary for running your experiment, and add them to a `.amlignore` file in the root directory. Alternatively, you can see the Azure ML documentation for instructions on increasing this limit, although doing so will make your jobs slower.
- `"FileNotFoundError"`. Possible cause: Symlinked files. Azure ML SDK v2 will resolve the symlink and attempt to upload the resolved file. Solution: Remove symlinks from any files that should be uploaded to Azure ML.