Documentation for the Health Intelligence Machine Learning toolbox hi-ml
This toolbox helps to simplify and streamline work on deep learning models for healthcare and life sciences, by providing tested components (data loaders, pre-processing), deep learning models, and cloud integration tools.
The hi-ml toolbox provides:
- Functionality to easily run Python code in Azure Machine Learning services
- Low-level and high-level building blocks for Machine Learning / AI researchers and practitioners
First steps: How to run your Python code in the cloud
The simplest use case for the hi-ml toolbox is taking a script that you developed and running it inside of
Azure Machine Learning (AML) services. This can be helpful because the cloud gives you access to massive GPU
resources, lets you consume vast datasets, and lets you use multiple machines at the same time for distributed training.
Setting up AzureML
You need to have an AzureML workspace in your Azure subscription.
Download the config file from your AzureML workspace, as described here. Put this file (it should be called config.json) into the folder where your script lives, or one of its parent folders. You can use parent folders up to the last parent that is still included in the PYTHONPATH environment variable: hi-ml will try to be smart and search through all folders that it thinks belong to your current project.
Using the AzureML integration layer
Consider a simple use case where you have a Python script that does something - this could be training a model or pre-processing some data. The hi-ml package can help you easily run that on Azure Machine Learning (AML) services.
Here is an example script that reads images from a folder, resizes and saves them to an output folder:
from pathlib import Path

if __name__ == '__main__':
    input_folder = Path("/tmp/my_dataset")
    output_folder = Path("/tmp/my_output")
    for file in input_folder.glob("*.jpg"):
        # read_image and write_image stand in for your own image I/O helpers
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_folder / file.name)
Doing that at scale can take a long time. We’d like to run that script in AzureML, consume the data from a folder in blob storage, and write the results back to blob storage, so that we can later use it as an input for model training.
You can achieve that by adding a call to submit_to_azure_if_needed
from the hi-ml
package:
from pathlib import Path
from health_azure import submit_to_azure_if_needed

if __name__ == '__main__':
    current_file = Path(__file__)
    run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12",
                                         input_datasets=["images123"],
                                         # Omit this line if you don't create an output dataset (for example, in
                                         # model training scripts)
                                         output_datasets=["images123_resized"],
                                         default_datastore="my_datastore")
    # When running in AzureML, run_info.input_datasets and run_info.output_datasets will be populated,
    # and point to the data coming from blob storage. For runs outside AML, the paths will be None.
    # Replace the None with a meaningful path, so that we can still run the script easily outside AML.
    input_dataset = run_info.input_datasets[0] or Path("/tmp/my_dataset")
    output_dataset = run_info.output_datasets[0] or Path("/tmp/my_output")
    files_processed = []
    for file in input_dataset.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_dataset / file.name)
        files_processed.append(file.name)
    # Any other files that you would not consider an "output dataset", like metrics, etc, should be written to
    # a folder "./outputs". Any files written into that folder will later be visible in the AzureML UI.
    # run_info.output_folder already points to the correct folder.
    stats_file = run_info.output_folder / "processed_files.txt"
    stats_file.write_text("\n".join(files_processed))
Once these changes are in place, you can submit the script to AzureML by supplying an additional --azureml flag on the commandline, like python myscript.py --azureml.
Note that you do not need to modify the argument parser of your script to recognize the --azureml flag.
Essential arguments to submit_to_azure_if_needed
When calling submit_to_azure_if_needed, you can supply the following parameters:
- compute_cluster_name (mandatory): The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines. See here for documentation.
- entry_script: The script that should be run. If omitted, the hi-ml package will assume that you would like to submit the script that is presently running, given in sys.argv[0].
- snapshot_root_directory: The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over. This defaults to the current working directory, but can be one of its parents. If you would like to explicitly skip some folders inside the snapshot_root_directory, then use ignored_folders to specify those.
- conda_environment_file: The Conda configuration file that describes which packages are necessary for your script to run. If omitted, the hi-ml package searches for a file called environment.yml in the current folder or its parents.
You can also supply an input dataset. For data pre-processing scripts, you can add an output dataset (omit this for ML training scripts). To use datasets, you need to provision a data store in your AML workspace that points to your training data in blob storage. This is described here.
- input_datasets=["images123"] in the code above means that the script will consume all data in folder images123 in blob storage as the input. The folder must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in this folder as an AML dataset.
- output_datasets=["images123_resized"] means that the script will create a temporary folder when running in AML, and the data written to that folder will be uploaded to blob storage, in the given data store.
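For orientation, the sketch below shows a single call that sets the arguments described above explicitly. All cluster, datastore, dataset and file names are placeholders that need to be replaced with values from your own workspace.

from pathlib import Path
from health_azure import submit_to_azure_if_needed

# All names are placeholders; replace them with values from your own workspace.
run_info = submit_to_azure_if_needed(
    compute_cluster_name="preprocess-ds12",          # mandatory: the AzureML compute cluster to run on
    entry_script=Path("myscript.py"),                # defaults to the script that is currently running
    snapshot_root_directory=Path("."),               # all code in this folder is uploaded to AzureML
    ignored_folders=["outputs", "logs"],             # folders to exclude from the snapshot
    conda_environment_file=Path("environment.yml"),  # defaults to an environment.yml found in parent folders
    default_datastore="my_datastore",
    input_datasets=["images123"],
    output_datasets=["images123_resized"])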
For more examples, please see examples.md. For more details about datasets, see here.
Additional arguments you should know about
submit_to_azure_if_needed has a large number of arguments; please check the API documentation for an exhaustive list. The particularly helpful ones are listed below.
- experiment_name: All runs in AzureML are grouped in "experiments". By default, the experiment name is determined by the name of the script you submit, but you can specify a name explicitly with this argument.
- environment_variables: A dictionary with the contents of all environment variables that should be set inside the AzureML run, before the script is started.
- docker_base_image: This specifies the name of the Docker base image to use for creating the Python environment for your script. The amount of memory to allocate for Docker is given by docker_shm_size.
- num_nodes: The number of nodes on which your script should run. This is essential for distributed training.
- tags: A dictionary mapping from string to string, with additional tags that will be stored on the AzureML run. This is helpful to add metadata about the run for later use.
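For illustration, here is a sketch that sets these arguments in a single call; every value below is invented and needs to be replaced with settings that match your workspace.

from health_azure import submit_to_azure_if_needed

# All values are examples only.
run_info = submit_to_azure_if_needed(
    compute_cluster_name="training-nd24",            # hypothetical GPU cluster
    experiment_name="image_preprocessing",           # group runs under this experiment name
    environment_variables={"MY_FLAG": "1"},          # set inside the AzureML run before the script starts
    docker_base_image="mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04",  # example image
    docker_shm_size="400g",                          # shared memory for the Docker container
    num_nodes=2,                                     # run on 2 nodes for distributed training
    tags={"dataset": "images123", "purpose": "preprocessing"})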
Conda environments, Alternate pips, Private wheels
The function submit_to_azure_if_needed
tries to locate a Conda environment file in the current folder,
or in the Python path, with the name environment.yml
. The actual Conda environment file to use can be specified
directly with:
run_info = submit_to_azure_if_needed(
    ...
    conda_environment_file=conda_environment_file,
where conda_environment_file
is a pathlib.Path
or a string identifying the Conda environment file to use.
The basic use of Conda assumes that packages listed are published Conda packages or published Python packages on PyPI. However, during development, the Python package may be on Test.PyPI, or in some other location, in which case the alternative package location can be specified directly with:
run_info = submit_to_azure_if_needed(
    ...
    pip_extra_index_url="https://test.pypi.org/simple/",
Finally, it is possible to use a private wheel, if the package is only available locally with:
run_info = submit_to_azure_if_needed(
    ...
    private_pip_wheel_path=private_pip_wheel_path,
where private_pip_wheel_path
is a pathlib.Path
or a string identifying the wheel package to use. In this case,
this wheel will be copied to the AzureML environment as a private wheel.
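For illustration, the sketch below combines the three options described in this section in a single call; whether you need all of them at once depends on your setup, and the paths and URL are examples only.

from pathlib import Path
from health_azure import submit_to_azure_if_needed

# All values are examples; the wheel path in particular is hypothetical.
run_info = submit_to_azure_if_needed(
    compute_cluster_name="my-cluster",
    conda_environment_file=Path("environment.yml"),
    pip_extra_index_url="https://test.pypi.org/simple/",
    private_pip_wheel_path=Path("dist/my_package-0.1.0-py3-none-any.whl"))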
Connecting to Azure
Authentication
The hi-ml package supports two ways of authenticating with Azure.
The default is what is called “Interactive Authentication”. When you submit a job to Azure via hi-ml
, this will
use the credentials you used in the browser when last logging into Azure. If there are no credentials yet, you should
see instructions printed out to the console about how to log in using your browser.
We recommend using Interactive Authentication.
Alternatively, you can use a so-called Service Principal, for example within build pipelines.
Service Principal Authentication
A Service Principal is a form of generic identity or machine account. This is essential if you would like to submit training runs from code, for example from within an Azure pipeline. You can find more information about application registrations and service principal objects here.
If you would like to use a Service Principal, you will need to create it in Azure first, and then store 3 pieces of information in 3 environment variables - please see the instructions below. When all three environment variables are in place, your Azure submissions will automatically use the Service Principal to authenticate.
Creating the Service Principal
1. Navigate back to aka.ms/portal.
2. Navigate to App registrations (use the top search bar to find it).
3. Click on + New registration on the top left of the page.
4. Choose a name for your application, e.g. MyServicePrincipal, and click Register.
5. Once it is created, you will see your application in the list appearing under App registrations. This step might take a few minutes.
6. Click on the resource to access its properties. In particular, you will need the application ID. You can find this ID in the Overview tab (accessible from the list on the left of the page).
7. Create an environment variable called HIML_SERVICE_PRINCIPAL_ID, and set its value to the application ID you just saw.
8. You need to create an application secret to access the resources managed by this service principal. On the pane on the left, find Certificates & Secrets. Click on + New client secret (bottom of the page) and note down your token. Warning: this token is only displayed once, when it is created; you will not be able to display it again later.
9. Create an environment variable called HIML_SERVICE_PRINCIPAL_PASSWORD, and set its value to the token you just added.
Providing permissions to the Service Principal
Now that your service principal is created, you need to give permission for it to access and manage your AzureML workspace. To do so:
1. Go to your AzureML workspace. To find it, you can type the name of your workspace in the search bar above.
2. On the Overview page, there is a link to the Resource Group that contains the workspace. Click on that.
3. On the Resource Group page, navigate to Access control. Then click on + Add > Add role assignment. A pane will appear on the right. Select Role > Contributor. In the Select field, type the name of your Service Principal and select it. Finish by clicking Save at the bottom of the pane.
Azure Tenant ID
The last remaining piece is the Azure tenant ID, which also needs to be available in an environment variable. To get that ID:
1. Log into Azure.
2. Via the search bar, find "Azure Active Directory" and open it.
3. In the overview, you will see a field "Tenant ID".
4. Create an environment variable called HIML_TENANT_ID, and set it to the tenant ID you just saw.
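As a quick sanity check, for example at the start of a build pipeline, a small sketch like the following can verify that all three variables are in place; the variable names are the ones defined in the steps above.

import os

# Minimal check that the three environment variables used for Service Principal authentication are set.
required = ("HIML_SERVICE_PRINCIPAL_ID", "HIML_SERVICE_PRINCIPAL_PASSWORD", "HIML_TENANT_ID")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Service Principal authentication requires these environment variables: {missing}")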
Datasets
Key concepts
We’ll first outline a few concepts that are helpful for understanding datasets.
Blob Storage
Firstly, there is Azure Blob Storage.
Each blob storage account has multiple containers - you can think of containers as big disks that store files.
The hi-ml
package assumes that your datasets live in one of those containers, and each top level folder corresponds
to one dataset.
AzureML Data Stores
Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described here. Data stores provide access to one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around in the code - rather, the credentials are stored in the data store once and for all.
You can view all data stores in your AzureML workspace by clicking on one of the bottom icons in the left-hand navigation bar of the AzureML studio.
One of these data stores is designated as the default data store.
AzureML Datasets
Thirdly, there are datasets. Again, this is a concept coming from Azure Machine Learning. A dataset is defined by
A data store
A set of files accessed through that data store
You can view all datasets in your AzureML workspace by clicking on one of the icons in the left-hand navigation bar of the AzureML studio.
Preparing data
To simplify usage, the hi-ml
package creates AzureML datasets for you. All you need to do is:
- Create a blob storage account for your data, and within it, a container for your data.
- Create a data store that points to that storage account, and store the credentials for the blob storage account in it.
From that point on, you can drop a folder of files in the container that holds your data. Within the hi-ml
package,
just reference the name of the folder, and the package will create a dataset for you, if it does not yet exist.
Using the datasets
The simplest way of specifying that your script uses a folder of data from blob storage is as follows: Add the
input_datasets
argument to your call of submit_to_azure_if_needed
like this:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore")
input_folder = run_info.input_datasets[0]
What will happen under the hood?
The toolbox will check if there is already an AzureML dataset called “my_folder”. If so, it will use that. If there is no dataset of that name, it will create one from all the files in blob storage in folder “my_folder”. The dataset will be created using the data store provided, “my_datastore”.
Once the script runs in AzureML, it will download the dataset “my_folder” to a temporary folder.
You can access this temporary location by
run_info.input_datasets[0]
, and read the files from it.
More complicated setups are described below.
Input and output datasets
Any run in AzureML can consume a number of input datasets. In addition, an AzureML run can also produce an output dataset (or even more than one).
Output datasets are helpful if you would like to run, for example, a script that transforms one dataset into another.
You can use that via the output_datasets
argument:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     output_datasets=["new_dataset"],
                                     default_datastore="my_datastore")
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
Your script can now read files from input_folder
, transform them, and write them to output_folder
. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.
Mounting and downloading
An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted, the files are accessed via the network once needed - this is very helpful for large datasets where downloads would create a long waiting time before the job starts.
Similarly, an output dataset can be uploaded at the end of the script, or it can be mounted. Mounting here means that all files will be written to blob storage already while the script runs (rather than at the end).
Note: If you are using mounted output datasets, you should NOT rename files in the output folder.
Mounting and downloading can be triggered by passing in DatasetConfig
objects for the input_datasets
argument,
like this:
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mounting=True)
output_dataset = DatasetConfig(name="new_dataset", datastore="my_datastore", use_mounting=True)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
                                     output_datasets=[output_dataset])
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
Local execution
For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML. Clearly, your script needs to be able to access data in those runs too.
There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the
DatasetConfig
objects:
from pathlib import Path
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              local_folder=Path("/datasets/my_folder_local"))
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset])
input_folder = run_info.input_datasets[0]
Secondly, if local_folder
is not specified, then the dataset will either be downloaded or mounted to a temporary folder locally, depending on the use_mounting
flag. The path to it will be available in run_info
as above.
input_folder = run_info.input_datasets[0]
Note that mounting the dataset locally is only supported on Linux because it requires the use of the native package libfuse, which must first be installed. Also, if running in a Docker container, it must be started with additional arguments. For more details see here: azureml.data.filedataset.mount.
Making a dataset available at a fixed folder location
Occasionally, scripts expect the input dataset at a fixed location, for example, data is always read from /tmp/mnist
.
AzureML has the capability to download/mount a dataset to such a fixed location. With the hi-ml
package, you can
trigger that behaviour via an additional option in the DatasetConfig
objects:
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              use_mounting=True,
                              target_folder="/tmp/mnist")
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset])
# input_folder will now be "/tmp/mnist"
input_folder = run_info.input_datasets[0]
This is also true when running locally - if local_folder
is not specified and an AzureML workspace can be found, then the dataset will be downloaded or mounted to the target_folder
.
Dataset versions
AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML workspace. The hi-ml toolbox always uses the latest version of a dataset unless you specify otherwise.
If you do need a specific version, use the version
argument in the DatasetConfig
objects:
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              version=7)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset])
input_folder = run_info.input_datasets[0]
Hyperparameter Search via Hyperdrive
HyperDrive runs
can start multiple AzureML jobs in parallel. This can be used for tuning hyperparameters, or executing multiple
training runs for cross validation. To use that with the hi-ml
package, simply supply a HyperDrive configuration
object as an additional argument. Note that this object needs to be created with an empty run_config
argument (this
will later be replaced with the correct run_config
that submits your script.)
The example below shows a hyperparameter search that aims to minimize the validation loss val_loss
, by choosing
one of three possible values for the learning rate commandline argument learning_rate
.
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from health_azure import submit_to_azure_if_needed
hyperdrive_config = HyperDriveConfig(
    run_config=ScriptRunConfig(source_directory=""),
    hyperparameter_sampling=GridParameterSampling(
        parameter_space={
            "learning_rate": choice([0.1, 0.01, 0.001])
        }),
    primary_metric_name="val_loss",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=5
)
submit_to_azure_if_needed(..., hyperdrive_config=hyperdrive_config)
For further examples, please check the example scripts here, and the HyperDrive documentation.
Using Cheap Low Priority VMs
By using Low Priority machines in AzureML, we can run training at greatly reduced costs (around 20% of the original price, see references below for details). This comes with the risk, though, of having the job interrupted and later re-started. This document describes the inner workings of Low Priority compute, and how to best make use of it.
Because the jobs can get interrupted, low priority machines are not suitable for production workload where time is critical. They do offer a lot of benefits though for long-running training jobs or large scale experimentation, that would otherwise be expensive to carry out.
Setting up the Compute Cluster
Jobs in Azure Machine Learning run in a "compute cluster". When creating a compute cluster, we can specify the size of the VM, the type and number of GPUs, etc. Doing this via the AzureML UI is described here. Doing it programmatically is described here.
One of the settings to tweak when creating the compute cluster is whether the machines are "Dedicated" or "Low Priority":
- Dedicated machines will be permanently allocated to your compute cluster. The VMs in a dedicated cluster are always available, unless the cluster is set up so that it removes idle machines. Jobs will not be interrupted.
- Low priority machines effectively make use of spare capacity in the data centers; you can think of them as "dedicated machines that are presently idle". They are available at a much lower price (around 20% of the price of a dedicated machine). These machines are made available to you until they are needed as dedicated machines somewhere else.
In order to get a compute cluster that operates at the lowest price point:
- Choose low priority machines.
- Set "Minimum number of nodes" to 0, so that the cluster removes all idle machines if no jobs are running.
For details on pricing, check this Azure price calculator and choose "Category: GPU". The price for low priority VMs is given in the "Spot" column.
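To make the programmatic route concrete, here is a sketch using the AzureML Python SDK; the cluster name, VM size and node counts are examples only, not recommendations.

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Sketch, assuming a config.json for your workspace can be found in the current folder or its parents.
workspace = Workspace.from_config()
provisioning_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_ND6s",       # example GPU VM size
    vm_priority="lowpriority",     # corresponds to the "Spot" pricing column
    min_nodes=0,                   # scale down to zero so idle machines are removed
    max_nodes=4)
cluster = ComputeTarget.create(workspace, "low-priority-cluster", provisioning_config)
cluster.wait_for_completion(show_output=True)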
Behaviour of Low Priority VMs
Jobs can be interrupted at any point; this is called "low priority preemption". When interrupted, the job simply stops - there is no signal that it could use to run cleanup code. All the files that the job has produced up to that point in the outputs and logs folders will be saved to the cloud.
At some later point, the job will be assigned a virtual machine again. When re-started, all the files that the job had
produced in its previous run will be available on disk again where they were before interruption, mounted at the same
path. That is, if the interrupted job wrote a file outputs/foo.txt
, this file will be accessible as outputs/foo.txt
also after the restart.
Note that all AzureML-internal log files that the job produced in a previous run will be overwritten
(this behaviour may change in the future). That is in contrast to the behaviour for metrics that the interrupted job had
saved to AzureML already (for example, metrics written by a call like Run.log("loss", loss_tensor.item())
):
Those metrics are already stored in AzureML, and will still be there when the job restarts. The re-started job will
then append to the metrics that had been written in the previous run. This typically shows as sudden jumps in metrics. In one such example, the learning rate was increasing for the first 6 or so epochs. Then the job got preempted, and started training from scratch, with the initial learning rate and schedule. Note that this behaviour is only an artifact of how the metrics are stored in AzureML; the actual training is doing the right thing.
How do you verify that your job got interrupted? Usually, you will see a warning displayed on the job page in the AzureML UI that says something along the lines of "Low priority compute preemption warning: a node has been preempted." You can also use kinks in metrics as another indicator that your job got preempted: sudden jumps in metrics, after which the metric follows a shape similar to the one at job start, usually indicate low priority preemption.
Note that a job can be interrupted more than once.
Best Practice Guide for Your Jobs
In order to make best use of low priority compute, your code needs to be made resilient to restarts. Essentially, this means that it should write regular checkpoints, and try to use those checkpoint files if they already exist. Examples of how to best do that are given below.
In addition, you need to bear in mind that the job can be interrupted at any moment, for example when it is busy uploading huge checkpoint files to Azure. When trying to upload again after restart, there can be resource collisions.
Writing and Using Recovery Checkpoints
When using PyTorch Lightning, you can add a checkpoint callback to your trainer, that ensures that you save the model
and optimizer to disk in regular intervals. This callback needs to be added to your Trainer
object. Note that these
recovery checkpoints need to be written to the outputs
folder, because only files in this folder get saved to Azure
automatically when the job gets interrupted.
When starting training, your code needs to check if there is already a recovery checkpoint present on disk. If so, training should resume from that point.
Here is a code snippet that illustrates all that:
import re
from pathlib import Path

import numpy as np
from health_ml.utils import AzureMLLogger
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

RECOVERY_CHECKPOINT_FILE_NAME = "recovery_"
CHECKPOINT_FOLDER = "outputs/checkpoints"


def get_latest_recovery_checkpoint():
    all_recovery_files = [f for f in Path(CHECKPOINT_FOLDER).glob(RECOVERY_CHECKPOINT_FILE_NAME + "*")]
    if len(all_recovery_files) == 0:
        return None
    # Get recovery checkpoint with highest epoch number
    recovery_epochs = [int(re.findall(r"[\d]+", f.stem)[0]) for f in all_recovery_files]
    idx_max_epoch = int(np.argmax(recovery_epochs))
    return str(all_recovery_files[idx_max_epoch])


recovery_checkpoint = ModelCheckpoint(dirpath=CHECKPOINT_FOLDER,
                                      filename=RECOVERY_CHECKPOINT_FILE_NAME + "{epoch}",
                                      period=10)

trainer = Trainer(default_root_dir="outputs",
                  callbacks=[recovery_checkpoint],
                  logger=[AzureMLLogger()],
                  resume_from_checkpoint=get_latest_recovery_checkpoint())
Additional Optimizers and Other State
In order to be resilient to interruption, your jobs need to save all their state to disk. In PyTorch Lightning training,
this would include all optimizers that you are using. The “normal” optimizer for model training is saved to the
checkpoint by Lightning already. However, you may be using callbacks or other components that maintain state. As an
example, training a linear head for self-supervised learning can be done in a callback, and that callback can have its
own optimizer. Such callbacks need to correctly implement the on_save_checkpoint
method to save their state to the
checkpoint, and on_load_checkpoint
to load it back in.
For more information about persisting state, check the PyTorch Lightning documentation .
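The exact callback hook signatures differ between PyTorch Lightning versions, so the sketch below illustrates the pattern on a LightningModule instead, whose on_save_checkpoint and on_load_checkpoint hooks receive the checkpoint dictionary directly. The linear head and its extra optimizer are hypothetical; training_step and configure_optimizers are omitted for brevity.

from typing import Any, Dict

import torch
from pytorch_lightning import LightningModule


class ModelWithLinearHead(LightningModule):
    """Sketch of a module that owns extra state: a separately trained linear head with its own
    optimizer. The head weights live in the normal state_dict, but the extra optimizer does not,
    so it has to be persisted explicitly to survive low-priority preemption."""

    def __init__(self) -> None:
        super().__init__()
        self.linear_head = torch.nn.Linear(128, 2)  # hypothetical head dimensions
        self.head_optimizer = torch.optim.Adam(self.linear_head.parameters())

    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        # Write the extra optimizer state into the checkpoint dictionary.
        checkpoint["head_optimizer"] = self.head_optimizer.state_dict()

    def on_load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        # Restore the extra optimizer state when resuming from a recovery checkpoint.
        self.head_optimizer.load_state_dict(checkpoint["head_optimizer"])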
Commandline tools
Run TensorBoard
From the command line, run the command
himl-tb
specifying one of
[--experiment] [--latest_run_file] [--run]
This will start a TensorBoard session, by default running on port 6006. To use an alternative port, specify this with --port
.
If --experiment
is provided, the most recent Run from this experiment will be visualised.
If --latest_run_file
is provided, the script will expect to find a RunId in this file.
Alternatively you can specify the Runs to visualise via --run
. This can be a single run id, or multiple ids separated by commas. This argument also accepts one or more run recovery ids, although these are not recommended since it is no longer necessary to provide an experiment name in order to recovery an AML Run.
By default, this tool expects that your TensorBoard logs live in a folder named ‘logs’ and will create a similarly named folder in your root directory. If your TensorBoard logs are stored elsewhere, you can specify this with the --log_dir
argument.
If you choose to specify --experiment
, you can also specify --num_runs
to view and/or --tags
to filter by.
If your AML config path is not ROOT_DIR/config.json, you must also specify --config_file
.
To see an example of how to create TensorBoard logs using PyTorch on AML, see the AML submitting script which submits the following pytorch sample script. Note that to run this, you'll need to create an environment with pytorch and tensorboard as dependencies, as a minimum. See an example conda environment. This will create an experiment named 'tensorboard_test' on your Workspace, with a single run. Go to outputs + logs -> outputs to see the tensorboard events file.
Download files from AML Runs
From the command line, run the command
himl-download
specifying one of
[--experiment] [--latest_run_file] [--run]
If --experiment
is provided, the most recent Run from this experiment will be downloaded.
If --latest_run_file
is provided, the script will expect to find a RunId in this file.
Alternatively you can specify the Run to download via --run
. This can be a single run id, or multiple ids separated by commas. This argument also accepts one or more run recovery ids, although these are not recommended since it is no longer necessary to provide an experiment name in order to recovery an AML Run.
The files associated with your Run will be downloaded to the location specified with --output_dir (by default ROOT_DIR/outputs).
If you choose to specify --experiment
, you can also specify --tags
to filter by.
If your AML config path is not ROOT_DIR/config.json
, you must also specify --config_file
.
Creating your own command line tools
When creating your own command line tools that interact with the Azure ML ecosystem, you may wish to use the
AmlRunScriptConfig
class for argument parsing. This gives you a quickstart way for accepting command line arguments to
specify the following
- experiment: a string representing the name of an Experiment, from which to retrieve AML runs
- tags: to filter the runs within the given experiment
- num_runs: to define the number of most recent runs to return from the experiment
- run: to instead define one or more run ids from which to retrieve runs (also supports the older format of run recovery ids, although these are obsolete now)
- latest_run_file: to instead provide a path to a file containing the id of your latest run, for retrieval
- config_path: to specify a config.json file in which your workspace settings are defined
You can extend this list of arguments by creating a child class that inherits from AmlRunScriptConfig.
Defining your own argument types
Additional arguments can have any of the following types: bool
, integer
, float
, string
, list
, class/class instance
with no additional work required. You can also define your own custom type, by providing a custom class in your code that
inherits from CustomTypeParam
. It must define 2 methods:
- _validate(self, x: Any): should raise a ValueError if x is not of the type you expect, and should also make a call to super()._validate(val).
- from_string(self, y: str): takes in the command line arg as a string (y) and returns an instance of the type that you want. For example, if your custom type is a tuple, this method should create a tuple from the input string and return that. An example of a custom type can be seen in our own custom type RunIdOrListParam, which accepts a string representing one or more run ids (or run recovery ids) and returns either a List or a single RunId object (or RunRecoveryId object if appropriate).
Example:
class EvenNumberParam(util.CustomTypeParam):
    """ Our custom type param for even numbers """

    def _validate(self, val: Any) -> None:
        if (not self.allow_None) and val is None:
            raise ValueError("Value must not be None")
        if val % 2 != 0:
            raise ValueError(f"{val} is not an even number")
        super()._validate(val)  # type: ignore

    def from_string(self, x: str) -> int:
        return int(x)


class MyScriptConfig(util.AmlRunScriptConfig):
    # example of a generic param
    simple_string: str = param.String(default="")
    # example of a custom param
    even_number = EvenNumberParam(2, doc="your choice of even number")
Downloading from/ uploading to Azure ML
All of the below functions will attempt to find a current workspace if running in Azure ML, or else will attempt to locate a 'config.json' file in the current directory and its parents. Alternatively, you can specify your own Workspace object or a path to a file containing the workspace settings.
Download files from an Azure ML Run
To download all files from an AML Run, given its run id, perform the following:
from pathlib import Path
from health_azure import download_files_from_run_id
run_id = "example_run_id_123"
output_folder = Path("path/to/save")
download_files_from_run_id(run_id, output_folder)
Here, "path/to/save" represents the folder in which we want the downloaded files to be stored. If you wish to restrict which files are downloaded, you can do so with the "prefix" parameter. E.g. prefix="outputs" will download all files within the "outputs" folder, if such a folder exists within your Run. E.g. if your run contains the files ["abc/def/1.txt", "abc/2.txt"] and you specify the prefix "abc" and the output_folder "my_outputs", you'll end up with the files ["my_outputs/abc/def/1.txt", "my_outputs/abc/2.txt"].
There is an additional parameter, "validate_checksum", which defaults to False. If True, the MD5 hash of the data arriving (in chunks) will be validated against that being sent.
Note that if your code is running in a distributed manner, files will only be downloaded onto nodes with local rank = 0. E.g. if you have 2 nodes each running 4 processes, the file will be downloaded by CPU/GPU 0 on each of the 2 nodes. All processes will be synchronized to only exit the downloading method once it has completed on all nodes/ranks.
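Putting the optional parameters together, a call might look like the sketch below; the run id and folder names are placeholders.

from pathlib import Path
from health_azure import download_files_from_run_id

# Placeholder run id and folders; "outputs" is used as the prefix filter described above.
download_files_from_run_id("example_run_id_123",
                           Path("my_outputs"),
                           prefix="outputs",
                           validate_checksum=True)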
Downloading checkpoint files from a run
To download checkpoint files from an Azure ML Run, perform the following:
from pathlib import Path
from health_azure import download_checkpoints_from_run_id
download_checkpoints_from_run_id("example_run_id_123", Path("path/to/checkpoint/directory"))
All files within the checkpoint directory will be downloaded into the folder specified by "path/to/checkpoint/directory".
Since checkpoint files are often large and therefore prone to corruption during download, by default, this function will validate the MD5 hash of the data downloaded (in chunks) compared to that being sent.
Note that if your code is running in a distributed manner, files will only be downloaded onto nodes with local rank = 0. E.g. if you have 2 nodes each running 4 processes, the file will be downloaded by CPU/GPU 0 on each of the 2 nodes. All processes will be synchronized to only exit the downloading method once it has completed on all nodes/ranks.
Downloading files from an Azure ML Datastore
To download data from an Azure ML Datastore within your Workspace, follow this example:
from pathlib import Path
from health_azure import download_from_datastore
download_from_datastore("datastore_name", "prefix", Path("path/to/output/directory"))
where “prefix” represents the path to the file(s) to be downloaded, relative to the datastore “datastore_name”.
Azure will search for files within the Datastore whose paths begin with this string.
If you wish to download multiple files from the same folder, set "prefix" to the path of that folder within the datastore.
This function takes additional parameters “overwrite” and “show_progress”. If True, overwrite will overwrite any existing local files with the same path. If False and there is a duplicate file, it will skip this file. If show_progress is set to True, the progress of the file download will be visible in the terminal.
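For illustration, the sketch below passes these flags explicitly; the datastore and folder names are placeholders.

from pathlib import Path
from health_azure import download_from_datastore

# All names are placeholders; "path/within/datastore" acts as the prefix described above.
download_from_datastore("datastore_name",
                        "path/within/datastore",
                        Path("local/output/folder"),
                        overwrite=False,      # skip files that already exist locally
                        show_progress=True)   # print download progress to the terminal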
Uploading files to an Azure ML Datastore
To upload data to an Azure ML Datastore within your workspace, perform the following:
from pathlib import Path
from health_azure import upload_to_datastore
upload_to_datastore("datastore_name", Path("path/to/local/data/folder"), Path("path/to/datastore/folder"))
Where “datastore_name” is the name of the registered Datastore within your workspace that you wish to upload to and “path/to/datastore/folder” is the relative path within this Datastore that you wish to upload data to. Note that the path to local data must be a folder, not a single path. The folder name will not be included in the remote path. E.g. if you specify the local_data_dir=”foo/bar” and that contains the files [“1.txt”, “2.txt”], and you specify the remote_path=”baz”, you would see the following paths uploaded to your Datastore: [“baz/1.txt”, “baz/2.txt”]
This function takes additional parameters “overwrite” and “show_progress”. If True, overwrite will overwrite any existing remote files with the same path. If False and there is a duplicate file, it will skip this file. If show_progress is set to True, the progress of the file upload will be visible in the terminal.
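As an illustration of these flags, the sketch below uploads a local folder while skipping files that already exist remotely; all names are placeholders.

from pathlib import Path
from health_azure import upload_to_datastore

# All names are placeholders. The local path must be a folder; its contents (not the folder itself)
# end up under the given path inside the datastore.
upload_to_datastore("datastore_name",
                    Path("foo/bar"),
                    Path("baz"),
                    overwrite=False,      # keep existing remote files with the same path
                    show_progress=True)   # print upload progress to the terminal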
Examples
Note: All examples below contain links to sample scripts that are also included in the repository.
The experience is optimized for use on readthedocs. When navigating to the sample scripts on the github UI,
you will only see the .rst
file that links to the .py
file. To access the .py
file, go to the folder that
contains the respective .rst
file.
Basic integration
The sample examples/1/sample.py is a script that takes an optional command line argument of a target value and prints all the prime numbers up to (but not including) this target. It is simply intended to demonstrate a long running operation that we want to run in Azure. Run it using e.g.
cd examples/1
python sample.py -n 103
The sample examples/2/sample.py shows the minimal modifications to run this in AzureML. Firstly create an AzureML workspace and download the config file, as explained here. The config file should be placed in the same folder as the sample script. A sample Conda environment file is supplied. Import the hi-ml package into the current environment. Finally add the following to the sample script:
from health_azure import submit_to_azure_if_needed
...
def main() -> None:
    _ = submit_to_azure_if_needed(
        compute_cluster_name="lite-testing-ds2",
        wait_for_completion=True,
        wait_for_completion_show_output=True)
Replace lite-testing-ds2
with the name of a compute cluster created within the AzureML workspace.
If this script is invoked as the first sample, e.g.
cd examples/2
python sample.py -n 103
then the output will be exactly the same. But if the script is invoked as follows:
cd examples/2
python sample.py -n 103 --azureml
then the function submit_to_azure_if_needed
will perform all the required actions to run this script in AzureML and exit. Note that:
- Code after submit_to_azure_if_needed is not run locally, but it is run in AzureML.
- The print statement prints to the AzureML console output and is available in the Output + logs tab of the experiment, in the 70_driver_log.txt file, and can be downloaded from there.
- The command line arguments are passed through (apart from --azureml) when running in AzureML.
- A new file, most_recent_run.txt, will be created containing an identifier of this AzureML run.
A sample script examples/2/results.py demonstrates how to programmatically download the driver log file.
Output files
The sample examples/3/sample.py demonstrates output file handling when running on AzureML. Because each run is performed in a separate VM or cluster, any file output is not generally preserved. In order to keep the output, it should be written to the outputs folder when running in AzureML. The AzureML infrastructure will preserve this, and it will be available for download from the outputs folder in the Output + logs tab.
Make the following additions:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(
...
parser.add_argument("-o", "--output", type=str, default="primes.txt", required=False, help="Output file name")
...
output = run_info.output_folder / args.output
output.write_text("\n".join(map(str, primes)))
When running locally submit_to_azure_if_needed
will create a subfolder called outputs
and then the output can be written to the file args.output
there. When running in AzureML the output will be available in the file args.output
in the Experiment.
A sample script examples/3/results.py demonstrates how to programmatically download the output file.
Output datasets
The sample examples/4/sample.py demonstrates output dataset handling when running on AzureML.
In this case, the following parameters are added to submit_to_azure_if_needed
:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(
    ...
    default_datastore="himldatasets",
    output_datasets=["himl_sample4_output"],
The default_datastore
is required if using the simplest configuration for an output dataset, to just use the blob container name. There is an alternative that doesn’t require the default_datastore
and allows a different datastore for each dataset:
from health_azure import DatasetConfig, submit_to_azure_if_needed
...
run_info = submit_to_azure_if_needed(
    ...
    output_datasets=[DatasetConfig(name="himl_sample4_output", datastore="himldatasets")]
Now the output folder is constructed as follows:
output_folder = run_info.output_datasets[0] or Path("outputs") / "himl_sample4_output"
output_folder.mkdir(parents=True, exist_ok=True)
output = output_folder / args.output
When running in AzureML run_info.output_datasets[0]
will be populated using the new parameter and the output will be written to that blob storage. When running locally run_info.output_datasets[0]
will be None and a local folder will be created and used.
A sample script examples/4/results.py demonstrates how to programmatically download the output dataset file.
For more details about datasets, see here
Input datasets
This example trains a simple classifier on a toy dataset, first creating the dataset files and then in a second script training the classifier.
The script examples/5/inputs.py is provided to prepare the csv files. Run the script to download the Iris dataset and create two CSV files:
cd examples/5
python inputs.py
The training script examples/5/sample.py is modified from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train_iris.py to work with input csv files. Start it to train the actual classifier, based on the data files that were just written:
cd examples/5
python sample.py
Including input files in the snapshot
When using very small datafiles (in the order of a few MB), the easiest way to get the input data to Azure is to include them in the set of (source) files that are uploaded to Azure. You can run the dataset creation script on your local machine, writing the resulting two files to the same folder where your training script is located, and then submit the training script to AzureML. Because the dataset files are in the same folder, they will automatically be uploaded to AzureML.
However, it is not ideal to have the input files in the snapshot: The size of the snapshot is limited to 25 MB. It is better to put the data files into blob storage and use input datasets.
Creating the dataset in AzureML
The suggested way of creating a dataset is to run a script in AzureML that writes an output dataset. This is particularly important for large datasets, to avoid the usually low bandwidth from a local machine to the cloud.
This is shown in examples/6/inputs.py:
This script prepares the CSV files in an AzureML run, and writes them to an output dataset called himl_sample6_input
.
The relevant code parts are:
run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    default_datastore="himldatasets",
    output_datasets=["himl_sample6_input"])
# The dataset files should be written into this folder:
dataset = run_info.output_datasets[0] or Path("dataset")
Run the script:
cd examples/6
python inputs.py --azureml
You can now modify the training script examples/6/sample.py to use the newly created dataset
himl_sample6_input
as an input. To do that, the following parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    default_datastore="himldatasets",
    input_datasets=["himl_sample6_input"])
When running in AzureML, the dataset will be downloaded before running the job. You can access the temporary folder where the dataset is available like this:
input_folder = run_info.input_datasets[0] or Path("dataset")
The part behind the or
statement is only necessary to keep a reasonable behaviour when running outside of AzureML:
When running in AzureML, run_info.input_datasets[0] will be populated using the input dataset specified in the call to submit_to_azure_if_needed, and the input will be downloaded from blob storage. When running locally,
run_info.input_datasets[0]
will be None
and a local folder should be populated and used.
The default_datastore
is required if using the simplest configuration for an input dataset. There are
alternatives that do not require the default_datastore
and allow a different datastore for each dataset, for example:
from health_azure import DatasetConfig, submit_to_azure_if_needed
...
run_info = submit_to_azure_if_needed(
    ...
    input_datasets=[DatasetConfig(name="himl_sample7_input", datastore="himldatasets")],
For more details about datasets, see here
Uploading the input files manually
An alternative to writing the dataset in AzureML (as suggested above) is to create them on the local machine, and upload them manually directly to Azure blob storage.
This is shown in examples/7/inputs.py: This script prepares the CSV files
and uploads them to blob storage, in a folder called himl_sample7_input
. Run the script:
cd examples/7
python inputs_via_upload.py
As in the above example, you can now modify the training script examples/7/sample.py to use
an input dataset that has the same name as the folder where the files just got uploaded. In this case, the following
parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
    ...
    default_datastore="himldatasets",
    input_datasets=["himl_sample7_input"],
Hyperdrive
The sample examples/8/sample.py demonstrates adding hyperparameter tuning. This shows the same hyperparameter search as in the AzureML sample.
Make the following additions:
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.train.hyperdrive.sampling import RandomParameterSampling
...
def main() -> None:
    param_sampling = RandomParameterSampling({
        "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
        "--penalty": choice(0.5, 1, 1.5)
    })
    hyperdrive_config = HyperDriveConfig(
        run_config=ScriptRunConfig(source_directory=""),
        hyperparameter_sampling=param_sampling,
        primary_metric_name='Accuracy',
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=12,
        max_concurrent_runs=4)
    run_info = submit_to_azure_if_needed(
        ...
        hyperdrive_config=hyperdrive_config)
Note that this does not make sense to run locally; it should always be run in AzureML. When invoked with:
cd examples/8
python sample.py --azureml
this will perform a Hyperdrive run in AzureML, i.e. there will be 12 child runs, each randomly drawing from the parameter sample space. AzureML can plot the metrics from the child runs, but to do that, some small modifications are required.
Add in:
run = run_info.run
...
args = parser.parse_args()
run.log('Kernel type', str(args.kernel))
run.log('Penalty', float(args.penalty))
...
print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
run.log('Accuracy', float(accuracy))
and these metrics will be displayed on the child runs tab in the Experiment page on AzureML.
Controlling when to submit to AzureML and when not
By default, the hi-ml package assumes that you supply a commandline argument --azureml (that can be anywhere on the commandline) to trigger a submission of the present script to AzureML. If you wish to control it via a different flag coming out of your own argument parser, use the submit_to_azureml argument of the function health.azure.himl.submit_to_azure_if_needed.
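For illustration, here is a hedged sketch that drives submission from a flag in your own argument parser; the flag name --cloud and the cluster name are invented for this example.

import argparse

from health_azure import submit_to_azure_if_needed

# Sketch: use your own flag instead of the default --azureml. The flag and cluster names are examples.
parser = argparse.ArgumentParser()
parser.add_argument("--cloud", action="store_true", help="Submit this script to AzureML")
args, unknown = parser.parse_known_args()

run_info = submit_to_azure_if_needed(compute_cluster_name="my-cluster",
                                     submit_to_azureml=args.cloud)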
Training with k-fold cross validation in Azure ML
It is possible to create a parent run on Azure ML that is associated with one or more child runs (see here for further information). This is useful in circumstances such as k-fold cross-validation, where individual child runs perform validation on different data splits. When a HyperDriveRun is created in Azure ML, it follows this same principle and generates multiple child runs, associated with one parent.
To train with k-fold cross validation using submit_to_azure_if_needed
, you must do two things.
1. Call the helper function create_crossval_hyperdrive_config to create an AML HyperDriveConfig object representing your parent run. It will have one child run for each of the k-fold splits you request, as follows:

       from health_azure import create_crossval_hyperdrive_config

       hyperdrive_config = create_crossval_hyperdrive_config(
           num_splits,
           cross_val_index_arg_name=cross_val_index_arg_name,
           metric_name=metric_name)

   where:
   - num_splits is the number of k-fold cross validation splits you require.
   - cross_val_index_arg_name is the name of the argument given to each child run, whose value denotes which split that child represents. This parameter defaults to 'cross_validation_split_index', in which case, supposing you specified 2 cross validation splits, one child would receive the arguments ['--cross_validation_split_index' '0'] and the other would receive ['--cross_validation_split_index' '1']. It is up to you to then use these args to retrieve the correct split from your data.
   - metric_name represents the name of a metric that you will compare your child runs by. NOTE: the run will expect to find this metric, otherwise it will fail as described here. You can log this metric in your training script as follows:

       from azureml.core import Run

       # Example of logging a metric called <metric_name> to an AML Run.
       loss = <my_loss_calc>
       run_log = Run.get_context()
       run_log.log(metric_name, loss)
See the documentation here for further explanation.
2. The hyperdrive_config returned above must be passed into the function submit_to_azure_if_needed as follows:

       run_info = submit_to_azure_if_needed(
           ...
           hyperdrive_config=hyperdrive_config)
   This will create a parent (HyperDrive) Run with num_splits children, each one associated with a different data split.
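Inside your training script, you then need to read the split index that each child run receives and select the corresponding fold. A minimal sketch is below, assuming the default argument name cross_validation_split_index and a hypothetical helper get_split that returns the data for one fold.

import argparse

# Sketch only: the argument name must match cross_val_index_arg_name used above,
# and get_split is a hypothetical helper that returns the data for one fold.
parser = argparse.ArgumentParser()
parser.add_argument("--cross_validation_split_index", type=int, default=0)
args, _ = parser.parse_known_args()

train_data, val_data = get_split(args.cross_validation_split_index)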
Retrieving the aggregated results of a cross validation/ HyperDrive run
You can retrieve a Pandas DataFrame of the aggregated results from your cross validation run as follows:
from health_azure import aggregate_hyperdrive_metrics
df = aggregate_hyperdrive_metrics(run_id, child_run_arg_name)
where:
- run_id is a string representing the id of your HyperDriveRun. Note that this must be an instance of an AML HyperDriveRun.
- child_run_arg_name is a string representing the name of the argument given to each child run to denote its position relative to other child runs (e.g. this arg could equal 'child_run_index', in which case each of your child runs should expect to receive the arg '--child_run_index' with a value <= the total number of child runs).
If your HyperDrive run has 2 children, each logging the metrics epoch, accuracy and loss, the result would look like this:
| | 0 | 1 |
|--------------|-----------------|--------------------|
| epoch | [1, 2, 3] | [1, 2, 3] |
| accuracy | [0.7, 0.8, 0.9] | [0.71, 0.82, 0.91] |
| loss | [0.5, 0.4, 0.3] | [0.45, 0.37, 0.29] |
Here, each column is one of the splits / child runs, and each row is one of the metrics you have logged to the run.
It is possible to log rows and tables in Azure ML by calling run.log_table and run.log_row respectively. In this case, the DataFrame will contain a Dictionary entry instead of a list, where the keys are the table columns (or keywords provided to log_row), and the values are the table values. e.g.
| | 0 | 1 |
|----------------|------------------------------------------|-------------------------------------------|
| accuracy_table |{'epoch': [1, 2], 'accuracy': [0.7, 0.8]} | {'epoch': [1, 2], 'accuracy': [0.8, 0.9]} |
It is also possible to log plots in Azure ML by calling run.log_image and passing in a matplotlib plot. In this case, the DataFrame will contain a string representing the path to the artifact that is generated by AML (the saved plot in the Logs & Outputs pane of your run on the AML portal). E.g.
| | 0 | 1 |
|----------------|-----------------------------------------|---------------------------------------|
| accuracy_plot | aml://artifactId/ExperimentRun/dcid.... | aml://artifactId/ExperimentRun/dcid...|
Logging metrics when training models in AzureML
This section describes the basics of logging to AzureML, and how this can be simplified when using PyTorch Lightning. It also describes helper functions to make logging more consistent across your code.
Basics
The mechanics of writing metrics to an ML training run inside of AzureML are described here.
Using the hi-ml-azure
toolbox, you can simplify that like this:
from health_azure import RUN_CONTEXT
...
RUN_CONTEXT.log(name="name_of_the_metric", value=my_tensor.item())
Similarly, you can log strings (via the log_text method) or figures (via the log_image method); see the documentation.
Using PyTorch Lightning
The hi-ml
toolbox relies on pytorch-lightning
for a lot of its functionality.
Logging of metrics is described in detail
here
hi-ml
provides a Lightning-ready logger object to use with AzureML. You can add that to your trainer as you would
add a Tensorboard logger, and afterwards see all metrics in both your Tensorboard files and in the AzureML UI.
This logger can be added to the Trainer
object as follows:
from health_ml.utils import AzureMLLogger
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

tb_logger = TensorBoardLogger("logs/")
azureml_logger = AzureMLLogger()
trainer = Trainer(logger=[tb_logger, azureml_logger])
You do not need to make any changes to your logging code to write to both loggers at the same time. This means
that, if your code correctly writes to Tensorboard in a local run, you can expect the metrics to come out correctly
in the AzureML UI as well after adding the AzureMLLogger
.
Making logging consistent when training with PyTorch Lightning
A common problem of training scripts is that the calls to the logging methods tend to run out of sync.
The .log
method of a LightningModule
has a lot of arguments, some of which need to be set correctly when running
on multiple GPUs.
To simplify that, there is a function log_on_epoch
that turns synchronization across nodes on/off depending on the
number of GPUs, and always forces the metrics to be logged upon epoch completion. Use as follows:
from health_ml.utils import log_on_epoch
from pytorch_lightning import LightningModule
class MyModule(LightningModule):

    def training_step(self, *args, **kwargs):
        ...
        loss = my_loss(y_pred, y)
        log_on_epoch(self, loss)
        return loss
Logging learning rates
Logging learning rates is important for monitoring training, but again this can add overhead. To log learning rates easily and consistently, we suggest either of two options:
- Add a LearningRateMonitor callback to your trainer, as described here.
- Use the hi-ml function log_learning_rate.
The log_learning_rate function can be used at any point in the training code, like this:
from health_ml.utils import log_learning_rate
from pytorch_lightning import LightningModule
class MyModule(LightningModule):

    def training_step(self, *args, **kwargs):
        ...
        log_learning_rate(self, "learning_rate")
        loss = my_loss(y_pred, y)
        return loss
log_learning_rate
will log values from all learning rate schedulers, and all learning rates if a scheduler
returns multiple values. In this example, the logged metric will be learning_rate
if there is a single scheduler
that outputs a single LR, or learning_rate/1/0
to indicate the value coming from scheduler index 1, value index 0.
Performance Diagnostics
The hi-ml
toolbox offers several components to integrate with PyTorch Lightning based training workflows:
- The AzureMLProgressBar is a replacement for the default progress bar that the Lightning Trainer uses. Its output is more suitable for display in an offline setup like AzureML.
- The BatchTimeCallback can be added to the trainer to detect performance issues with data loading.
AzureMLProgressBar
The standard PyTorch Lightning progress bar is well suited for interactive training sessions on a GPU machine, but its output can get
confusing when run inside AzureML. The AzureMLProgressBar
class can replace the standard progress bar, and optionally
adds timestamps to each progress event. This makes it easier to later correlate training progress with, for example, low
GPU utilization showing in AzureML’s GPU monitoring.
Here’s a code snippet to add the progress bar to a PyTorch Lightning Trainer object:
from health_ml.utils import AzureMLProgressBar
from pytorch_lightning import Trainer
progress = AzureMLProgressBar(refresh_rate=100, print_timestamp=True)
trainer = Trainer(callbacks=[progress])
This produces progress information like this:
2021-10-20T06:06:07Z Training epoch 18 (step 94): 5/5 (100%) completed. 00:00 elapsed, total epoch time ~ 00:00
2021-10-20T06:06:07Z Validation epoch 18: 2/2 (100%) completed. 00:00 elapsed, total epoch time ~ 00:00
2021-10-20T06:06:07Z Training epoch 19 (step 99): 5/5 (100%) completed. 00:00 elapsed, total epoch time ~ 00:00
...
BatchTimeCallback
This callback can help diagnose issues with low performance of data loading. It captures the time between the end of a training or validation step, and the start of the next step. This is often indicative of the time it takes to retrieve the next batch of data: When the data loaders are not performant enough, this time increases.
The BatchTimeCallback
will detect minibatches where the estimated data loading time is too high, and print alerts.
These alerts will be printed at most 5 times per epoch, for a maximum of 3 epochs, to avoid cluttering the output.
Note that it is common for the first minibatch of data in an epoch to take a long time to load, because data loader processes need to spin up.
The callback will log a set of metrics:
timing/train/batch_time [sec] avg and timing/train/batch_time [sec] max: Average and maximum time that it takes for batches to train/validate
timing/train/batch_loading_over_threshold [sec] is the total time wasted per epoch in waiting for the next batch of data. This is computed by looking at all batches where the batch loading time was over the threshold max_batch_load_time_seconds (that is set in the constructor of the callback), and totalling the batch loading time for those batches.
timing/train/epoch_time [sec] is the time for an epoch to complete.
Caveats
In distributed training, the performance metrics will be collected at rank 0 only.
The time between the end of a batch and the start of the next batch is also impacted by other callbacks. If you have callbacks that are particularly expensive to run, for example because they actually have their own model training, the results of the
BatchTimeCallback
may be misleading.
Usage example
from health_ml.utils import BatchTimeCallback
from pytorch_lightning import Trainer
batchtime = BatchTimeCallback(max_batch_load_time_seconds=0.5)
trainer = Trainer(callbacks=[batchtime])
This would produce output like this:
Epoch 18 training: Loaded the first minibatch of data in 0.00 sec.
Epoch 18 validation: Loaded the first minibatch of data in 0.00 sec.
Epoch 18 training took 0.02sec, of which waiting for data took 0.01 sec total.
Epoch 18 validation took 0.00sec, of which waiting for data took 0.00 sec total.
Notes for developers
Creating a Conda environment
To create a separate Conda environment with all packages that hi-ml
requires for running and testing,
use the provided environment.yml
file. Create a Conda environment called himl
from that via
conda env create --file environment.yml
conda activate himl
Installing pyright
We are using static typechecking for our code via mypy
and pyright
. The latter requires a separate installation
outside the Conda environment. For WSL, these are the required steps (see also
here):
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
nvm install node
npm install -g pyright
Using specific versions of hi-ml in your Python environments
If you’d like to test specific changes to the hi-ml
package in your code, you can use two different routes:
You can clone the hi-ml repository on your machine, and use hi-ml in your Python environment via a local package install:
pip install -e <your_git_folder>/hi-ml
You can consume an early version of the package from test.pypi.org via pip:
pip install --extra-index-url https://test.pypi.org/simple/ hi-ml==0.1.0.post165
If you are using Conda, you can add an additional parameter for pip into the Conda environment.yml file like this:
name: foo
dependencies:
  - pip=20.1.1
  - python=3.7.3
  - pip:
    - --extra-index-url https://test.pypi.org/simple/
    - hi-ml==0.1.0.post165
Common things to do
The repository contains a makefile with definitions for common operations.
make check: Run flake8 and mypy on the repository.
make test: Run flake8 and mypy on the repository, then all tests via pytest.
make pip: Install all packages for running and testing in the current interpreter.
make conda: Update the hi-ml Conda environment and activate it.
Building documentation
To build the sphinx documentation, you must have sphinx and related packages installed
(see build_requirements.txt
in the repository root). Then run:
cd docs
make html
This will build all your documentation in docs/build/html
.
Setting up your AzureML workspace
In the browser, navigate to the AzureML workspace that you want to use for running your tests.
In the top right section, there will be a dropdown menu showing the name of your AzureML workspace. Expand that.
In the panel, there is a link “Download config file”. Click that.
This will download a file config.json. Move that file to the root folder of your hi-ml repository. The file name is already present in .gitignore, and will hence not be checked in.
Creating and Deleting Docker Environments in AzureML
Passing a docker_base_image into submit_to_azure_if_needed causes a new image to be built and registered in your workspace (see docs for more information).
To remove an environment, use the az ml environment delete function in the AzureML CLI (note that all the parameters need to be set, none are optional).
Testing
For all of the tests to work locally you will need to cache your AzureML credentials. One simple way to do this is to
run the example in src/health/azure/examples
(i.e. run python elevate_this.py --message='Hello World' --azureml
or
make example
) after editing elevate_this.py
to reference your compute cluster.
When running the tests locally, they can either be run against the source directly, or the source built into a package.
To run the tests against the source directly in the local src folder, ensure that there is no wheel in the dist folder (for example by running make clean). If a wheel is not detected, then the local src folder will be copied into the temporary test folder as part of the test process.
To run the tests against the source as a package, build it with make build. This will build the local src folder into a new wheel in the dist folder. This wheel will be detected and passed to AzureML as a private package as part of the test process.
Creating a New Release
To create a new package release, follow these steps:
Double-check that CHANGELOG.md is up-to-date: It should contain a section for the next package version with subsections Added/Changed/…
On the repository's GitHub page, click on "Releases", then "Draft a new release".
In the "Draft a new release" page, click "Choose a tag". In the text box, enter a (new) tag name that has the desired version number, plus a "v" prefix. For example, to create package version 0.12.17, create a tag v0.12.17. Then choose "+ Create new tag" below the text box.
Enter a "Release title" that highlights the main feature(s) of this new package version.
Click “Auto-generate release notes” to pull in the titles of the Pull Requests since the last release.
Before the auto-generated “What’s changed” section, add a few sentences that summarize what’s new.
Click “Publish release”
Contributing to this toolbox
We welcome all contributions that help us achieve our aim of speeding up ML/AI research in health and life sciences. Examples of contributions are
Data loaders for specific health & life sciences data
Network architectures and components for deep learning models
Tools to analyze and/or visualize data
All contributions to the toolbox need to come with unit tests, and will be reviewed when a Pull Request (PR) is started.
If in doubt, reach out to the core hi-ml
team before starting your work.
Please look through the existing folder structure to find a good home for your contribution.
Submitting a Pull Request
If you’d like to submit a PR to the codebase, please ensure you:
Include a brief description
Link to an issue, if relevant
Write unit tests for the code - see below for details.
Add appropriate documentation for any new code that you introduce
Ensure that you modified CHANGELOG.md and described your PR there.
Only publish your PR for review once you have a build that is passing. You can make use of the “Create as Draft” feature of GitHub.
Code style
We use flake8 as a linter, and mypy and pyright for static typechecking. These tools run as part of the PR build, and must run without errors for a contribution to be accepted.
mypy requires that all functions and methods carry type annotations, see the mypy documentation. A minimal example of the required annotations is sketched after this list.
We highly recommend running all those tools before pushing the latest changes to a PR. If you have make installed, you can run flake8 and mypy in one go via make check (from the repository root folder).
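As an illustration of the annotation style that mypy expects, here is a minimal, hypothetical function with full type annotations (the function itself is not part of the toolbox):
from typing import List, Optional

def mean_of_positive(values: List[float], default: Optional[float] = None) -> Optional[float]:
    """Return the mean of the positive entries in values, or default if there are none."""
    positives = [v for v in values if v > 0]
    if not positives:
        return default
    return sum(positives) / len(positives)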
Unit testing
DO write unit tests for each new function or class that you add.
DO extend unit tests for existing functions or classes if you change their core behaviour.
DO try your best to write unit tests that are fast. Very often, this can be done by reducing data size to a minimum. Also, it is helpful to avoid long-running integration tests, but try to test at the level of the smallest involved function.
DO ensure that your tests are designed in a way that they can pass on the local machine, even if they are relying on specific cloud features. If required, use unittest.mock to simulate the cloud features, and hence enable the tests to run successfully on your local machine (see the sketch below).
DO run all unit tests on your dev machine before submitting your changes. The test suite is designed to pass completely also outside of cloud builds.
DO NOT rely only on the test builds in the cloud (i.e., run tests locally before submitting). Cloud builds trigger AzureML runs on GPU machines that have a far higher CO2 footprint than your dev machine.
When fixing a bug, the suggested workflow is to first write a unit test that shows the invalid behaviour, and only then start to code up the fix.
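As an illustration of mocking cloud features, the sketch below replaces the AzureML RUN_CONTEXT with a mock so that a test can run entirely offline. The function under test and the metric values are invented for this sketch, and it assumes that the underlying AzureML Run object exposes a get_metrics method:
from unittest import mock

def get_run_metrics() -> dict:
    # Hypothetical production code: reads metrics from the current AzureML run.
    from health_azure import RUN_CONTEXT
    return RUN_CONTEXT.get_metrics()

def test_get_run_metrics() -> None:
    # Replace the AzureML Run object with a mock, so that the test needs no cloud access.
    with mock.patch("health_azure.RUN_CONTEXT") as mock_run:
        mock_run.get_metrics.return_value = {"loss": 0.1}
        assert get_run_metrics() == {"loss": 0.1}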
Correct Sphinx Documentation
Common mistakes when writing docstrings:
There must be a separating line between a function description and the documentation for its parameters.
In multi-line parameter descriptions, continuations on the next line must be indented.
Sphinx will merge the class description and the arguments of the constructor __init__. Hence, there is no need to write any text in the constructor, only the class's parameters.
Use >>> to include code snippets. PyCharm will run intellisense on those to make authoring easier.
To generate the Sphinx documentation on your dev machine, run make html in the ./docs folder, and then open ./docs/build/html/index.html
Example:
class Foo:
    """
    This is the class description.

    The following block will be pretty-printed by Sphinx. Note the space between >>> and the code!

    Usage example:
        >>> from module import Foo
        >>> foo = Foo(bar=1.23)
    """

    ANY_ATTRIBUTE = "what_ever."
    """Document class attributes after the attribute."""

    def __init__(self, bar: float = 0.5) -> None:
        """
        :param bar: This is a description for the constructor argument.
            Long descriptions should be indented.
        """
        self.bar = bar

    def method(self, arg: int) -> None:
        """
        Method description, followed by an empty line.

        :param arg: This is a description for the method argument.
            Long descriptions should be indented.
        """
Loading Images
There are many libraries available that can load png images. Simple examples were written using most of them, and their execution times were compared. The goal was to load a png file, either RGB or greyscale, into a torch.Tensor.
The tensor specification was as follows (a minimal check of this specification is sketched after the list):
shape [3, Height, Width] (for RGB images, in RGB order) or [1, Height, Width] (for greyscale images);
dtype float32;
scaled to between 0.0 and 1.0.
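A loaded image can be checked against this specification with a small helper; this is a minimal sketch, not part of the toolbox:
import torch

def check_image_tensor(t: torch.Tensor) -> None:
    """Assert that a loaded image tensor matches the specification above."""
    assert t.dtype == torch.float32
    assert t.dim() == 3 and t.shape[0] in (1, 3)  # shape (C, H, W) with 1 or 3 channels
    assert 0.0 <= float(t.min()) and float(t.max()) <= 1.0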
matplotlib
Two methods using matplotlib were compared. The first manipulates the numpy array from the image read before creating the torch tensor, the second uses torchvision to do the transformation.
from pathlib import Path
import matplotlib.image as mpimg
import numpy as np
import torch
import torchvision.transforms.functional as TF
def read_image_matplotlib(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with matplotlib and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    # https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imread.html
    # numpy_array is a numpy.array of shape: (H, W), (H, W, 3), or (H, W, 4)
    # where H = height, W = width
    numpy_array = mpimg.imread(input_filename)
    if len(numpy_array.shape) == 2:
        # if loaded a greyscale image, then it is of shape (H, W) so add in an extra axis
        numpy_array = np.expand_dims(numpy_array, 2)
    # transpose to shape (C, H, W)
    numpy_array = np.transpose(numpy_array, (2, 0, 1))
    torch_tensor = torch.from_numpy(numpy_array)
    return torch_tensor
def read_image_matplotlib2(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with matplotlib and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    # https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imread.html
    # numpy_array is a numpy.array of shape: (H, W), (H, W, 3), or (H, W, 4)
    # where H = height, W = width
    numpy_array = mpimg.imread(input_filename)
    torch_tensor = TF.to_tensor(numpy_array)
    return torch_tensor
OpenCV
Two methods using the Python interface to OpenCV were compared. The first manipulates the numpy array from the image read before creating the torch tensor, the second uses torchvision to do the transformation. Note that OpenCV loads images in BGR format, so they need to be transformed.
from pathlib import Path
import cv2
import numpy as np
import torch
import torchvision.transforms.functional as TF
def read_image_opencv(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with OpenCV and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    # https://docs.opencv.org/4.5.3/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56
    # numpy_array is a numpy.ndarray, in BGR format.
    numpy_array = cv2.imread(str(input_filename))
    numpy_array = cv2.cvtColor(numpy_array, cv2.COLOR_BGR2RGB)
    # the image is greyscale if all three channels are identical at every pixel
    is_greyscale = False not in \
        ((numpy_array[:, :, 0] == numpy_array[:, :, 1]) & (numpy_array[:, :, 1] == numpy_array[:, :, 2]))
    if is_greyscale:
        numpy_array = numpy_array[:, :, 0]
    if len(numpy_array.shape) == 2:
        # if loaded a greyscale image, then it is of shape (H, W) so add in an extra axis
        numpy_array = np.expand_dims(numpy_array, 2)
    numpy_array = np.float32(numpy_array) / 255.0
    # transpose to shape (C, H, W)
    numpy_array = np.transpose(numpy_array, (2, 0, 1))
    torch_tensor = torch.from_numpy(numpy_array)
    return torch_tensor
def read_image_opencv2(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with OpenCV and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    # https://docs.opencv.org/4.5.3/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56
    # numpy_array is a numpy.ndarray, in BGR format.
    numpy_array = cv2.imread(str(input_filename))
    numpy_array = cv2.cvtColor(numpy_array, cv2.COLOR_BGR2RGB)
    # the image is greyscale if all three channels are identical at every pixel
    is_greyscale = False not in \
        ((numpy_array[:, :, 0] == numpy_array[:, :, 1]) & (numpy_array[:, :, 1] == numpy_array[:, :, 2]))
    if is_greyscale:
        numpy_array = numpy_array[:, :, 0]
    torch_tensor = TF.to_tensor(numpy_array)
    return torch_tensor
Pillow
Pillow is one of the easiest libraries to use because torchvision has a function to convert directly from Pillow images.
from pathlib import Path
from PIL import Image
import torch
import torchvision.transforms.functional as TF
def read_image_pillow(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with pillow and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    pil_image = Image.open(input_filename)
    torch_tensor = TF.to_tensor(pil_image)
    return torch_tensor
SciPy
SciPy's deprecated image readers point users to imageio, which is used here. It is also easy to use because it loads images into a numpy array of the expected shape, so that it can easily be transformed into a torch tensor.
from pathlib import Path
import imageio
import torch
import torchvision.transforms.functional as TF
def read_image_scipy(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with imageio and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    numpy_array = imageio.imread(input_filename)
    torch_tensor = TF.to_tensor(numpy_array)
    return torch_tensor
SimpleITK
SimpleITK requires a two step process to load an image and extract the data as a numpy array, but it is then in the correct format.
from pathlib import Path
import SimpleITK as sitk
import torch
import torchvision.transforms.functional as TF
def read_image_sitk(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with SimpleITK and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    itk_image = sitk.ReadImage(str(input_filename))
    numpy_array = sitk.GetArrayFromImage(itk_image)
    torch_tensor = TF.to_tensor(numpy_array)
    return torch_tensor
scikit-image
scikit-image is also very simple to use, since it loads the image as a numpy array in the correct format.
from pathlib import Path
from skimage import io
import torch
import torchvision.transforms.functional as TF
def read_image_skimage(input_filename: Path) -> torch.Tensor:
    """
    Read an image file with scikit-image and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    numpy_array = io.imread(input_filename)
    torch_tensor = TF.to_tensor(numpy_array)
    return torch_tensor
numpy
For comparison, the png image data was saved in the numpy native data format and then reloaded.
from pathlib import Path
import numpy as np
import torch
import torchvision.transforms.functional as TF
def read_image_numpy(input_filename: Path) -> torch.Tensor:
    """
    Read a NumPy file and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    numpy_array = np.load(input_filename)
    torch_tensor = torch.from_numpy(numpy_array)
    return torch_tensor
torch
Again, for comparison, the png image data was saved in the torch tensor native data format and then reloaded.
from pathlib import Path
import torch
def read_image_torch2(input_filename: Path) -> torch.Tensor:
    """
    Read a Torch file with Torch and return a torch.Tensor.

    :param input_filename: Source image file path.
    :return: torch.Tensor of shape (C, H, W).
    """
    torch_tensor = torch.load(input_filename)
    return torch_tensor
Results
All the above methods were run against 122 small test images, repeated 10 times, so in total there were 1220 calls to each function.
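A benchmark along these lines could be set up as in the following sketch, assuming the reader functions defined above are in scope; the folder name and the chosen reader are placeholders, not the actual test setup:
from pathlib import Path
from timeit import default_timer

def time_reader(reader, image_files, repeats: int = 10) -> float:
    """Return the total wall-clock time for calling reader on every file, repeats times."""
    start = default_timer()
    for _ in range(repeats):
        for file in image_files:
            reader(file)
    return default_timer() - start

# Hypothetical usage: time one of the functions defined above on a folder of test images.
image_files = sorted(Path("test_images").glob("*.png"))
total_seconds = time_reader(read_image_matplotlib, image_files, repeats=10)
print(f"read_image_matplotlib: {total_seconds:.3f} s for {10 * len(image_files)} calls")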
RGB Images
For 61 RGB images of size 224 x 224 pixels and 61 of size 180 x 224 pixels, repeated 10 times, there are the following timings:
Function | Total time (s)
---|---
read_image_matplotlib | 9.81336
read_image_matplotlib2 | 9.96016
read_image_opencv | 12.4301
read_image_opencv2 | 12.6227
read_image_pillow | 16.2288
read_image_scipy | 17.9958
read_image_sitk | 63.6669
read_image_skimage | 18.273
read_image_numpy | 7.29741
read_image_torch2 | 7.07304
Greyscale Images
Similarly, with greyscale versions of the RGB images:
Function | Total time (s)
---|---
read_image_matplotlib | 8.32523
read_image_matplotlib2 | 8.26399
read_image_opencv | 11.6838
read_image_opencv2 | 11.7935
read_image_pillow | 15.7406
read_image_scipy | 17.9061
read_image_sitk | 71.8732
read_image_skimage | 18.0698
read_image_numpy | 7.94197
read_image_torch2 | 7.73153
The recommendation therefore is to use matplotlib mpimg.imread
to load the image and TF.to_tensor
to transform the numpy array to a torch tensor. This is almost as fast as loading the data directly in a native numpy or torch format.
Whole Slide Images
Computational Pathology works with image files that can be very large in size, up to many GB. These files may be too large to load entirely into memory at once, or at least too large to act as training data. Instead they may be split into multiple tiles of a much smaller size, e.g. 224x224 pixels, before being used for training. There are two popular libraries used for handling this type of image:
OpenSlide
cuCIM
but they both come with trade-offs and complications.
In development there is also tifffile, but this is untested.
OpenSlide
There is a Python interface for OpenSlide at openslide-python, but this first requires the installation of the OpenSlide library itself. This can be done on Ubuntu with:
apt-get install openslide-tools
On Windows follow the instructions here and make sure that the install directory is added to the system path.
Once the shared library/dlls are installed, install the Python interface with:
pip install openslide-python
cuCIM
cuCIM is much easier to install; it can be done entirely with the Python package cucim. However, there are the following caveats:
It requires a GPU, with NVIDIA driver 450.36+
It requires CUDA 11.0+
It supports only a subset of tiff image files.
The suitable AzureML base Docker images are therefore the ones containing cuda11
, and the compute instance must contain a GPU.
Performance
An exploratory set of scripts is at slide_image_loading for comparing loading images with OpenSlide or cuCIM, and performing tiling using both libraries.
Loading and saving at lowest resolution
Four test tiff files are used:
a 44.5 MB file with level dimensions: ((27648, 29440), (6912, 7360), (1728, 1840))
a 19.9 MB file with level dimensions: ((5888, 25344), (1472, 6336), (368, 1584))
a 5.5 MB file with level dimensions: ((27648, 29440), (6912, 7360), (1728, 1840)), but acting as a mask
a 2.1 MB file with level dimensions: ((5888, 25344), (1472, 6336), (368, 1584)), but acting as a mask
For OpenSlide the following code:
from openslide import OpenSlide

# input_file and output_file are defined elsewhere in the profiling script
with OpenSlide(str(input_file)) as img:
    count = img.level_count
    dimensions = img.level_dimensions
    print(f"level_count: {count}")
    print(f"dimensions: {dimensions}")
    for k, v in img.properties.items():
        print(k, v)
    region = img.read_region(location=(0, 0),
                             level=count - 1,
                             size=dimensions[count - 1])
    region.save(output_file)
took an average of 29ms to open the file, 88ms to read the region, and 243ms to save the region as a png.
For cuCIM the following code:
import cucim
import numpy as np
from PIL import Image

# input_file and output_file are defined elsewhere in the profiling script
img = cucim.CuImage(str(input_file))
count = img.resolutions['level_count']
dimensions = img.resolutions['level_dimensions']
print(f"level_count: {count}")
print(f"level_dimensions: {dimensions}")
print(img.metadata)
region = img.read_region(location=(0, 0),
                         size=dimensions[count - 1],
                         level=count - 1)
np_img_arr = np.asarray(region)
img2 = Image.fromarray(np_img_arr)
img2.save(output_file)
took an average of 369ms to open the file, 7ms to read the region and 197ms to save the region as a png, but note that it failed to handle the mask images.
Loading and saving as tiles at the medium resolution
Test code created tiles of size 224x224 pixels, loaded the mask images, and used occupancy levels to decide which tiles to create and save from level 1, the middle resolution. This was profiled against both images, as above; a sketch of the occupancy-based selection is shown after the timings below.
For cuCIM the total time was 4.7s, or 2.48s when retaining the tiles as a NumPy stack without saving them as pngs. cuCIM has the option of caching images, but this actually made performance slightly worse, possibly because the natural tile sizes in the original tiffs were larger than the requested tile size.
For OpenSlide the comparable total times were 5.7s and 3.26s.
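The occupancy-based selection could look roughly like the sketch below; the threshold, tile size, and mask layout are assumptions for illustration, not the exact code in slide_image_loading:
import numpy as np

def select_tiles(mask: np.ndarray, tile_size: int = 224, min_occupancy: float = 0.05) -> list:
    """Return (row, col) offsets of tiles whose foreground fraction in mask exceeds the threshold.

    mask is a 2D array in which non-zero values mark foreground (tissue) pixels.
    """
    selected = []
    height, width = mask.shape
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            tile = mask[y:y + tile_size, x:x + tile_size]
            occupancy = np.count_nonzero(tile) / tile.size
            if occupancy >= min_occupancy:
                selected.append((y, x))
    return selected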
health_azure Package
Functions
Creates an AzureML run configuration, that contains information about environment, multi node execution, and Docker.
Creates an AzureML ScriptRunConfig object, that holds the information about the snapshot, the entry script, and its arguments.
For a given Azure ML run id, first retrieve the Run, and then download all files, which optionally start with a given prefix.
Given an Azure ML run id, download all files from a given checkpoint directory within that run, to the path specified by output_path.
Download file(s) from an Azure ML Datastore that are registered within a given Workspace.
Finds an existing run in an experiment, based on a recovery ID that contains the experiment ID and the actual RunId.
Gets the name of the most recently executed AzureML run, instantiates that Run object and returns it.
Retrieve an Azure ML Workspace from one of several places.
Returns True if the given run is inside of an AzureML machine, or False if it is on a machine outside AzureML.
Sets the environment variables that PyTorch Lightning needs for multi-node training.
Splits a run ID into the experiment name and the actual run.
Starts an AzureML run on a given workspace, via the script_run_config.
Submit a folder to Azure, if needed and run it.
This is a barrier to use in distributed jobs.
Upload a folder to an Azure ML Datastore that is registered within a given Workspace.
Creates an Azure ML HyperDriveConfig object for running cross validation.
For a given HyperDrive run id, retrieves the metrics from each of its children and then aggregates it.
Classes
This class stores all information that a script needs to run inside and outside of AzureML.
Contains information to use AzureML datasets as inputs or outputs.
health_ml.utils Package
Functions
log_on_epoch: Write a dictionary with metrics and/or an individual metric as a name/value pair to the loggers of the given module.
log_learning_rate: Logs the learning rate(s) used by the given module.
Classes
AzureMLLogger: A Pytorch Lightning logger that stores metrics in the current AzureML run.
AzureMLProgressBar: A PL progress bar that works better in AzureML.
BatchTimeCallback: This callback provides tools to measure batch loading time and other diagnostic information.