Documentation for the Health Intelligence Machine Learning toolbox hi-ml
This toolbox helps to simplify and streamline work on deep learning models for healthcare and life sciences, by providing tested components (data loaders, pre-processing), deep learning models, and cloud integration tools.
The hi-ml toolbox provides:
- Functionality to easily run Python code in Azure Machine Learning services
- Low-level and high-level building blocks for Machine Learning / AI researchers and practitioners
First steps: How to run your Python code in the cloud
The simplest use case for the hi-ml
toolbox is taking a script that you developed, and running it inside of
Azure Machine Learning (AML) services. This can be helpful because the cloud gives you access to massive GPU
resources, you can consume vast datasets, and you can access multiple machines at the same time for distributed training.
Setting up AzureML
To run your code in the cloud, you need to have an AzureML workspace in your Azure subscription. Please follow the instructions here to create an AzureML workspace if you don’t have one yet.
Download the config file from your AzureML workspace, as described
here. Put this file (it
should be called config.json
) into the folder where your script lives, or one of its parent folders. You can use
parent folders up to the last parent that is still included in the PYTHONPATH
environment variable: hi-ml will try to be smart and search through all folders that it thinks belong to your current project.
Using the AzureML integration layer
Consider a simple use case, where you have a Python script that does something - this could be training a model,
or pre-processing some data. The hi-ml
package can help easily run that on Azure Machine Learning (AML) services.
Here is an example script that reads images from a folder, resizes and saves them to an output folder:
from pathlib import Path

# read_image and write_image stand in for your own image I/O helpers.
if __name__ == '__main__':
    input_folder = Path("/tmp/my_dataset")
    output_folder = Path("/tmp/my_output")
    for file in input_folder.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_folder / file.name)
Doing that at scale can take a long time. We’d like to run that script in AzureML, consume the data from a folder in blob storage, and write the results back to blob storage, so that we can later use it as an input for model training.
You can achieve that by adding a call to submit_to_azure_if_needed
from the hi-ml
package:
from pathlib import Path
from health_azure import submit_to_azure_if_needed

if __name__ == '__main__':
    current_file = Path(__file__)
    run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12",
                                         input_datasets=["images123"],
                                         # Omit this line if you don't create an output dataset (for example, in
                                         # model training scripts)
                                         output_datasets=["images123_resized"],
                                         default_datastore="my_datastore")
    # When running in AzureML, run_info.input_datasets and run_info.output_datasets will be populated,
    # and point to the data coming from blob storage. For runs outside AML, the paths will be None.
    # Replace the None with a meaningful path, so that we can still run the script easily outside AML.
    input_dataset = run_info.input_datasets[0] or Path("/tmp/my_dataset")
    output_dataset = run_info.output_datasets[0] or Path("/tmp/my_output")

    files_processed = []
    for file in input_dataset.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_dataset / file.name)
        files_processed.append(file.name)

    # Any other files that you would not consider an "output dataset", like metrics, etc, should be written to
    # a folder "./outputs". Any files written into that folder will later be visible in the AzureML UI.
    # run_info.output_folder already points to the correct folder.
    stats_file = run_info.output_folder / "processed_files.txt"
    stats_file.write_text("\n".join(files_processed))
Once these changes are in place, you can submit the script to AzureML by supplying an additional --azureml flag on the commandline, like python myscript.py --azureml.
Note that you do not need to modify the argument parser of your script to recognize the --azureml
flag.
Essential arguments to submit_to_azure_if_needed
When calling submit_to_azure_if_needed, you can supply the following parameters:
- compute_cluster_name (Mandatory): The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines. See here for documentation.
- entry_script: The script that should be run. If omitted, the hi-ml package will assume that you would like to submit the script that is presently running, given in sys.argv[0].
- snapshot_root_directory: The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over. This defaults to the current working directory, but can be one of its parents. If you would like to explicitly skip some folders inside the snapshot_root_directory, then use ignored_folders to specify those.
- conda_environment_file: The conda configuration file that describes which packages are necessary for your script to run. If omitted, the hi-ml package searches for a file called environment.yml in the current folder or its parents.
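For illustration, a call that sets all of these explicitly could look like the sketch below. The cluster name, paths and folder names are placeholders, not values the toolbox requires.

from pathlib import Path
from health_azure import submit_to_azure_if_needed

# All names and paths below are examples only - replace them with values from your own project.
run_info = submit_to_azure_if_needed(compute_cluster_name="my-gpu-cluster",
                                     entry_script=Path("scripts/preprocess.py"),
                                     snapshot_root_directory=Path.cwd(),
                                     ignored_folders=["notebooks", "tests"],
                                     conda_environment_file=Path("environment.yml"))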
You can also supply an input dataset. For data pre-processing scripts, you can add an output dataset (omit this for ML training scripts).
To use datasets, you need to provision a data store in your AML workspace that points to your training data in blob storage. This is described here.
- input_datasets=["images123"] in the code above means that the script will consume all data in folder images123 in blob storage as the input. The folder must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in this folder as an AML dataset.
- output_datasets=["images123_resized"] means that the script will create a temporary folder when running in AML for the job to write its data to, and upload the contents of that folder to blob storage, in the given data store, once the job completes.
For more examples, please see examples.md. For more details about datasets, see here.
Additional arguments you should know about
submit_to_azure_if_needed has a large number of arguments; please check the
API documentation for an exhaustive list.
The particularly helpful ones are listed below.
- experiment_name: All runs in AzureML are grouped in “experiments”. By default, the experiment name is determined by the name of the script you submit, but you can specify a name explicitly with this argument.
- environment_variables: A dictionary with the contents of all environment variables that should be set inside the AzureML run, before the script is started.
- docker_base_image: This specifies the name of the Docker base image to use for creating the Python environment for your script. The amount of memory to allocate for Docker is given by docker_shm_size.
- num_nodes: The number of nodes on which your script should run. This is essential for distributed training.
- tags: A dictionary mapping from string to string, with additional tags that will be stored on the AzureML run. This is helpful to add metadata about the run for later use.
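As a sketch of how these arguments fit together (the experiment name, Docker image, node count and tags below are example values only):

from health_azure import submit_to_azure_if_needed

# Example values only - adjust the experiment name, Docker image, node count and tags for your project.
run_info = submit_to_azure_if_needed(compute_cluster_name="my-gpu-cluster",
                                     experiment_name="image_preprocessing",
                                     environment_variables={"MY_FLAG": "1"},
                                     docker_base_image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04",
                                     docker_shm_size="16g",
                                     num_nodes=1,
                                     tags={"purpose": "preprocessing"})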
Conda environments, Alternate pips, Private wheels
The function submit_to_azure_if_needed
tries to locate a Conda environment file in the current folder,
or in the Python path, with the name environment.yml
. The actual Conda environment file to use can be specified
directly with:
run_info = submit_to_azure_if_needed(
...
conda_environment_file=conda_environment_file,
where conda_environment_file
is a pathlib.Path
or a string identifying the Conda environment file to use.
The basic use of Conda assumes that packages listed are published Conda packages or published Python packages on PyPI. However, during development, the Python package may be on Test.PyPI, or in some other location, in which case the alternative package location can be specified directly with:
run_info = submit_to_azure_if_needed(
...
pip_extra_index_url="https://test.pypi.org/simple/",
Finally, it is possible to use a private wheel, if the package is only available locally with:
run_info = submit_to_azure_if_needed(
...
private_pip_wheel_path=private_pip_wheel_path,
where private_pip_wheel_path
is a pathlib.Path
or a string identifying the wheel package to use. In this case,
this wheel will be copied to the AzureML environment as a private wheel.
Setting up Azure
If you already have an AzureML workspace available, you can go straight to the last step.
To set up all your Azure resources, you need to have:
A subscription to Azure.
An account that has “Owner” permissions to the Azure subscription.
There are two ways to set up all necessary resources, either via the Azure portal or via the Azure Command-line Interface (CLI). We recommend the CLI because all necessary resources can be easily created via a single script.
Creating an AzureML workspace via the Azure Portal
If you prefer to create your workspace via the web UI on the Azure Portal, please follow the steps below.
Creating an AzureML workspace via the Azure Command-line Tools
A purely command-line driven setup is possible via the Azure Command-line Tools. These tools are available for multiple platforms, including Linux, Mac, and Windows.
After downloading the command-line tools, you can run the following command to add the ml extension that is required to create an AzureML workspace:
az extension add --name ml
Documentation
Collecting the necessary information
Find out which Azure data centre locations you can use:
az account list-locations -o table
You will need the location names (second column in the table) to create resources in the right geographical regions. Choosing the right region can be particularly important if your data governance requires the data to be processed inside certain geographical boundaries.
For the storage account, please choose an SKU based on your needs, as described here. Most likely, Standard_LRS
will be the right SKU for you.
Creating resources
The script below will create:
- an AzureML workspace.
- a storage account to hold the datasets, with a container called datasets.
- a datastore that links the AzureML workspace to the storage account.
- a resource group for all of the above items.
In the script, you will need to replace the values of the following variables:
- location - the location of the Azure datacenter you want to use.
- prefix - the prefix you want to use for the resources you create. This will also be the name of the AzureML workspace.
export location=uksouth # The Azure location where the resources should be created
export prefix=himl # The name of the AzureML workspace. This is also the prefix for all other resources.
export container=datasets
export datastorefile=datastore.yaml
az group create \
    --name ${prefix}rg \
    --location ${location}
az storage account create \
    --name ${prefix}data \
    --resource-group ${prefix}rg \
    --location ${location} \
    --sku Standard_LRS
az storage container create \
    --account-name ${prefix}data \
    --name ${container} \
    --auth-mode login
az ml workspace create \
    --resource-group ${prefix}rg \
    --name ${prefix} \
    --location ${location}
key=$(az storage account keys list --resource-group ${prefix}rg --account-name ${prefix}data --query [0].value -o tsv)
cat >${datastorefile} <<EOL
\$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: datasets
type: azure_blob
description: Pointing to the ${container} container in the ${prefix}data storage account.
account_name: ${prefix}data
container_name: ${container}
credentials:
  account_key: ${key}
EOL
az ml datastore create --file ${datastorefile} --resource-group ${prefix}rg --workspace-name ${prefix}
rm ${datastorefile}
Note that the datastore will use the storage account key to authenticate. If you want to use Shared Access Signature (SAS) instead, replace the creation of the datastore config file in the above script with the following command:
key=$(az storage container generate-sas --account-name ${prefix}data --name ${container} --permissions acdlrw --https-only --expiry 2024-01-01 -o tsv)
cat >${datastorefile} <<EOL
\$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: datasets
type: azure_blob
description: Pointing to the ${container} container in the ${prefix}data storage account.
account_name: ${prefix}data
container_name: ${container}
credentials:
  sas_token: ${key}
EOL
You can adjust the expiry date and the permissions of the SAS token (full read/write permission
in the script above). For further options, run az storage container generate-sas --help.
Creating compute clusters and permissions
Now that you have created the core AzureML workspace, you need to create a compute cluster.
To adjust permissions, find the AzureML workspace that you just created in the Azure Portal. Add yourself and your team members with “Contributor” permissions to the workspace, following the guidelines here.
Accessing the workspace
The hi-ml
toolbox relies on a workspace configuration file called config.json
to access the right AzureML workspace.
This file can be downloaded from the UI of the workspace. It needs to be placed either in your copy of the hi-ml
repository,
or in your repository that uses the hi-ml
package.
1. In the browser, navigate to the AzureML workspace that you want to use for running your training job.
2. In the top right section, there will be a dropdown menu showing the name of your AzureML workspace. Expand that.
3. In the panel, there is a link “Download config file”. Click that.
4. This will download a file config.json. Copy that file to the root folder of your repository.
The file config.json
should look like this:
{
    "subscription_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "resource_group": "myresourcegroup",
    "workspace_name": "myworkspace"
}
As an alternative to keeping the config.json
file in your repository, you can specify the
necessary information in environment variables. The environment variables are:
- HIML_SUBSCRIPTION_ID: The subscription ID of the AzureML workspace, taken from the subscription_id field in the config.json file.
- HIML_RESOURCE_GROUP: The resource group of the AzureML workspace, taken from the resource_group field in the config.json file.
- HIML_WORKSPACE_NAME: The name of the AzureML workspace, taken from the workspace_name field in the config.json file.
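For example, on Linux or macOS you could set these in your shell before running the script (the values shown are placeholders matching the config.json example above):

export HIML_SUBSCRIPTION_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export HIML_RESOURCE_GROUP="myresourcegroup"
export HIML_WORKSPACE_NAME="myworkspace"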
When accessing the workspace, the hi-ml
toolbox will first look for the config.json
file. If it is not found, it
will fall back to the environment variables. For details, see the documentation of the get_workspace
function in
readthedocs.
Connecting to Azure
Authentication
The hi-ml
package uses two possible ways of authentication with Azure.
The default is what is called “Interactive Authentication”. When you submit a job to Azure via hi-ml
, this will
use the credentials you used in the browser when last logging into Azure. If there are no credentials yet, you should
see instructions printed out to the console about how to log in using your browser.
We recommend using Interactive Authentication.
Alternatively, you can use a so-called Service Principal, for example within build pipelines.
Service Principal Authentication
A Service Principal is a form of generic identity or machine account. This is essential if you would like to submit training runs from code, for example from within an Azure pipeline. You can find more information about application registrations and service principal objects here.
If you would like to use Service Principal, you will need to create it in Azure first, and then store 3 pieces of information in 3 environment variables — please see the instructions below. When all the 3 environment variables are in place, your Azure submissions will automatically use the Service Principal to authenticate.
Creating the Service Principal
1. Navigate back to aka.ms/portal.
2. Navigate to App registrations (use the top search bar to find it).
3. Click on + New registration on the top left of the page.
4. Choose a name for your application, e.g. MyServicePrincipal, and click Register.
5. Once it is created, you will see your application in the list appearing under App registrations. This step might take a few minutes.
6. Click on the resource to access its properties. In particular, you will need the application ID. You can find this ID in the Overview tab (accessible from the list on the left of the page).
7. Create an environment variable called HIML_SERVICE_PRINCIPAL_ID, and set its value to the application ID you just saw.
8. You need to create an application secret to access the resources managed by this service principal. On the pane on the left, find Certificates & Secrets. Click on + New client secret (bottom of the page), and note down your token. Warning: this token will only be displayed once, at the time of its creation; you will not be able to view it again later.
9. Create an environment variable called HIML_SERVICE_PRINCIPAL_PASSWORD, and set its value to the token you just added.
Providing permissions to the Service Principal
Now that your service principal is created, you need to give permission for it to access and manage your AzureML workspace. To do so:
1. Go to your AzureML workspace. To find it, you can type the name of your workspace in the search bar above.
2. On the Overview page, there is a link to the Resource Group that contains the workspace. Click on that.
3. When on the Resource Group, navigate to Access control. Then click on + Add > Add role assignment. A pane will appear on the right. Select Role > Contributor. In the Select field, type the name of your Service Principal and select it. Finish by clicking Save at the bottom of the pane.
Azure Tenant ID
The last remaining piece is the Azure tenant ID, which also needs to be available in an environment variable. To get that ID:
1. Log into Azure.
2. Via the search bar, find “Azure Active Directory” and open it.
3. In the overview of that, you will see a field “Tenant ID”.
4. Create an environment variable called HIML_TENANT_ID, and set that to the tenant ID you just saw.
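Putting the three Service Principal pieces together, you would end up with environment variables along these lines (all values are placeholders):

export HIML_SERVICE_PRINCIPAL_ID="00000000-0000-0000-0000-000000000000"
export HIML_SERVICE_PRINCIPAL_PASSWORD="<the client secret you noted down>"
export HIML_TENANT_ID="00000000-0000-0000-0000-000000000000"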
Datasets
Key concepts
We’ll first outline a few concepts that are helpful for understanding datasets.
Blob Storage
Firstly, there is Azure Blob Storage.
Each blob storage account has multiple containers - you can think of containers as big disks that store files.
The hi-ml
package assumes that your datasets live in one of those containers, and each top level folder corresponds
to one dataset.
AzureML Data Stores
Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described here. Data stores provide access to one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around in the code - rather, the credentials are stored in the data store once and for all.
You can view all data stores in your AzureML workspace by clicking on one of the bottom icons in the left-hand navigation bar of the AzureML studio.
One of these data stores is designated as the default data store.
AzureML Datasets
Thirdly, there are datasets. Again, this is a concept coming from Azure Machine Learning. A dataset is defined by
A data store
A set of files accessed through that data store
You can view all datasets in your AzureML workspace by clicking on one of the icons in the left-hand navigation bar of the AzureML studio.
Preparing data
To simplify usage, the hi-ml
package creates AzureML datasets for you. All you need to do is to:
- Create a blob storage account for your data, and within it, a container for your data.
- Create a data store that points to that storage account, and store the credentials for the blob storage account in it.
From that point on, you can drop a folder of files in the container that holds your data. Within the hi-ml
package,
just reference the name of the folder, and the package will create a dataset for you, if it does not yet exist.
Using the datasets
The simplest way of specifying that your script uses a folder of data from blob storage is as follows: Add the
input_datasets
argument to your call of submit_to_azure_if_needed
like this:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
input_datasets=["my_folder"],
default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
What will happen under the hood?
The toolbox will check if there is already an AzureML dataset called “my_folder”. If so, it will use that. If there is no dataset of that name, it will create one from all the files in blob storage in folder “my_folder”. The dataset will be created using the data store provided, “my_datastore”.
Once the script runs in AzureML, it will download the dataset “my_folder” to a temporary folder.
You can access this temporary location by
run_info.input_datasets[0]
, and read the files from it.
More complicated setups are described below.
Input and output datasets
Any run in AzureML can consume a number of input datasets. In addition, an AzureML run can also produce an output dataset (or even more than one).
Output datasets are helpful if you would like to run, for example, a script that transforms one dataset into another.
You can use that via the output_datasets
argument:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
input_datasets=["my_folder"],
output_datasets=["new_dataset"],
default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
Your script can now read files from input_folder
, transform them, and write them to output_folder
. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.
Mounting and downloading
An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted, the files are accessed via the network when needed - this is very helpful for large datasets where a download would create a long waiting time before the job starts.
Similarly, an output dataset can be uploaded at the end of the script, or it can be mounted. Mounting here means that all files will be written to blob storage already while the script runs (rather than at the end).
Note: If you are using mounted output datasets, you should NOT rename files in the output folder.
Mounting and downloading can be triggered by passing in DatasetConfig
objects for the input_datasets
argument,
like this:
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mounting=True)
output_dataset = DatasetConfig(name="new_dataset", datastore="my_datastore", use_mounting=True)
run_info = submit_to_azure_if_needed(...,
input_datasets=[input_dataset],
output_datasets=[output_dataset],
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
Local execution
For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML. Clearly, your script needs to be able to access data in those runs too.
There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the
DatasetConfig
objects:
from pathlib import Path
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
local_folder=Path("/datasets/my_folder_local"),
)
run_info = submit_to_azure_if_needed(...,
input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
Secondly, if local_folder
is not specified, then the dataset will either be downloaded or mounted to a temporary folder locally, depending on the use_mounting
flag. The path to it will be available in run_info
as above.
input_folder = run_info.input_datasets[0]
Note that mounting the dataset locally is only supported on Linux because it requires the use of the native package libfuse, which must first be installed. Also, if running in a Docker container, it must be started with additional arguments. For more details see here: azureml.data.filedataset.mount.
Making a dataset available at a fixed folder location
Occasionally, scripts expect the input dataset at a fixed location, for example, data is always read from /tmp/mnist
.
AzureML has the capability to download/mount a dataset to such a fixed location. With the hi-ml
package, you can
trigger that behaviour via an additional option in the DatasetConfig
objects:
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
use_mounting=True,
target_folder="/tmp/mnist",
)
run_info = submit_to_azure_if_needed(...,
input_datasets=[input_dataset],
)
# Input_folder will now be "/tmp/mnist"
input_folder = run_info.input_datasets[0]
This is also true when running locally - if local_folder
is not specified and an AzureML workspace can be found, then the dataset will be downloaded or mounted to the target_folder
.
Overwriting existing output datasets
When creating an output dataset with the same name as an existing dataset, the default behaviour of hi-ml
is to overwrite the existing dataset. This is because, if a run fails during the upload stage, corrupt files may be created; allowing overwriting means that these corrupt datasets will not cause errors. If you wish to disable this behaviour, it can be controlled using the overwrite_existing
parameter (only available in SDK v1, hence setting strictly_aml_v1=True
):
from health_azure import DatasetConfig, submit_to_azure_if_needed
output_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
overwrite_existing=False,
)
# fails if output dataset already exists:
run_info = submit_to_azure_if_needed(...,
output_datasets=[output_dataset],
strictly_aml_v1=True,
)
Dataset versions
AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML
workspace. In the hi-ml
toolbox, the latest version of a dataset is always used unless specified otherwise.
If you do need a specific version, use the version
argument in the DatasetConfig
objects:
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
version=7,
)
run_info = submit_to_azure_if_needed(...,
input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
Hyperparameter Search via Hyperdrive (AML SDK v1)
HyperDrive runs
can start multiple AzureML jobs in parallel. This can be used for tuning hyperparameters, or executing multiple
training runs for cross validation. To use that with the hi-ml
package, simply supply a HyperDrive configuration
object as an additional argument. Note that this object needs to be created with an empty run_config
argument (this
will later be replaced with the correct run_config
that submits your script.)
The example below shows a hyperparameter search that aims to minimize the validation loss val_loss
, by choosing
one of three possible values for the learning rate commandline argument learning_rate
.
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from health_azure import submit_to_azure_if_needed
hyperdrive_config = HyperDriveConfig(
    run_config=ScriptRunConfig(source_directory=""),
    hyperparameter_sampling=GridParameterSampling(
        parameter_space={
            "learning_rate": choice([0.1, 0.01, 0.001])
        }),
    primary_metric_name="val_loss",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=5
)
submit_to_azure_if_needed(..., hyperdrive_config=hyperdrive_config)
For further examples, please check the example scripts here, and the HyperDrive documentation.
Hyperparameter Search in AML SDK v2
There is no concept of a HyperDriveConfig in AML SDK v2. Instead, hyperparameter search arguments are passed into a
command, and then the ‘sweep’ method is called (see the AML docs). To specify a hyperparameter
search job, you must define the method get_parameter_tuning_args in your Container. This should return a dictionary of
the arguments to be passed in to the command. For example:
def get_parameter_tuning_args(self) -> Dict[str, Any]:
    from azure.ai.ml.entities import Choice
    from health_azure.himl import (MAX_TOTAL_TRIALS_ARG, PARAM_SAMPLING_ARG, SAMPLING_ALGORITHM_ARG,
                                   PRIMARY_METRIC_ARG, GOAL_ARG)

    values = [0.1, 0.5, 0.9]
    argument_name = "learning_rate"
    param_sampling = {argument_name: Choice(values)}
    metric_name = "val/loss"
    hparam_args = {
        MAX_TOTAL_TRIALS_ARG: len(values),
        PARAM_SAMPLING_ARG: param_sampling,
        SAMPLING_ALGORITHM_ARG: "grid",
        PRIMARY_METRIC_ARG: metric_name,
        GOAL_ARG: "Minimize"
    }
    return hparam_args
Additional parameters, sampling strategies, limits etc. are described in the link above. Note that each job that is
created will receive an additional command line argument <argument_name>, and it is up to you to update the script so that it can parse and use this argument.
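For example, if argument_name is "learning_rate" as above, each child job will be started with an extra --learning_rate argument, which your script could pick up with a standard argument parser. This is only a sketch; adapt it to however your script already parses its arguments.

import argparse

parser = argparse.ArgumentParser()
# Each sweep child job passes its sampled value via this argument.
parser.add_argument("--learning_rate", type=float, default=1e-3)
args, _ = parser.parse_known_args()
print(f"Training with learning rate {args.learning_rate}")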
Using Cheap Low Priority VMs
By using Low Priority machines in AzureML, we can run training at greatly reduced costs (around 20% of the original price, see references below for details). This comes with the risk, though, of having the job interrupted and later re-started. This document describes the inner workings of Low Priority compute, and how to best make use of it.
Because the jobs can get interrupted, low priority machines are not suitable for production workloads where time is critical. They do offer a lot of benefits, though, for long-running training jobs or large-scale experimentation that would otherwise be expensive to carry out.
Setting up the Compute Cluster
Jobs in Azure Machine Learning run in a “compute cluster”. When creating a compute cluster, we can specify the size of the VM, the type and number of GPUs, etc. Doing this via the AzureML UI is described here. Doing it programmatically is described here
One of the settings to tweak when creating the compute cluster is whether the machines are “Dedicated” or “Low Priority”:
- Dedicated machines will be permanently allocated to your compute cluster. The VMs in a dedicated cluster will always be available, unless the cluster is set up in a way that it removes idle machines. Jobs will not be interrupted.
- Low priority machines effectively make use of spare capacity in the data centers; you can think of them as “dedicated machines that are presently idle”. They are available at a much lower price (around 20% of the price of a dedicated machine). These machines are made available to you until they are needed as dedicated machines somewhere else.
In order to get a compute cluster that operates at the lowest price point:
- Choose low priority machines.
- Set “Minimum number of nodes” to 0, so that the cluster removes all idle machines if no jobs are running.
For details on pricing, check this Azure price calculator and choose “Category: GPU”. The price for low priority VMs is given in the “Spot” column.
Behaviour of Low Priority VMs
Jobs can be interrupted at any point; this is called “low priority preemption”. When interrupted, the job stops - there
is no signal that we can make use of to do cleanup. All the files that the job has produced up to that
point in the outputs
and logs
folders will be saved to the cloud.
At some later point, the job will be assigned a virtual machine again. When re-started, all the files that the job had
produced in its previous run will be available on disk again where they were before interruption, mounted at the same
path. That is, if the interrupted job wrote a file outputs/foo.txt
, this file will be accessible as outputs/foo.txt
also after the restart.
Note that all AzureML-internal log files that the job produced in a previous run will be overwritten
(this behaviour may change in the future). That is in contrast to the behaviour for metrics that the interrupted job had
saved to AzureML already (for example, metrics written by a call like Run.log("loss", loss_tensor.item())
):
Those metrics are already stored in AzureML, and will still be there when the job restarts. The re-started job will
then append to the metrics that had been written in the previous run. This typically shows as sudden jumps in
metrics, as illustrated here:
In this example, the learning rate was increasing for the first 6 or so epochs. Then the job got preempted, and started
training from scratch, with the initial learning rate and schedule. Note that this behaviour is only an artifact of how
the metrics are stored in AzureML; the actual training is doing the right thing.
How do you verify that your job got interrupted? Usually, you will see a warning displayed on the job page in the AzureML UI that says something along the lines of “Low priority compute preemption warning: a node has been preempted.” You can use kinks in metrics as another indicator that your job got preempted: sudden jumps in metrics, after which the metric follows a shape similar to the one at job start, usually indicate low priority preemption.
Note that a job can be interrupted more than once.
Best Practice Guide for Your Jobs
In order to make best use of low priority compute, your code needs to be made resilient to restarts. Essentially, this means that it should write regular checkpoints, and try to use those checkpoint files if they already exist. Examples of how to best do that are given below.
In addition, you need to bear in mind that the job can be interrupted at any moment, for example when it is busy uploading huge checkpoint files to Azure. When trying to upload again after restart, there can be resource collisions.
Writing and Using Recovery Checkpoints
When using PyTorch Lightning, you can add a checkpoint callback to your trainer that ensures that you save the model
and optimizer to disk at regular intervals. This callback needs to be added to your Trainer
object. Note that these
recovery checkpoints need to be written to the outputs
folder, because only files in this folder get saved to Azure
automatically when the job gets interrupted.
When starting training, your code needs to check if there is already a recovery checkpoint present on disk. If so, training should resume from that point.
Here is a code snippet that illustrates all that:
import re
from pathlib import Path
import numpy as np
from health_ml.utils import AzureMLLogger
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
RECOVERY_CHECKPOINT_FILE_NAME = "recovery_"
CHECKPOINT_FOLDER = "outputs/checkpoints"
def get_latest_recovery_checkpoint():
    all_recovery_files = [f for f in Path(CHECKPOINT_FOLDER).glob(RECOVERY_CHECKPOINT_FILE_NAME + "*")]
    if len(all_recovery_files) == 0:
        return None
    # Get recovery checkpoint with highest epoch number
    recovery_epochs = [int(re.findall(r"[\d]+", f.stem)[0]) for f in all_recovery_files]
    idx_max_epoch = int(np.argmax(recovery_epochs))
    return str(all_recovery_files[idx_max_epoch])

recovery_checkpoint = ModelCheckpoint(dirpath=CHECKPOINT_FOLDER,
                                      filename=RECOVERY_CHECKPOINT_FILE_NAME + "{epoch}",
                                      period=10)
trainer = Trainer(default_root_dir="outputs",
                  callbacks=[recovery_checkpoint],
                  logger=[AzureMLLogger()],
                  resume_from_checkpoint=get_latest_recovery_checkpoint())
Additional Optimizers and Other State
In order to be resilient to interruption, your jobs need to save all their state to disk. In PyTorch Lightning training,
this would include all optimizers that you are using. The “normal” optimizer for model training is saved to the
checkpoint by Lightning already. However, you may be using callbacks or other components that maintain state. As an
example, training a linear head for self-supervised learning can be done in a callback, and that callback can have its
own optimizer. Such callbacks need to correctly implement the on_save_checkpoint
method to save their state to the
checkpoint, and on_load_checkpoint
to load it back in.
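As an illustration of the idea, here is a minimal sketch at the LightningModule level; the extra head, its optimizer and the checkpoint key are assumptions for this example, and Callback hooks are analogous but their exact signatures vary between Lightning versions.

from typing import Any, Dict

import torch
from pytorch_lightning import LightningModule


class ModelWithExtraOptimizer(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)
        # An additional component with its own optimizer, for example a linear head trained outside the main loop.
        self.extra_head = torch.nn.Linear(10, 2)
        self.extra_optimizer = torch.optim.Adam(self.extra_head.parameters())

    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        # Persist the extra optimizer state so that training can resume correctly after preemption.
        checkpoint["extra_optimizer"] = self.extra_optimizer.state_dict()

    def on_load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        # Restore the extra optimizer state if it was stored in the checkpoint.
        if "extra_optimizer" in checkpoint:
            self.extra_optimizer.load_state_dict(checkpoint["extra_optimizer"])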
For more information about persisting state, check the PyTorch Lightning documentation.
Commandline tools
Run TensorBoard
From the command line, run the command
himl-tb
specifying one of
[--experiment] [--latest_run_file] [--run]
This will start a TensorBoard session, by default running on port 6006. To use an alternative port, specify this with --port
.
If --experiment
is provided, the most recent Run from this experiment will be visualised.
If --latest_run_file
is provided, the script will expect to find a RunId in this file.
Alternatively you can specify the Runs to visualise via --run
. This can be a single run id, or multiple ids separated by commas. This argument also accepts one or more run recovery ids, although these are not recommended since it is no longer necessary to provide an experiment name in order to recover an AML Run.
By default, this tool expects that your TensorBoard logs live in a folder named ‘logs’ and will create a similarly named folder in your root directory. If your TensorBoard logs are stored elsewhere, you can specify this with the --log_dir
argument.
If you choose to specify --experiment
, you can also specify --num_runs
to view and/or --tags
to filter by.
If your AML config path is not ROOT_DIR/config.json, you must also specify --config_file.
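Putting these options together, a typical invocation might look like this (the experiment name is a placeholder):

himl-tb --experiment my_experiment_name --num_runs 2 --port 6007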
To see an example of how to create TensorBoard logs using PyTorch on AML, see the AML submitting script which submits the following pytorch sample script. Note that to run this, you'll need to create an environment with pytorch and tensorboard as dependencies, as a minimum. See an example conda environment. This will create an experiment named 'tensorboard_test' on your Workspace, with a single run. Go to outputs + logs -> outputs to see the tensorboard events file.
Download files from AML Runs
From the command line, run the command
himl-download
specifying one of
[--experiment] [--latest_run_file] [--run]
If --experiment
is provided, the most recent Run from this experiment will be downloaded.
If --latest_run_file
is provided, the script will expect to find a RunId in this file.
Alternatively you can specify the Run to download via --run
. This can be a single run id, or multiple ids separated by commas. This argument also accepts one or more run recovery ids, although these are not recommended since it is no longer necessary to provide an experiment name in order to recover an AML Run.
The files associated with your Run will be downloaded to the location specified with --output_dir
(by default ROOT_DIR/outputs)
If you choose to specify --experiment
, you can also specify --tags
to filter by.
If your AML config path is not ROOT_DIR/config.json, you must also specify --config_file.
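For example (the run id and output directory are placeholders):

himl-download --run my_run_id_123 --output_dir ./downloaded_runs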
Creating your own command line tools
When creating your own command line tools that interact with the Azure ML ecosystem, you may wish to use the
AmlRunScriptConfig
class for argument parsing. This gives you a quick way to accept command line arguments that specify the following:
- experiment: a string representing the name of an Experiment, from which to retrieve AML runs
- tags: to filter the runs within the given experiment
- num_runs: to define the number of most recent runs to return from the experiment
- run: to instead define one or more run ids from which to retrieve runs (also supports the older format of run recovery ids, although these are obsolete now)
- latest_run_file: to instead provide a path to a file containing the id of your latest run, for retrieval
- config_path: to specify a config.json file in which your workspace settings are defined
You can extend this list of arguments by creating a child class that inherits from AmlRunScriptConfig.
Defining your own argument types
Additional arguments can have any of the following types: bool, integer, float, string, list, or class/class instance,
with no additional work required. You can also define your own custom type, by providing a custom class in your code that
inherits from CustomTypeParam. It must define 2 methods:
- _validate(self, x: Any): which should raise a ValueError if x is not of the type you expect, and should also make a call to super()._validate(val).
- from_string(self, y: string): which takes in the command line arg as a string (y) and returns an instance of the type that you want. For example, if your custom type is a tuple, this method should create a tuple from the input string and return that. An example of a custom type can be seen in our own custom type RunIdOrListParam, which accepts a string representing one or more run ids (or run recovery ids) and returns either a List or a single RunId object (or a RunRecoveryId object if appropriate).
Example:
from typing import Any

import param

# "util" is assumed to be the hi-ml module providing CustomTypeParam and AmlRunScriptConfig,
# e.g. imported as: from health_azure import utils as util


class EvenNumberParam(util.CustomTypeParam):
    """ Our custom type param for even numbers """

    def _validate(self, val: Any) -> None:
        if (not self.allow_None) and val is None:
            raise ValueError("Value must not be None")
        if val % 2 != 0:
            raise ValueError(f"{val} is not an even number")
        super()._validate(val)  # type: ignore

    def from_string(self, x: str) -> int:
        return int(x)


class MyScriptConfig(util.AmlRunScriptConfig):
    # example of a generic param
    simple_string: str = param.String(default="")
    # example of a custom param
    even_number = EvenNumberParam(2, doc="your choice of even number")
Downloading from/ uploading to Azure ML
All of the below functions will attempt to find a current workspace, if running in Azure ML, or else will attempt to locate a ‘config.json’ file in the current directory or its parents. Alternatively, you can specify your own Workspace object or a path to a file containing the workspace settings.
Download files from an Azure ML Run
To download all files from an AML Run, given its run id, perform the following:
from pathlib import Path
from health_azure import download_files_from_run_id
run_id = "example_run_id_123"
output_folder = Path("path/to/save")
download_files_from_run_id(run_id, output_folder)
Here, “path/to/save” represents the folder in which we want the downloaded files to be stored.
If you wish to restrict which files are downloaded, you can do so with the “prefix” parameter. E.g. prefix=”outputs” will download all files within the “outputs” folder, if such a folder exists within your Run. The prefix is matched against the full file path: e.g. if your run contains the files [“abc/def/1.txt”, “abc/2.txt”] and you specify the prefix “abc” and the output_folder “my_outputs”, you’ll end up with the files [“my_outputs/abc/def/1.txt”, “my_outputs/abc/2.txt”].
There is an additional parameter, “validate_checksum”, which defaults to False. If True, it will validate the MD5 hash of the data arriving (in chunks) against that being sent.
Note that if your code is running in a distributed manner, files will only be downloaded onto nodes with local rank = 0. E.g. if you have 2 nodes each running 4 processes, the file will be downloaded by CPU/GPU 0 on each of the 2 nodes. All processes will be synchronized to only exit the downloading method once it has completed on all nodes/ranks.
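Putting these options together, a call restricted to the run's “outputs” folder with checksum validation might look like this (the run id and folder names are placeholders):

from pathlib import Path
from health_azure import download_files_from_run_id

# Download only files whose path starts with "outputs", and validate MD5 checksums while downloading.
download_files_from_run_id("example_run_id_123",
                           Path("my_outputs"),
                           prefix="outputs",
                           validate_checksum=True)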
Downloading checkpoint files from a run
To download checkpoint files from an Azure ML Run, perform the following:
from pathlib import Path
from health_azure import download_checkpoints_from_run_id
download_checkpoints_from_run_id("example_run_id_123", Path("path/to/checkpoint/directory"))
All files within the checkpoint directory will be downloaded into the folder specified by “path/to/checkpoint/directory”.
Since checkpoint files are often large and therefore prone to corruption during download, by default, this function will validate the MD5 hash of the data downloaded (in chunks) compared to that being sent.
Note that if your code is running in a distributed manner, files will only be downloaded onto nodes with local rank = 0. E.g. if you have 2 nodes each running 4 processes, the file will be downloaded by CPU/GPU 0 on each of the 2 nodes. All processes will be synchronized to only exit the downloading method once it has completed on all nodes/ranks.
Downloading files from an Azure ML Datastore
To download data from an Azure ML Datastore within your Workspace, follow this example:
from pathlib import Path
from health_azure import download_from_datastore
download_from_datastore("datastore_name", "prefix", Path("path/to/output/directory") )
where “prefix” represents the path to the file(s) to be downloaded, relative to the datastore “datastore_name”.
Azure will search for files within the Datastore whose paths begin with this string.
If you wish to download multiple files from the same folder, set “prefix” to the path of that folder.
This function takes additional parameters “overwrite” and “show_progress”. If True, overwrite will overwrite any existing local files with the same path. If False and there is a duplicate file, it will skip this file. If show_progress is set to True, the progress of the file download will be visible in the terminal.
Uploading files to an Azure ML Datastore
To upload data to an Azure ML Datastore within your workspace, perform the following:
from pathlib import Path
from health_azure import upload_to_datastore
upload_to_datastore("datastore_name", Path("path/to/local/data/folder"), Path("path/to/datastore/folder") )
Where “datastore_name” is the name of the registered Datastore within your workspace that you wish to upload to and “path/to/datastore/folder” is the relative path within this Datastore that you wish to upload data to. Note that the path to local data must be a folder, not a single path. The folder name will not be included in the remote path. E.g. if you specify the local_data_dir=”foo/bar” and that contains the files [“1.txt”, “2.txt”], and you specify the remote_path=”baz”, you would see the following paths uploaded to your Datastore: [“baz/1.txt”, “baz/2.txt”]
This function takes additional parameters “overwrite” and “show_progress”. If True, overwrite will overwrite any existing remote files with the same path. If False and there is a duplicate file, it will skip this file. If show_progress is set to True, the progress of the file upload will be visible in the terminal.
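For example, to overwrite any existing remote files and show a progress bar (the datastore name and paths are placeholders, as in the snippet above):

from pathlib import Path
from health_azure import upload_to_datastore

# Upload the contents of a local folder to a folder in the given datastore.
upload_to_datastore("datastore_name",
                    Path("path/to/local/data/folder"),
                    Path("path/to/datastore/folder"),
                    overwrite=True,
                    show_progress=True)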
Examples
Note: All examples below contain links to sample scripts that are also included in the repository.
The experience is optimized for use on readthedocs. When navigating to the sample scripts on the github UI,
you will only see the .rst
file that links to the .py
file. To access the .py
file, go to the folder that
contains the respective .rst
file.
Basic integration
The sample examples/1/sample.py is a script that takes an optional command line argument of a target value and prints all the prime numbers up to (but not including) this target. It is simply intended to demonstrate a long running operation that we want to run in Azure. Run it using e.g.
cd examples/1
python sample.py -n 103
The sample examples/2/sample.py shows the minimal modifications to run this in AzureML. Firstly, create an AzureML workspace and download the config file, as explained here. The config file should be placed in the same folder as the sample script. A sample Conda environment file is supplied. Install the hi-ml package into the current environment. Finally, add the following to the sample script:
from health_azure import submit_to_azure_if_needed
...
def main() -> None:
    _ = submit_to_azure_if_needed(
        compute_cluster_name="lite-testing-ds2",
        wait_for_completion=True,
        wait_for_completion_show_output=True)
Replace lite-testing-ds2
with the name of a compute cluster created within the AzureML workspace.
If this script is invoked in the same way as the first sample, e.g.
cd examples/2
python sample.py -n 103
then the output will be exactly the same. But if the script is invoked as follows:
cd examples/2
python sample.py -n 103 --azureml
then the function submit_to_azure_if_needed
will perform all the required actions to run this script in AzureML and exit. Note that:
- code after submit_to_azure_if_needed is not run locally, but it is run in AzureML.
- the print statement prints to the AzureML console output and is available in the Output + logs tab of the experiment, in the file 70_driver_log.txt if using the old AzureML runtime, or std_log.txt if using the new AzureML runtime, and can be downloaded from there.
- the command line arguments are passed through (apart from --azureml) when running in AzureML.
- a new file most_recent_run.txt will be created containing an identifier of this AzureML run.
A sample script examples/2/results.py demonstrates how to programmatically download the driver log file.
Output files
The sample examples/3/sample.py demonstrates output file handling when running on AzureML. Because each run is performed in a separate VM or cluster, any file output is not generally preserved. In order to keep the output, it should be written to the outputs
folder when running in AzureML. The AzureML infrastructure will preserve this and it will be available for download from the outputs
folder in the Output + logs
tab.
Make the following additions:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(
...
parser.add_argument("-o", "--output", type=str, default="primes.txt", required=False, help="Output file name")
...
output = run_info.output_folder / args.output
output.write_text("\n".join(map(str, primes)))
When running locally submit_to_azure_if_needed
will create a subfolder called outputs
and then the output can be written to the file args.output
there. When running in AzureML the output will be available in the file args.output
in the Experiment.
A sample script examples/3/results.py demonstrates how to programmatically download the output file.
Output datasets
The sample examples/4/sample.py demonstrates output dataset handling when running on AzureML.
In this case, the following parameters are added to submit_to_azure_if_needed
:
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(
...
default_datastore="himldatasets",
output_datasets=["himl_sample4_output"],
The default_datastore
is required if using the simplest configuration for an output dataset, to just use the blob container name. There is an alternative that doesn’t require the default_datastore
and allows a different datastore for each dataset:
from health_azure import DatasetConfig, submit_to_azure_if_needed
...
run_info = submit_to_azure_if_needed(
...
output_datasets=[DatasetConfig(name="himl_sample4_output", datastore="himldatasets")]
Now the output folder is constructed as follows:
output_folder = run_info.output_datasets[0] or Path("outputs") / "himl_sample4_output"
output_folder.mkdir(parents=True, exist_ok=True)
output = output_folder / args.output
When running in AzureML run_info.output_datasets[0]
will be populated using the new parameter and the output will be written to that blob storage. When running locally run_info.output_datasets[0]
will be None and a local folder will be created and used.
A sample script examples/4/results.py demonstrates how to programmatically download the output dataset file.
For more details about datasets, see here
Input datasets
This example trains a simple classifier on a toy dataset, first creating the dataset files and then in a second script training the classifier.
The script examples/5/inputs.py is provided to prepare the csv files. Run the script to download the Iris dataset and create two CSV files:
cd examples/5
python inputs.py
The training script examples/5/sample.py is modified from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train_iris.py to work with input csv files. Start it to train the actual classifier, based on the data files that were just written:
cd examples/5
python sample.py
Including input files in the snapshot
When using very small data files (on the order of a few MB), the easiest way to get the input data to Azure is to include them in the set of (source) files that are uploaded to Azure. You can run the dataset creation script on your local machine, writing the resulting two files to the same folder where your training script is located, and then submit the training script to AzureML. Because the dataset files are in the same folder, they will automatically be uploaded to AzureML.
However, it is not ideal to have the input files in the snapshot: The size of the snapshot is limited to 25 MB. It is better to put the data files into blob storage and use input datasets.
Creating the dataset in AzureML
The suggested way of creating a dataset is to run a script in AzureML that writes an output dataset. This is particularly important for large datasets, to avoid the usually low bandwidth from a local machine to the cloud.
This is shown in examples/6/inputs.py:
This script prepares the CSV files in an AzureML run, and writes them to an output dataset called himl_sample6_input
.
The relevant code parts are:
run_info = submit_to_azure_if_needed(
compute_cluster_name="lite-testing-ds2",
default_datastore="himldatasets",
output_datasets=["himl_sample6_input"])
# The dataset files should be written into this folder:
dataset = run_info.output_datasets[0] or Path("dataset")
Run the script:
cd examples/6
python inputs.py --azureml
You can now modify the training script examples/6/sample.py to use the newly created dataset
himl_sample6_input
as an input. To do that, the following parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
compute_cluster_name="lite-testing-ds2",
default_datastore="himldatasets",
input_datasets=["himl_sample6_input"])
When running in AzureML, the dataset will be downloaded before running the job. You can access the temporary folder where the dataset is available like this:
input_folder = run_info.input_datasets[0] or Path("dataset")
The part behind the or
statement is only necessary to keep a reasonable behaviour when running outside of AzureML:
When running in AzureML run_info.input_datasets[0]
will be populated using input dataset specified in the call to
submit_to_azure_if_needed
, and the input will be downloaded from blob storage. When running locally
run_info.input_datasets[0]
will be None
and a local folder should be populated and used.
The default_datastore
is required if using the simplest configuration for an input dataset. There are
alternatives that do not require the default_datastore
and allow a different datastore for each dataset, for example:
from health_azure import DatasetConfig, submit_to_azure_if_needed
...
run_info = submit_to_azure_if_needed(
...
input_datasets=[DatasetConfig(name="himl_sample7_input", datastore="himldatasets")],
For more details about datasets, see here
Uploading the input files manually
An alternative to writing the dataset in AzureML (as suggested above) is to create them on the local machine, and upload them manually directly to Azure blob storage.
This is shown in examples/7/inputs_via_upload.py: This script prepares the CSV files
and uploads them to blob storage, in a folder called himl_sample7_input
. Run the script:
cd examples/7
python inputs_via_upload.py
As in the above example, you can now modify the training script examples/7/sample.py to use
an input dataset that has the same name as the folder where the files just got uploaded. In this case, the following
parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
...
default_datastore="himldatasets",
input_datasets=["himl_sample7_input"],
Hyperdrive
The sample examples/8/sample.py demonstrates adding hyperparameter tuning. This shows the same hyperparameter search as in the AzureML sample.
Make the following additions:
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.train.hyperdrive.sampling import RandomParameterSampling
...
def main() -> None:
    param_sampling = RandomParameterSampling({
        "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
        "--penalty": choice(0.5, 1, 1.5)
    })

    hyperdrive_config = HyperDriveConfig(
        run_config=ScriptRunConfig(source_directory=""),
        hyperparameter_sampling=param_sampling,
        primary_metric_name='Accuracy',
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=12,
        max_concurrent_runs=4)

    run_info = submit_to_azure_if_needed(
        ...
        hyperdrive_config=hyperdrive_config)
Note that this does not make sense to run locally; it should always be run in AzureML. When invoked with:
cd examples/8
python sample.py --azureml
this will perform a Hyperdrive run in AzureML, i.e. there will be 12 child runs, each randomly drawing from the parameter sample space. AzureML can plot the metrics from the child runs, but to do that, some small modifications are required.
Add in:
run = run_info.run
...
args = parser.parse_args()
# np.str and np.float were removed in recent NumPy versions; plain str/float are equivalent here.
run.log('Kernel type', str(args.kernel))
run.log('Penalty', float(args.penalty))
...
print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
run.log('Accuracy', float(accuracy))
and these metrics will be displayed on the child runs tab in the Experiment page on AzureML.
Controlling when to submit to AzureML and when not
By default, the hi-ml
package assumes that you supply a commandline argument --azureml
(that can be anywhere on
the commandline) to trigger a submission of the present script to AzureML. If you wish to control it via a different
flag, coming out of your own argument parser, use the submit_to_azureml
argument of the function
health_azure.himl.submit_to_azure_if_needed
.
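For example, here is a minimal hedged sketch in which submission is driven by a custom --cloud flag from your own parser; the flag name and cluster name are illustrative, not part of hi-ml:
from argparse import ArgumentParser
from health_azure import submit_to_azure_if_needed

parser = ArgumentParser()
parser.add_argument("--cloud", action="store_true", help="If set, submit this script to AzureML")
args, _ = parser.parse_known_args()
run_info = submit_to_azure_if_needed(compute_cluster_name="my-cluster",
                                     submit_to_azureml=args.cloud)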
Training with k-fold cross validation in Azure ML
It is possible to create a parent run on Azure ML that is associated with one or more child runs (see here for further information). This is useful in circumstances such as k-fold cross-validation, where each individual child run performs validation on a different data split. When a HyperDriveRun is created in Azure ML, it follows this same principle and generates multiple child runs, associated with one parent.
To train with k-fold cross validation using submit_to_azure_if_needed
, you must do two things.
Call the helper function
create_crossval_hyperdrive_config
to create an AML HyperDriveConfig object representing your parent run. It will have one child run for each of the k-fold splits you request, as follows:
from health_azure import create_crossval_hyperdrive_config

hyperdrive_config = create_crossval_hyperdrive_config(num_splits,
                                                      cross_val_index_arg_name=cross_val_index_arg_name,
                                                      metric_name=metric_name)
where:
num_splits
is the number of k-fold cross validation splits you require.
cross_val_index_arg_name
is the name of the argument given to each child run, whose value denotes which split that child represents. This parameter defaults to ‘cross_validation_split_index’, in which case, supposing you specified 2 cross validation splits, one child would receive the arguments [‘--cross_validation_split_index’, ‘0’] and the other would receive [‘--cross_validation_split_index’, ‘1’]. It is up to you to then use these args to retrieve the correct split from your data (see the sketch at the end of this section).
metric_name
represents the name of a metric that you will compare your child runs by. NOTE: the run will expect to find this metric, otherwise it will fail as described here. You can log this metric in your training script as follows:
from azureml.core import Run

# Example of logging a metric called <metric_name> to an AML Run.
loss = <my_loss_calc>
run_log = Run.get_context()
run_log.log(metric_name, loss)
See the documentation here for further explanation.
The hyperdrive_config returned above must be passed into the function
submit_to_azure_if_needed
as follows:
run_info = submit_to_azure_if_needed(
    ...
    hyperdrive_config=hyperdrive_config)
This will create a parent (HyperDrive) Run, with
num_splits
children - each one associated with a different data split.
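As a hedged sketch of the child-run side (the toy data and the way the folds are built are illustrative, not part of hi-ml), each child run can read the split index from the commandline and select its fold:
from argparse import ArgumentParser

NUM_SPLITS = 5
parser = ArgumentParser()
# Default argument name used by create_crossval_hyperdrive_config
parser.add_argument("--cross_validation_split_index", type=int, default=0)
args, _ = parser.parse_known_args()

# Toy dataset: in a real script these would be your sample IDs or file names
all_samples = list(range(100))
# Every NUM_SPLITS-th sample goes into the validation fold of this child run
val_samples = [s for i, s in enumerate(all_samples) if i % NUM_SPLITS == args.cross_validation_split_index]
train_samples = [s for i, s in enumerate(all_samples) if i % NUM_SPLITS != args.cross_validation_split_index]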
Retrieving the aggregated results of a cross validation/ HyperDrive run
You can retrieve a Pandas DataFrame of the aggregated results from your cross validation run as follows:
from health_azure import aggregate_hyperdrive_metrics
df = aggregate_hyperdrive_metrics(run_id, child_run_arg_name)
where:
run_id
is a string representing the id of your HyperDriveRun. Note that this must be an instance of an AML HyperDriveRun.
child_run_arg_name
is a string representing the name of the argument given to each child run to denote its position relative to other child runs (e.g. this arg could equal ‘child_run_index’ - then each of your child runs should expect to receive the arg ‘--child_run_index’ with a value <= the total number of child runs).
If your HyperDrive run has 2 children, each logging the metrics epoch, accuracy and loss, the result would look like this:
| | 0 | 1 |
|--------------|-----------------|--------------------|
| epoch | [1, 2, 3] | [1, 2, 3] |
| accuracy | [0.7, 0.8, 0.9] | [0.71, 0.82, 0.91] |
| loss | [0.5, 0.4, 0.3] | [0.45, 0.37, 0.29] |
Here, each column is one of the splits / child runs, and each row is one of the metrics you have logged to the run.
It is possible to log tables and rows in Azure ML by calling run.log_table and run.log_row respectively. In this case, the DataFrame will contain a Dictionary entry instead of a list, where the keys are the table columns (or keywords provided to log_row), and the values are the table values, e.g.
| | 0 | 1 |
|----------------|------------------------------------------|-------------------------------------------|
| accuracy_table |{'epoch': [1, 2], 'accuracy': [0.7, 0.8]} | {'epoch': [1, 2], 'accuracy': [0.8, 0.9]} |
It is also possible to log plots in Azure ML by calling run.log_image and passing in a matplotlib plot. In this case, the DataFrame will contain a string representing the path to the artifact that is generated by AML (the saved plot in the Logs & Outputs pane of your run on the AML portal). E.g.
| | 0 | 1 |
|----------------|-----------------------------------------|---------------------------------------|
| accuracy_plot | aml://artifactId/ExperimentRun/dcid.... | aml://artifactId/ExperimentRun/dcid...|
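As a hedged sketch of how such a DataFrame could be post-processed (the toy DataFrame below just mimics the list-valued layout shown in the first table above), you can for example average the final value of a metric across the child runs:
import pandas as pd

# Toy stand-in for the DataFrame returned by aggregate_hyperdrive_metrics:
# rows are metric names, columns are child run indices, cells are lists of per-epoch values
df = pd.DataFrame({0: {"accuracy": [0.7, 0.8, 0.9]}, 1: {"accuracy": [0.71, 0.82, 0.91]}})
final_accuracy = df.loc["accuracy"].apply(lambda values: values[-1])
print(f"Mean final accuracy across child runs: {final_accuracy.mean():.3f} +/- {final_accuracy.std():.3f}")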
Modifying checkpoints stored in an AzureML run
The script in examples/modify_checkpoint/modify_checkpoint.py shows how checkpoints can be downloaded from an AzureML run, modified, and then uploaded back to a newly created run.
This can be helpful, for example, if the network architecture changed, but you do not want to re-train the stored models with the new code.
The essential bits are:
Download files from a run via
download_files_from_run_id
Modify the checkpoints
Create a new run via
create_aml_run_object
Then use
Run.upload_folder
to upload all modified checkpoints to that new run. From there, they can be consumed in a follow-up training run again via download_files_from_run_id
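The following hedged sketch outlines those steps; the run id, experiment name, folder names and the actual modification are placeholders, and the function signatures should be checked against the health_azure documentation:
from pathlib import Path
import torch
from health_azure import create_aml_run_object, download_files_from_run_id

old_run_id = "MyExperiment_1234_abcd"  # placeholder id of the run that holds the checkpoints
download_dir = Path("downloaded_checkpoints")
# Step 1: download the checkpoint files from the existing run
download_files_from_run_id(run_id=old_run_id, output_folder=download_dir, prefix="outputs/checkpoints")
# Step 2: modify the checkpoints, for example renaming state_dict keys after an architecture change
for ckpt_file in download_dir.rglob("*.ckpt"):
    checkpoint = torch.load(ckpt_file, map_location="cpu")
    # ... modify checkpoint["state_dict"] here ...
    torch.save(checkpoint, ckpt_file)
# Step 3: create a new run and upload the modified checkpoints to it
new_run = create_aml_run_object(experiment_name="modified_checkpoints")
new_run.upload_folder(name="outputs/checkpoints", path=str(download_dir))
new_run.complete()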
Submitting jobs to Singularity using Amulet
PLEASE NOTE: Amulet is only intended for those with access to internal Microsoft compute resources. To access Amulet you must have an identity associated with the Microsoft tenant. This means you must be a Microsoft employee or approved external user.
The documentation below describes how to submit hi-ml
jobs via Amulet. In addition, we also provide a
minimal sample script that shows the essential parts of a trainer script and
how it interacts with Amulet.
Install Amulet
This package is not included in the hi-ml environment definition, since these instructions only apply to users associated with the Microsoft tenant. See the instructions for installing.
Create an Azure ML Storage Account
As stated in the Amulet docs, a Storage Account is required for storing information about your experiments, outputs of jobs etc. The Storage Account must be of type Storage V2. See the docs for steps on setting up.
Add your Singularity Workspace
amlt workspace add WORKSPACE_NAME --subscription SUBSCRIPTION_ID --resource-group RESOURCE_GROUP
amlt workspace set-default VC_NAME WORKSPACE_NAME
Create or checkout a project
As stated in the docs, an Amulet project “usually corresponds to a single research endeavor, e.g. a publication”. Projects contain experiments which contain jobs. To create a project:
amlt project create <your-project-name> <storage-account-name>
To manage existing projects, use:
amlt project {list|checkout|remove}
Create a configuration file
A configuration (yaml) file is required to specify your job. For example, to run the HelloWorld model via the hi-ml runner:
description: Hello World on Singularity
environment:
  image: azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04:latest
  conda_yaml_file: $CONFIG_DIR/hi-ml/supercomputer_environment.yml
code:
  # local directory of the code. this will be uploaded to the server.
  # $CONFIG_DIR is expanded to the directory of this config file
  local_dir: $CONFIG_DIR
# list of jobs to run
jobs:
- name: HelloWorld
  sku: G1
  command:
  - python hi-ml/src/health_ml/runner.py --model=health_cpath.TilesPandaImageNetMIL --tune_encoder --batch_size=2
Note that SKU here refers to the number of GPUs/CPUs to reserve, and their memory. In this case we have specified 1 GPU. For other options, see the docs.
You can specify multiple jobs here. There are a variety of arguments for controlling factors such as sla_tier
, whether
the job is preemptible
, the job priority
and more. For full details see the docs
Submit the job to Singularity
amlt run <path-to-config> <experiment-name> -t <target-cluster>
To run a specific job from a config
If you have multiple jobs specified in your config file, it is possible to submit just one of them as follows:
amlt run <path-to-config> <experiment-name> :<job_name> -t <target-cluster>
Running a distributed job
There are multiple ways to distribute a job with Amulet. The recommended way is to add the following section to your config file:
env_defaults:
  NODES: 1
  GPUS: 8
  MEM: 32
Then you should update your job definition as follows:
jobs:
- name: <job name>
  sku: ${NODES}x${MEM}G${GPUS}
  command:
  - python <script> <args>
  process_count_per_node: ${GPUS}
Additional settings and other methods for distributing can be found here.
For training jobs using PyTorch Lightning, the field process_count_per_node
can be set to 0
or omitted altogether. This indicates to Amulet that the user is responsible for spawning the
additional processes. This is the case for PyTorch Lightning, which will later spawn 1 process
per GPU.
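For context, here is a minimal hedged sketch of a PyTorch Lightning Trainer configured for such a job; the GPU and node counts are placeholders and should mirror your sku. Lightning's DDP strategy spawns one process per GPU itself, which is why Amulet must not start additional processes.
from pytorch_lightning import Trainer

# Lightning's DDP strategy launches one training process per requested GPU,
# so process_count_per_node in the Amulet config should be 0 or omitted
trainer = Trainer(accelerator="gpu", devices=8, num_nodes=1, strategy="ddp")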
View your job
Once your job is running, you can view it in the Azure ML UI. Alternatively, you can check on the status using the
amlt
CLI as follows:
amlt status <exp-name> :<job-name-1>
Similarly, to get the STDOUT from your job, run
amlt logs <exp-name> :<job-name-1>
To view all of your runs in one place, run:
amlt browse
This will launch a Flask app which allows you to view the runs within your project.
Download the /logs outputs of your job
Amulet provides a simple way to download the logs from your job:
amlt results <exp-name> :<job-name-1>
Example Trainer Script for Amulet
The following example shows a simple PyTorch Lightning trainer script that makes use of the Amulet environment.
You will need a folder that contains the following files:
A Conda environment definition, like this.
An Amulet configuration file, like this. The configuration file should live in your repository’s root folder, and point to the Conda environment definition.
A script that uses the Amulet environment, like this
The script has a large number of comments around the correct use of the Amulet environment - please read them carefully if you want to base your training code on that example.
To submit this example script as-is via Amulet, follow these steps:
Check out the hi-ml repository.
Follow the onboarding instructions in the Amulet overview section, and create an Amulet project in the root folder of the repository.
Modify
docs/source/amulet/config.yml
to point to your storage account: replace the <storage_account_name>
placeholder with the name of your storage account.
Once Amulet is installed, submit the example jobs with the following command:
amlt run docs/source/amulet/config.yml <experiment_name> -t <name_of_your_compute_target>
Logging metrics when training models in and outside AzureML
This section describes the basics of logging to AzureML, and how this can be simplified when using PyTorch Lightning. It also describes helper functions to make logging more consistent across your code.
Basics
The mechanics of writing metrics to an ML training run inside of AzureML are described here.
Using the hi-ml-azure
toolbox, you can simplify that like this:
from health_azure import RUN_CONTEXT
...
RUN_CONTEXT.log(name="name_of_the_metric", value=my_tensor.item())
Similarly you can log strings (via the log_text
method) or figures (via the log_image
method), see the
documentation.
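For example, a hedged sketch of logging a matplotlib figure to the current run (the figure content and metric name are arbitrary):
import matplotlib.pyplot as plt
from health_azure import RUN_CONTEXT

# Create a simple figure and attach it to the current AzureML run
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.5, 0.4, 0.3])
ax.set_xlabel("epoch")
ax.set_ylabel("loss")
RUN_CONTEXT.log_image(name="loss_curve", plot=fig)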
Using PyTorch Lightning
The hi-ml
toolbox relies on pytorch-lightning
for a lot of its functionality. Logging of metrics is described in
detail
here
hi-ml
provides a Lightning-ready logger object to use with AzureML. You can add that to your trainer as you would add
a Tensorboard logger, and afterwards see all metrics in both your Tensorboard files and in the AzureML UI. This logger
can be added to the Trainer
object as follows:
from health_ml.utils import AzureMLLogger
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger
tb_logger = TensorBoardLogger("logs/")
azureml_logger = AzureMLLogger(enable_logging_outside_azure_ml=False)
trainer = Trainer(logger=[tb_logger, azureml_logger])
You do not need to make any changes to your logging code to write to both loggers at the same time. This means that, if
your code correctly writes to Tensorboard in a local run, you can expect the metrics to come out correctly in the
AzureML UI as well after adding the AzureMLLogger
.
Logging to AzureML when running outside AzureML
You may still see the need to run some of your training jobs on an individual VM, for example small jobs or for debugging. Keeping track of the results in those runs can be tricky, and comparing or sharing them even more.
All results that you achieve in such runs outside AzureML can be written straight into AzureML using the
AzureMLLogger
. Its behaviour is as follows:
When instantiated inside a run in AzureML, it will write metrics straight to the present run.
When instantiated outside an AzureML run, it will create a new
Run
object that writes its metrics straight through to AzureML, even though the code is not running in AzureML.
This behaviour is controlled by the enable_logging_outside_azure_ml
argument. With the following code snippet,
you can use the AzureMLLogger
to write metrics to AzureML when the code is inside or outside AzureML:
from health_ml.utils import AzureMLLogger
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning import Trainer
tb_logger = TensorBoardLogger("logs/")
azureml_logger = AzureMLLogger(enable_logging_outside_azure_ml=True)
trainer = Trainer(logger=[tb_logger, azureml_logger])
If this is executed on a VM outside an AzureML run, you will see additional information printed to the console like this:
Writing metrics to run ed52cfac-1b85-42ea-8ebe-2f90de21be6b in experiment azureml_logger.
To check progress, visit this URL: https://ml.azure.com/runs/ed52cfac-1b85-42ea-8ebe-2f90de21be...
Clicking on the URL will take you to the AzureML web page, where you can inspect the metrics that the run has written so far.
There are a few points that you should note:
Experiments: Each run in AzureML is associated with an experiment. When executed in an AzureML run,
the AzureMLLogger
will know which experiment to write to. Outside AzureML, on your VM, the logger will default to
using an experiment called azureml-logger
. This means that runs inside and outside AzureML end up in different
experiments. You can customize this like in the following code snippet, so that the submitted runs and the runs outside
AzureML end up in the same experiment:
from health_azure import submit_to_azure_if_needed
from health_ml.utils import AzureMLLogger
from pytorch_lightning import Trainer
experiment_name = "my_new_architecture"
submit_to_azure_if_needed(compute_cluster_name="nd24",
experiment_name=experiment_name)
azureml_logger = AzureMLLogger(enable_logging_outside_azure_ml=True, experiment_name=experiment_name)
trainer = Trainer(logger=[azureml_logger])
Snapshots: The run that you are about to create can follow the usual pattern of AzureML runs, and can create a full
snapshot of all code that was used in the experiment. This will greatly improve reproducibility of your experiments. By
default, this behaviour is turned off, though. You can provide an additional argument to the logger, like
AzureMLLogger(snapshot_directory='/users/me/code/trainer')
to include the given folder in the snapshot. In addition,
you can place a file called .amlignore
to exclude previous results, or large checkpoint files, from being included in
the snapshot
(see here for details)
Making logging consistent when training with PyTorch Lightning
A common problem of training scripts is that the calls to the logging methods tend to run out of sync. The .log
method
of a LightningModule
has a lot of arguments, some of which need to be set correctly when running on multiple GPUs.
To simplify that, there is a function log_on_epoch
that turns synchronization across nodes on/off depending on the
number of GPUs, and always forces the metrics to be logged upon epoch completion. Use as follows:
from health_ml.utils import log_on_epoch
from pytorch_lightning import LightningModule
class MyModule(LightningModule):
def training_step(self, *args, **kwargs):
...
loss = my_loss(y_pred, y)
log_on_epoch(self, loss)
return loss
Logging learning rates
Logging learning rates is important for monitoring training, but again this can add overhead. To log learning rates easily and consistently, we suggest either of two options:
Add a
LearningRateMonitor
callback to your trainer, as described hereUse the
hi-ml
functionlog_learning_rate
The log_learning_rate
function can be used at any point in the training code, like this:
from health_ml.utils import log_learning_rate
from pytorch_lightning import LightningModule
class MyModule(LightningModule):
def training_step(self, *args, **kwargs):
...
log_learning_rate(self, "learning_rate")
loss = my_loss(y_pred, y)
return loss
log_learning_rate
will log values from all learning rate schedulers, and all learning rates if a scheduler returns
multiple values. In this example, the logged metric will be learning_rate
if there is a single scheduler that outputs
a single LR, or learning_rate/1/0
to indicate the value coming from scheduler index 1, value index 0.
Performance Diagnostics
The hi-ml
toolbox offers several components to integrate with PyTorch Lightning based training workflows:
The
AzureMLProgressBar
is a replacement for the default progress bar that the Lightning Trainer uses. Its output is more suitable for display in an offline setup like AzureML.
The
BatchTimeCallback
can be added to the trainer to detect performance issues with data loading.
AzureMLProgressBar
The standard PyTorch Lightning progress bar is well suited for interactive training sessions on a GPU machine, but its output can get
confusing when run inside AzureML. The AzureMLProgressBar
class can replace the standard progress bar, and optionally
adds timestamps to each progress event. This makes it easier to later correlate training progress with, for example, low
GPU utilization showing in AzureML’s GPU monitoring.
Here’s a code snippet to add the progress bar to a PyTorch Lightning Trainer object:
from health_ml.utils import AzureMLProgressBar
from pytorch_lightning import Trainer
progress = AzureMLProgressBar(refresh_rate=100, print_timestamp=True)
trainer = Trainer(callbacks=[progress])
This produces progress information like this:
2021-10-20T06:06:07Z Training epoch 18 (step 94): 5/5 (100%) completed. 00:00 elapsed, total epoch time ~ 00:00
2021-10-20T06:06:07Z Validation epoch 18: 2/2 (100%) completed. 00:00 elapsed, total epoch time ~ 00:00
2021-10-20T06:06:07Z Training epoch 19 (step 99): 5/5 (100%) completed. 00:00 elapsed, total epoch time ~ 00:00
...
BatchTimeCallback
This callback can help diagnose issues with low performance of data loading. It captures the time between the end of a training or validation step, and the start of the next step. This is often indicative of the time it takes to retrieve the next batch of data: When the data loaders are not performant enough, this time increases.
The BatchTimeCallback
will detect minibatches where the estimated data loading time is too high, and print alerts.
These alerts will be printed at most 5 times per epoch, for a maximum of 3 epochs, to avoid cluttering the output.
Note that it is common for the first minibatch of data in an epoch to take a long time to load, because data loader processes need to spin up.
The callback will log a set of metrics:
timing/train/batch_time [sec] avg
and timing/train/batch_time [sec] max
: Average and maximum time that it takes for batches to train/validate.
timing/train/batch_loading_over_threshold [sec]
is the total time wasted per epoch in waiting for the next batch of data. This is computed by looking at all batches where the batch loading time was over the threshold max_batch_load_time_seconds
(that is set in the constructor of the callback), and totalling the batch loading time for those batches.
timing/train/epoch_time [sec]
is the time for an epoch to complete.
Caveats
In distributed training, the performance metrics will be collected at rank 0 only.
The time between the end of a batch and the start of the next batch is also impacted by other callbacks. If you have callbacks that are particularly expensive to run, for example because they actually have their own model training, the results of the
BatchTimeCallback
may be misleading.
Usage example
from health_ml.utils import BatchTimeCallback
from pytorch_lightning import Trainer
batchtime = BatchTimeCallback(max_batch_load_time_seconds=0.5)
trainer = Trainer(callbacks=[batchtime])
This would produce output like this:
Epoch 18 training: Loaded the first minibatch of data in 0.00 sec.
Epoch 18 validation: Loaded the first minibatch of data in 0.00 sec.
Epoch 18 training took 0.02sec, of which waiting for data took 0.01 sec total.
Epoch 18 validation took 0.00sec, of which waiting for data took 0.00 sec total.
Running ML experiments with hi-ml
The hi-ml toolbox is capable of training any PyTorch Lightning (PL) model inside of AzureML, making use of these features:
Training on a local GPU machine or inside of AzureML without code changes
Working with different models in the same codebase, and selecting one by name
Distributed training in AzureML
Logging via AzureML’s native capabilities
Evaluation of the trained model on new datasets
This can be used by invoking the hi-ml runner and providing the name of the container class, like this:
himl-runner --model=MyContainer
.
There is a fully working example, HelloContainer, which
implements a simple 1-dimensional regression model from data stored in a CSV file. You can run that
from the command line via himl-runner --model=HelloWorld
.
Specifying the model to run
The --model
argument specifies the name of a class that should be used for model training. The class needs to
be a subclass of LightningContainer
, see below. There are different ways of telling the runner where to find
that class:
If just providing a single class name, like
--model=HelloWorld
, the class is expected somewhere in the health_ml.configs
namespace. It can be in any module/folder inside of that namespace.
If the class is outside of the
health_ml.configs
namespace (as would be normal if using the himl-runner
from a package), you need to provide some “hints” where to start searching. It is enough to provide the start of the namespace string: for example, --model health_cpath.PandaImageNetMIL
is effectively telling the runner to search for the PandaImageNetMIL
class anywhere in the health_cpath
namespace. You can think of this as health_cpath.*.PandaImageNetMIL
Running ML experiments in Azure ML
To train in AzureML, use the flag --cluster
to specify the name of the cluster
in your Workspace that you want to submit the job to. So the whole command would look like:
himl-runner --model=HelloWorld --cluster=my_cluster_name
You can also specify --num_nodes
if you wish to distribute the model training.
When starting the runner, you need to do that from a directory that contains all the code that your experiment needs: The current working directory will be used as the root of all data that will be copied to AzureML to run your experiment. (the only exception to this rule is if you start the runner from within an enlistment of the HI-ML GitHub repository).
AzureML needs to know which Python/Conda environment it should use. For that, the runner needs a file environment.yml
that contains a Conda environment definition. This file needs to be present either in the current working directory or
one of its parents. To specify a Conda environment that is located elsewhere, you can use
himl-runner --model=HelloWorld --cluster=my_cluster_name --conda_env=/my/folder/to/special_environment.yml
Setup - creating your model config file
In order to use these capabilities, you need to implement a class deriving from
health_ml.lightning_container.LightningContainer
. This class encapsulates everything that is needed for training
with PyTorch Lightning:
For example:
class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.azure_datasets = ["folder_name_in_azure_blob_storage"]
        self.local_datasets = [Path("/some/local/path")]
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset)
The create_model
method needs to return a subclass of PyTorch Lightning’s LightningModule, that has
all the usual PyTorch Lightning methods required for training, like the training_step
and forward
methods. E.g:
class MyLightningModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = ...

    def training_step(self, *args, **kwargs):
        ...

    def forward(self, *args, **kwargs):
        ...

    def configure_optimizers(self):
        ...

    def test_step(self, *args, **kwargs):
        ...
The get_data_module
method of the container needs to return a DataModule (inheriting from a PyTorch Lightning DataModule) which contains all of the logic for
downloading, preparing and splitting your dataset, as well as methods for wrapping the train, val and test datasets
respectively with DataLoaders. E.g:
class MyDataModule(LightningDataModule):
    def __init__(self, root_path: Path):
        super().__init__()
        # All data should be read from the folder given in self.root_path
        self.root_path = root_path

    def train_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        train_dataset = ...
        return DataLoader(train_dataset, batch_size=5, num_workers=5)

    def val_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        val_dataset = ...
        return DataLoader(val_dataset, batch_size=5, num_workers=5)

    def test_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        test_dataset = ...
        return DataLoader(test_dataset, batch_size=5, num_workers=5)
So, the full file would look like:
from pathlib import Path
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, LightningDataModule
from health_ml.lightning_container import LightningContainer


class MyLightningModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = ...

    def training_step(self, *args, **kwargs):
        ...

    def forward(self, *args, **kwargs):
        ...

    def configure_optimizers(self):
        ...

    def test_step(self, *args, **kwargs):
        ...


class MyDataModule(LightningDataModule):
    def __init__(self, root_path: Path):
        super().__init__()
        # All data should be read from the folder given in self.root_path
        self.root_path = root_path

    def train_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        train_dataset = ...
        return DataLoader(train_dataset, batch_size=5, num_workers=5)

    def val_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        val_dataset = ...
        return DataLoader(val_dataset, batch_size=5, num_workers=5)

    def test_dataloader(self, *args, **kwargs) -> DataLoader:
        # The data should be read off self.root_path
        test_dataset = ...
        return DataLoader(test_dataset, batch_size=5, num_workers=5)


class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.azure_datasets = ["folder_name_in_azure_blob_storage"]
        self.local_datasets = [Path("/some/local/path")]
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset)
By default, config files will be looked for in the folder “health_ml.configs”. To specify config files
that live elsewhere, use a fully qualified name for the parameter --model
- e.g. “MyModule.Configs.my_config.py”
Outputting files during training
The Lightning model returned by create_model
needs to write its output files to the current working directory.
When running inside of AzureML, the output folders will be directly under the project root. If not running inside
AzureML, a folder with a timestamp will be created for all outputs and logs.
When running in AzureML, the folder structure will be set up such that all files written to the current working directory are later uploaded to Azure blob storage at the end of the AzureML job. The files will also be later available via the AzureML UI.
Trainer arguments
All arguments that control the PyTorch Lightning Trainer
object are defined in the class TrainerParams
. A
LightningContainer
object inherits from this class. The most essential one is the max_epochs
field, which controls
the max_epochs
argument of the Trainer
.
For example:
from pytorch_lightning import LightningModule, LightningDataModule
from health_ml.lightning_container import LightningContainer


class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset)
Optimizer and LR scheduler arguments
To configure the optimizer and LR scheduler, the Lightning model returned by create_model
should define its own
configure_optimizers
method, with the same signature as LightningModule.configure_optimizers
,
and return a tuple containing the Optimizer and LRScheduler objects.
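A hedged sketch of such a configure_optimizers method (the choice of optimizer and scheduler is arbitrary):
import torch
from pytorch_lightning import LightningModule

class MyLightningModel(LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
        # PyTorch Lightning accepts a tuple of two lists: optimizers and LR schedulers
        return [optimizer], [scheduler]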
Run inference with a pretrained model
You can use the hi-ml-runner in inference mode only by switching the --run_inference_only
flag on and specifying
the model weights by setting the --src_checkpoint
argument. With this flag, the model will be evaluated on the test set
only. There is also an option for evaluating the model on a full dataset, described further below.
Specifying the checkpoint to use
When running inference on a trained model, you need to provide a model checkpoint that should be used. This is done via
the --src_checkpoint
argument. This supports three types of checkpoints:
A local path where the checkpoint is stored
--src_checkpoint=local/path/to/my_checkpoint/model.ckpt
A remote URL from where to download the weights
--src_checkpoint=https://my_checkpoint_url.com/model.ckpt
An AzureML run id where checkpoints are saved in
outputs/checkpoints
. For this specific use case, you can experiment with different checkpoints by setting --src_checkpoint
according to the format <azureml_run_id>:<optional/custom/path/to/checkpoints/><filename.ckpt>
. If no custom path is provided (e.g., --src_checkpoint=AzureML_run_id:best.ckpt
), we assume the checkpoints to be saved in the default checkpoints folder outputs/checkpoints
. If no filename is provided (e.g., --src_checkpoint=AzureML_run_id
), the last epoch checkpoint outputs/checkpoints/last.ckpt
will be loaded.
Refer to Checkpoints Utils for more details on how checkpoints are parsed.
Running inference on the test set
When supplying the flag --run_inference_only
on the commandline, no model training will be run, and only inference on the
test set will be done:
The model weights will be loaded from the location specified by
--src_checkpoint
A PyTorch Lightning
Trainer
object will be instantiated.
The test set will be read out from the data module specified by the
get_data_module
method of the LightningContainer
object.
The model will be evaluated on the test set, by running
trainer.test
. Any special logic to use during the test step will need to be added to the model’s test_step
method.
Running the following command line will run inference using the MyContainer
model with weights from the checkpoint saved
in the AzureML run MyContainer_XXXX_yyyy
at the best validation loss epoch /outputs/checkpoints/best_val_loss.ckpt
.
himl-runner --model=MyContainer --run_inference_only --src_checkpoint=MyContainer_XXXX_yyyy:best_val_loss.ckpt
Running inference on a full dataset
When supplying the flag --mode=eval_full
on the commandline, no model training will be run, and the model will be
evaluated on a dataset different from the training/validation/test dataset. This dataset is loaded via the
get_eval_data_module
method of the container.
The model weights will be loaded from the location specified by
--src_checkpoint
A PyTorch Lightning
Trainer
object will be instantiated.
The test set will be read out from the data module specified by the
get_eval_data_module
method of the LightningContainer
object. The data module itself can read data from a mounted Azure dataset, which will be made available for the container at the path self.local_datasets
. In a typical use-case, all the data in that dataset will be put into the test_dataloader
field of the data module.
The model will be evaluated on the test set, by running
trainer.test
. Any special logic to use during the test step will need to be added to the model’s test_step
method.
Running the following command line will run inference using the MyContainer
model with weights from the checkpoint saved
in the AzureML run MyContainer_XXXX_yyyy
at the best validation loss epoch /outputs/checkpoints/best_val_loss.ckpt
.
himl-runner --model=MyContainer --src_checkpoint=MyContainer_XXXX_yyyy:best_val_loss.ckpt --mode=eval_full --azure_datasets=my_new_dataset
The example code snippet here shows how to add a method that reads the inference dataset. In this example, we assume
that the MyDataModule
class has an argument splits
that specifies the fraction of data to go into the training,
validation, and test data loaders.
class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        self.azure_datasets = ["folder_name_in_azure_blob_storage"]
        self.local_datasets = [Path("/some/local/path")]
        self.max_epochs = 42

    def create_model(self) -> LightningModule:
        return MyLightningModel()

    def get_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset, splits=(0.7, 0.2, 0.1))

    def get_eval_data_module(self) -> LightningDataModule:
        return MyDataModule(root_path=self.local_dataset, splits=(0.0, 0.0, 1.0))
Resume training from a given checkpoint
Analogously, one can resume training by setting --src_checkpoint
and --resume_training
to train a model longer.
The PyTorch Lightning trainer will initialize the Lightning module from the given checkpoint, corresponding to the best
validation loss epoch, as set in the following commandline.
himl-runner --model=MyContainer --cluster=my_cluster_name --src_checkpoint=MyContainer_XXXX_yyyy:best_val_loss.ckpt --resume_training
Warning: When resuming training, one should make sure to set container.max_epochs
greater than the last epoch of the
specified checkpoint. A misconfiguration exception will be raised otherwise:
pytorch_lightning.utilities.exceptions.MisconfigurationException: You restored a checkpoint with current_epoch=19, but you have set Trainer(max_epochs=4).
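For example, a hedged sketch of a container adjusted for resuming (the epoch numbers are illustrative): if the checkpoint was saved at epoch 19, max_epochs must be set above that value.
from health_ml.lightning_container import LightningContainer

class MyContainer(LightningContainer):
    def __init__(self):
        super().__init__()
        # The checkpoint given via --src_checkpoint was written at epoch 19,
        # hence training resumes and continues up to epoch 50
        self.max_epochs = 50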
Logging to AzureML when running outside AzureML
The runner offers the ability to log metrics to AzureML, even if the present training is not running inside of AzureML. This adds an additional level of traceability for runs on GPU VMs, where there is otherwise no record of any past training.
You can trigger this behaviour by specifying the --log_from_vm
flag. For the HelloWorld
model, this
will look like:
himl-runner --model=HelloWorld --log_from_vm
For logging to work, you need to have a config.json
file in the current working directory (or one of its
parent folders) that specifies the AzureML workspace itself. When starting the runner, you will be asked
to authenticate to AzureML.
There are two additional flags that can be used to control the logging behaviour:
The
--experiment
flag sets which AzureML experiment to log to. By default, the experiment name will be the name of the model class (HelloWorld
in the above example).
The
--tag
flag sets the display name for the AzureML run. You can use that to give your run a memorable name, and later easily find it in the AzureML UI.
The following command will log to the experiment my_experiment
, in a run that is labelled my_first_run
in the UI:
himl-runner --model=HelloWorld --log_from_vm --experiment=my_experiment --tag=my_first_run
Starting experiments with different seeds
To assess the variability of metrics, it is often useful to run the same experiment multiple times with different seeds.
There is a built-in functionality of the runner to do this. When adding the commandline flag --different_seeds=3
, your
experiment will get run 3 times with seeds 0, 1 and 2. This is equivalent to starting the runner with arguments
--random_seed=0
, --random_seed=1
and --random_seed=2
.
These runs will be started in parallel in AzureML via the HyperDrive framework. It is not possible to run with different
seeds on a local machine, other than by manually starting runs with --random_seed=0
etc.
Common problems with running in AML
"Your total snapshot size exceeds the limit <SNAPSHOT_LIMIT>"
. Cause: The size of your source directory is larger than the limit that AML sets for snapshots. Solution: check for cache files, log files or other files that are not necessary for running your experiment and add them to a .amlignore
file in the root directory. Alternatively, you can see Azure ML documentation for instructions on increasing this limit, although it will make your jobs slower.
"FileNotFoundError"
. Possible cause: Symlinked files. Azure ML SDK v2 will resolve the symlink and attempt to upload the resolved file. Solution: Remove symlinks from any files that should be uploaded to Azure ML.
Checkpoint Utils
The hi-ml toolbox offers different utilities to parse and download pretrained checkpoints, abstracting checkpoint downloading from different sources. Refer to CheckpointParser for more details on the supported checkpoint formats. Here’s how you can use the checkpoint parser depending on the source:
For a local path, simply pass it as shown below. The parser will further check if the provided path exists:
from health_ml.utils.checkpoint_utils import CheckpointParser
download_dir = 'outputs/checkpoints'
checkpoint_parser = CheckpointParser(checkpoint='local/path/to/my_checkpoint/model.ckpt')
print('Checkpoint', checkpoint_parser.checkpoint, 'is a local file', checkpoint_parser.is_local_file)
local_file = checkpoint_parser.get_or_download_checkpoint(download_dir)
To download a checkpoint from a URL:
from health_ml.utils.checkpoint_utils import CheckpointParser, MODEL_WEIGHTS_DIR_NAME
download_dir = 'outputs/checkpoints'
checkpoint_parser = CheckpointParser('https://my_checkpoint_url.com/model.ckpt')
print('Checkpoint', checkpoint_parser.checkpoint, 'is a URL', checkpoint_parser.is_url)
# will download the checkpoint to download_dir/MODEL_WEIGHTS_DIR_NAME
path_to_ckpt = checkpoint_parser.get_or_download_checkpoint(download_dir)
Finally, checkpoints from Azure ML runs can be reused by providing an id in this format
<AzureML_run_id>:<optional/custom/path/to/checkpoints/><filename.ckpt>
. If no custom path is provided (e.g., <AzureML_run_id>:<filename.ckpt>
), the checkpoint will be downloaded from the default checkpoint folder (e.g., outputs/checkpoints
). If no filename is provided (e.g., src_checkpoint=<AzureML_run_id>
), the latest checkpoint will be downloaded (e.g., last.ckpt
).
from health_ml.utils.checkpoint_utils import CheckpointParser
download_dir = 'outputs/checkpoints'
checkpoint_parser = CheckpointParser('AzureML_run_id:best.ckpt')
print('Checkpoint', checkpoint_parser.checkpoint, 'is an AML run', checkpoint_parser.is_aml_run_id)
path_azure_ml_ckpt = checkpoint_parser.get_or_download_checkpoint(download_dir)
If the Azure ML run is in a different workspace, a temporary SAS URL to download the checkpoint can be generated as follows:
cd hi-ml-cpath
python src/health_cpath/scripts/generate_checkpoint_url.py --run_id=AzureML_run_id:best_val_loss.ckpt --expiry_days=10
N.B: config.json should correspond to the original workspace where the AML run lives.
Use cases
CheckpointParser is used to specify a src_checkpoint
to resume training from a given
checkpoint,
or run inference with a pretrained model,
as well as
ssl_checkpoint
for computational pathology self-supervised pretrained encoders.
HI-ML Tools for Computational Pathology
The directory hi-ml-cpath
contains code
for running experiments in Computational Pathology.
The tools for computational pathology are best used directly from the Git repository.
You can also use the hi-ml-cpath
PyPi package
to re-use the code in your own projects, for example the deep learning architectures.
Setting up your computer
Please follow the instructions in README to set up your local Python environment.
Onboarding to Azure
Please follow the instructions here to create an AzureML workspace if you don’t have one yet. You will also need to download the workspace configuration file, as described here, so that your code knows which workspace to access.
Creating datasets
In our example models, we are working with two public datasets, PANDA and TCGA-Crck.
Please follow the detailed instructions to download and prepare these datasets in Azure.
Training models
Visualizing data and results in Digital Slide Archive DSA
New Model configurations
To define your own model configuration, place a class definition in the directory health_cpath.configs
. The class should
inherit from a
LightningContainer.
As an example, please check the HelloWorld
model
or the base class for the MIL
models.
Mount datasets
If you would like to inspect or analyze the datasets that are stored in Azure Blob Storage, you can either download them or mount them. “Mounting” here means that the dataset will be loaded on-demand over the network (see also the docs). This is ideal if you expect that you will only need a small number of files, or if the disk of your machine is too small to download the full dataset.
You can mount the dataset by executing this script in <root>/hi-ml-cpath
:
python src/histopathology/scripts/mount_azure_dataset.py --dataset_id PANDA
After a few seconds, this may bring up a browser to authenticate you in Azure, and let you access the AzureML
workspace that you chose by downloading the config.json
file. If you get an error message saying that authentication
failed (error message contains “The token is not yet valid (nbf)”), please ensure that your
system’s time is set correctly and then try again. On WSL, you can use sudo hwclock -s
.
Upon success, the script will print out:
Dataset PANDA will be mounted at /tmp/datasets/PANDA.
Using Public Datasets PANDA and TCGA-Crck
Setting up your storage account and tools
Your Azure account needs to have permissions to write to the storage account. You need to have, for example, “Storage Blob Data Contributor” permissions.
Your storage account should have a container called datasets
. Verify in the Azure Portal that this container exists:
You can do that by going to the “Containers” section in the left-hand navigation. If there is no such container, create
one with the “+ Container” button.
To upload datasets to Azure, we recommend using azcopy. You will first need to log in,
by calling azcopy login
. Follow the instructions at the prompt.
PANDA Dataset
The PANDA dataset was released with the Prostate cANcer graDe Assessment (PANDA) Challenge. The dataset is available from Kaggle. To get access to the dataset, you need to register with Kaggle, then press “Download all”. This will download a 200GB ZIP file.
Uploading
Now unzip the ZIP file. Then rename the folder in which the files reside to PANDA
. To double-check, there should now
be a file PANDA/train.csv
in your current directory.
head PANDA/train.csv # just to check if we are in the right folder
azcopy copy PANDA https://<your_storage_account>.blob.core.windows.net/datasets/ --recursive
TCGA-Crck Dataset
This dataset contains histological images from patients with colorectal cancer, available from here.
To download and prepare the dataset, please run the following commands in a Linux shell in the root folder of the git repository. To prepare:
In the last statement, where we upload the full dataset to Azure, replace
<your_storage_account>
with the name of your Azure storage account.
For Python to pick up the paths in
hi-ml-cpath/src/histopathology/scripts/tcga_dataset_prep.py
, you need to add the hi-ml-cpath/src folder to the PYTHONPATH environment variable, as done in the commands below.
Note: Depending on the speed of your internet connection, this script can run for several hours because it downloads a total of more than 20GB of files. It is advantageous to run the script in a Virtual Machine in Azure (downloading can be easily automated because no authentication is required).
mkdir TCGA-Crck
cd TCGA-Crck
# Download the files and unzip into folder broken down by Train/Test
for file in CRC_DX_TRAIN_MSIMUT.zip CRC_DX_TRAIN_MSS.zip
do
wget https://zenodo.org/record/2530835/files/$file
unzip $file -d CRC_DX_TRAIN
rm $file
done
for file in CRC_DX_TEST_MSIMUT.zip CRC_DX_TEST_MSS.zip
do
wget https://zenodo.org/record/2530835/files/$file
unzip $file -d CRC_DX_TEST
rm $file
done
# Create a summary file dataset.csv with all file paths and class labels
cd ..
export PYTHONPATH=`pwd`/hi-ml-cpath/src
python hi-ml-cpath/src/histopathology/scripts/tcga_dataset_prep.py
# Upload
azcopy copy TCGA-Crck https://<your_storage_account>.blob.core.windows.net/datasets/ --recursive
Making your storage account accessible to AzureML
As a last step, you need to ensure that AzureML has access to your storage account. For that, you need to create a datastore. A datastore is an abstraction layer on top of the plain storage account, where AzureML stores an account key or access token that allows it to later download the data to the training machine.
To create the datastore, please follow the instructions here. Once the datastore is created, mark it as the default datastore.
DeepMIL model for tumor grading on PANDA dataset
Background
The repository contains the configuration for training Deep Multiple Instance Learning (DeepMIL) models for ISUP score prediction on the PANDA challenge dataset. The models are developed to reproduce the results described in Myronenko et al. 2021 and the example in the Project-MONAI tutorial.
A ResNet50 encoder that was pre-trained on ImageNet is downloaded on-the-fly (from here) at the start of the training run.
Preparations
Please follow the instructions in the Readme file to create a Conda environment and activate it, and the instructions to set up Azure.
You will also need to run the dataset preparations for the PANDA dataset, as described here.
Running the model as-is
If you have a GPU available, you can run training on that machine, by executing in <root>/hi-ml-cpath
:
conda activate HimlHisto
python ../hi-ml/src/health_ml/runner.py --model health_cpath.SlidesPandaImageNetMILBenchmark
Running the model will automatically mount (download on-the-fly) the PANDA dataset from Azure. To enable that, you will be asked to log into Azure a few seconds after the start of the script. This will either pop up a browser window automatically, or give you a prompt on the console to open the browser.
Once the authentication is completed, it will access the AzureML
workspace that you chose by downloading the config.json
file. If you get an error message saying that authentication
failed, “The token is not yet valid (nbf)”, please ensure that your
system’s time is set correctly (on WSL, use sudo hwclock -s
) and then try again.
However, the GPU demand for this model is rather high. We recommend running in AzureML, on a GPU compute cluster. You
can run the training in the cloud by simply appending the name of the compute cluster, --cluster=<your_cluster_name>. In
. In
addition, you can turn on fine-tuning of the encoder, which will improve the results further:
conda activate HimlHisto
python ../hi-ml/src/health_ml/runner.py --model health_cpath.SlidesPandaImageNetMILBenchmark --tune_encoder --cluster=<your_cluster_name>
Then the script will output “Successfully queued run number …” and a line prefixed “Run URL: …”. Open that URL to view the submitted run in AzureML, view progress, metrics, etc.
Expected results
The best runs so far use a Transformer Pooling layer similar to the one implemented in Myronenko et al. 2021
(pool_type=TransformerPoolingBenchmark.__name__
), in combination with fine-tuning the encoder.
SlidesPandaImageNetMIL
model trained with fine-tuning, cross validation metrics (mean ± std):
Validation accuracy: 0.7828 ± 0.0037
Validation AUROC: 0.9473 ± 0.002
Validation QWK: 0.8793 ± 0.0074
For internal reference, this was run HD_0e805b91-319d-4fde-8bc3-1cea3a6d08dd
on innereye4ws
.
Model variants
Six different pooling layers can be used by changing parameter pool_type
in the configuration file, or chosen on the commandline. Available pooling layers include:
pool_type=AttentionLayer.__name__
: Attention layer from Ilse et al. 2018
pool_type=GatedAttentionLayer.__name__
: Gated attention layer from Ilse et al. 2018
pool_type=MaxPoolingLayer.__name__
: Max pooling layer returns frequency normalized weights and the maximum feature vector over the first axis
pool_type=MeanPoolingLayer.__name__
: Mean pooling layer returns uniform weights and the average feature vector over the first axis
pool_type=TransformerPooling.__name__
: Transformer pooling layer
pool_type=TransformerPoolingBenchmark.__name__
: Transformer pooling layer used in Myronenko et al. 2021
Use the optional runner argument --mount_in_azureml
to mount the PANDA dataset on AzureML, instead of downloading it.
This will mean that the job starts faster, but may not run at maximum speed because of network bottlenecks.
Cross-validation
To use cross-validation, supply the additional commandline flag --crossval_count=5
for 5-fold cross-validation, like:
python ../hi-ml/src/health_ml/runner.py --model health_cpath.SlidesPandaImageNetMILBenchmark --crossval_count=5 --cluster=<your_cluster_name>
Cross-validation will start 5 training runs in parallel. For this reason, cross-validation can only be used in AzureML.
To compute aggregated metrics of the hyperdrive run in Azure ML, replace the run_id
in
hi-ml-cpath/src/histopathology/scripts/aggregate_metrics_crossvalidation.py
with the Run ID of the hyperdrive
run, and run the script as follows:
conda activate HimlHisto
python hi-ml-cpath/src/histopathology/scripts/aggregate_metrics_crossvalidation.py
DeepMIL model for genetic mutation on TCGA-CRCK
Background
The repository contains the configuration for training Deep Multiple Instance Learning (DeepMIL) models for microsatellite instability (MSI) prediction on the TCGA-CRCk dataset. Instructions to download and prepare the dataset are here. The dataset is composed of tiles.
The models are developed to reproduce the results described in the DeepSMILE papers Schirris et al. 2021 and Schirris et al. 2022.
A ResNet18 encoder that was pre-trained on ImageNet is downloaded on-the-fly (from here) at the start of the training run when using the ImageNet configuration. The SSL MIL configuration requires the checkpoint of an SSL encoder (ResNet50) pre-trained on the TCGA-CRCk dataset. The SSL encoder can be obtained by following the instructions on how to train a custom SSL encoder on a pre-tiled dataset.
Preparations
Please follow the instructions in the Readme file to create a Conda environment and activate it, and the instructions to set up Azure to run in the cloud.
Running the model
You can run the model in the same way you run the benchmark model on PANDA. If you have a GPU available locally, you can run training on that machine, by executing in <root>/hi-ml-cpath
:
conda activate HimlHisto
python ../hi-ml/src/health_ml/runner.py --model health_cpath.TcgaCrckImageNetMIL
If you set up a GPU cluster in Azure, you can run the training in the cloud by simply appending the name of the compute cluster:
conda activate HimlHisto
python ../hi-ml/src/health_ml/runner.py --model health_cpath.TcgaCrckImageNetMIL --cluster=<your_cluster_name>
Assuming you pre-trained your own SSL encoder and updated the checkpoint path in the SSLencoder class, the SSL MIL model configuration can be run using the flag --model health_cpath.TcgaCrckSSLMIL
.
Expected results
ImageNet MIL (ResNet18)
Test AUROC: 0.706 ± 0.041
Test F1 score: 0.490 ± 0.041
SSL MIL (ResNet50)
Test AUROC: 0.825 ± 0.065
Test F1 score: 0.516 ± 0.171
Pre-training of an image encoder using self-supervised learning
Often in histopathology, we only have access to weak labels, e.g., a single label for an entire Whole Slide Image (WSI). However, papers like DeepSMILE show that we can use unlabeled tiles from WSIs to pre-train an image encoder using Self-Supervised Learning (SSL). In hi-ml, we have implemented two popular self-supervised learning methods, SimCLR and BYOL. We will use the TCGA-CRCk dataset, as seen in Kather et al. 2019, as an example to show how to set up SSL training in hi-ml; the dataset comes with binary WSI labels (microsatellite stable or instable). If you want to use your own dataset, you will find instructions at the end of this section.
Example: Train an image encoder using SSL on the TCGA-CRCk locally
The TCGA-CRCk dataset consists of colorectal tumor tiles extracted from Formalin-Fixed, Paraffin-Embedded (FFPE) WSIs from the Cancer Genome Atlas (TCGA) with accompanying binarized MicroSatellite Instability (MSI) labels. In the case of TCGA-CRCk, the dataset is already tiled, i.e., the WSI are not available. In public_datasets.md you will find instructions on how to download and setup the TCGA-CRCk dataset.
To train an image encoder using SSL locally, run this in the hi-ml-cpath
folder, with the HimlHisto
conda
environment activated:
python ../hi-ml/src/health_ml/runner.py --model SSL.CRCK_SimCLR
The model class
CRCK_SimCLR
is the config used to train an SSL model on TCGA-CRCk. It houses everything, e.g., the model, the dataset, checkpointing,
etc. Here, we need to define some important parameters:
The type of image encoder we want to train, the type of SSL (SimCLR or BYOL) we want to use, and the batch_size.
The dataset we want to use for training the image encoder, and the dataset for the linear model that we only use for evaluation of the image encoder. In theory, they could be two different datasets.
Model checkpointing: We use PyTorch Lightning checkpointing. Among others, we define the validation metric, where the
online_evaluator
is the same as the linear_head
. In the case of TCGA-CRCk, we use AUC ROC as the validation metric.
In the parent class of CRCK_SimCLR
, HistoSSLContainer
, the data augmentations are defined. Data augmentation
is one of the most important components of SSL training. Currently, we have hardcoded the data augmentations used in the
SimCLR paper.
While not optimized for WSIs, we observe good performance using these augmentations. The data augmentations are wrapped by
DualViewTransformWrapper
to return two augmented versions per tile, as required by the majority of SSL methods.
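Conceptually, such a wrapper applies the (random) augmentation pipeline twice to the same tile. The following is a hedged, simplified sketch of the idea, not the actual DualViewTransformWrapper implementation:
class DualViewTransform:
    """Applies the same random augmentation pipeline twice, returning two views of one tile."""

    def __init__(self, transform):
        self.transform = transform

    def __call__(self, image):
        # Because the transform is stochastic, the two returned views differ
        return self.transform(image), self.transform(image)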
Train on Azure
In the case of SimCLR, the effective batch_size (batch_size * number of GPUs) should be as big as possible. The SSL models in hi-ml natively support distributed training using multiple GPUs. We recommend using 8 GPUs for running the SimCLR model on the TCGA-CRCk dataset. Assuming you are using a total of 8 GPUs (e.g. 1 node with 8 GPUs or 2 nodes with 4 GPUs) in Azure, you can start training with the following command in the repository root folder:
python hi-ml/src/health_ml/runner.py --model SSL.CRCK_SimCLR --cluster CLUSTER_NAME --conda_env hi-ml-cpath/environment.yml
A SimCLR run with 200 epochs, 8 GPUs, and a batch size of 48 (per GPU) takes about 6 hours. On Azure we use Standard_ND40rs_v2 (40 cores, 672 GB RAM, 2900 GB disk, 8 x NVIDIA Tesla V100).
Let’s have a look at the training behavior.
As mentioned previously, using the WSI label for each tile of the same slide and a linear head on the outputs of the image encoder to monitor training works quite well. We see a smooth and steady increase of the validation metric.
In addition, we are using a cosine learning rate schedule with a fixed warm up of 10 epochs. Note: The SSL code in hi-ml automatically scales the learning rate to the number of GPUs used during training, as described here.
Last, the training and validation loss curves are expected to look like this.
After training, we can use the pre-trained image encoder on downstream tasks like microsatellite stable/instable
prediction on TCGA-CRCk. You only have to specify the path to the checkpoint of the SSL image encoder in the setup
function of
DeepSMILECrck
.
Using your own dataset
For scripts that help you tile your own dataset please see
histopathology/preprocessing/tiling.py
.
In the case of TCGA-CRCk, the dataset is already tiled.
TcgaCrck_TilesDataset
is a child of
TilesDataset
.
For a
TilesDataset
we assume that each tile has a unique tile index and a label. Since we assume that we are working with weakly labelled
WSIs, we do not have access to the real tile labels. However, we found that monitoring the SSL training using the slide
label for each tile works sufficiently well, i.e., if the WSI has a positive label then every tile from the WSI has a
positive label. Last, the unique index for each tile is used to make sure we don’t use the same tile twice in one
training epoch during SSL training.
Subsequently, the TCGA-CRCk dataset is wrapped in
TcgaCrck_TilesDatasetReturnImageLabel
.
Here the data augmentations are applied and the __getitem__
method is defined.
The dataset is then wrapped one last time in
TcgaCrck_TilesDatasetWithReturnIndex
,
where we inherit the ability to return the tile index from
DatasetWithReturnIndex
.
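The return-index mechanism can be illustrated with a small sketch. This is not the actual DatasetWithReturnIndex code, just a minimal example of the wrapping pattern, assuming a PyTorch Dataset as the base:

from torch.utils.data import Dataset


class WithReturnIndex(Dataset):
    """Illustrative wrapper: return the sample index together with the sample,
    so that repeated tiles can be detected within a training epoch."""

    def __init__(self, base_dataset: Dataset) -> None:
        self.base_dataset = base_dataset

    def __len__(self) -> int:
        return len(self.base_dataset)

    def __getitem__(self, index: int):
        # The index travels alongside whatever the base dataset returns (e.g. image and label).
        return index, self.base_dataset[index]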
Digital Slide Archive
The Digital Slide Archive (DSA) is “a containerized web-based platform for the analysis, visualization, management and annotation of whole-slide digital pathology imaging data”. It is an open-source project based on Kitware’s data management platform Girder.
Azure deployment
We have deployed the DSA on Azure to visualize our data and interpret our models and experiments. Below are instructions to replicate our deployment using your own data.
Host
The first step is creating a Linux virtual machine (VM) to host the DSA. The code can be downloaded from GitHub and run using Docker. By default, the application runs at port 8080. To use HTTPS, an SSL certificate must be obtained, and port 443 may be specified instead within the Docker configuration. The DSA uses CherryPy as the underlying server engine; therefore, CherryPy must be configured to use the SSL certificate installed in the VM. Ports in a VM can be opened using network security group rules.
Storage
The datasets we use are securely stored in Azure Blob Storage containers. Containers can be mounted on the host machine using BlobFuse. Then, Girder assetstores can be created and fed data from the mounted containers.
Authentication
We have added Microsoft as a provider in the OAuth2 Login plugin for Girder. The authentication settings widget describes the steps needed to generate the provider credentials.
First, our deployed DSA must be registered in Azure Active Directory (Azure AD) as an application.
Once the DSA app has been registered, a client ID (also called application ID) will be generated. Then, a client secret (also called application password) must also be generated. Finally, the tenant ID (the Azure ID of your organization) may be specified so that access is restricted to users belonging to your tenant. These three strings will be used to configure authentication in the OAuth2 Login plugin settings.
Authorization
User permissions can be set to different data collections in an assetstore, to maximize protection of sensitive data.
Creating API Keys
You can create an API key for a user in DSA directly on the web.
Navigate to your DSA and log in
Click on “Users” in the navigation bar
Click on the user you want to create an API key for
On the top right, there is a menu “Actions”, choose “Edit user”
You will see 4 tabs with user information, click on “API Keys”
Choose a name for the key and a duration in days. It is important to choose “Allow all actions on behalf of this user”!
Click “Create”. You will see the list of keys, press “show” to reveal the actual key value. Copy it.
Set the key as an environment variable DSA_API_KEY (in bash, this would be export DSA_API_KEY=<the key>).
In addition, you can also set the URL for your DSA instance as an environment variable DSA_URL (in bash, this would be export DSA_URL=<the url>). See the sketch below for how a script can pick up these variables.
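As a minimal sketch, a Python script could read these settings as follows (the variable names match the ones above; the error handling is just an illustration):

import os

# Read the DSA connection settings from the environment, if they have been set.
dsa_url = os.environ.get("DSA_URL")
dsa_api_key = os.environ.get("DSA_API_KEY")
if not dsa_url or not dsa_api_key:
    raise RuntimeError("Set DSA_URL and DSA_API_KEY, or pass the values on the command line.")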
Visualizing Azure Machine Learning results
The Girder RESTful API may be used to upload annotations to DSA items programmatically.
An example of a use case is creating annotations to visualize attention maps generated by a deep learning model.
The girder
module in this repository includes tools to download training results from Azure Machine Learning (Azure ML) and
upload annotations to a deployed DSA.
For example:
SCRIPT="hi-ml-cpath/src/histopathology/utils/girder.py"
python $SCRIPT \
--run-id "Experiment_transformer_Gleason_run_0" \
--dsa-url "https://my-deployed-dsa.azure.com/" \
--dsa-key "AHKZ42Ks24kSH5Fxt3354ryKzCxamjqM" \
--workspace-config "config.json"
The DSA URL and API key may be specified in environment variables DSA_URL
and DSA_API_KEY
instead, see above.
The workspace configuration file contains information related to the Azure Machine Learning workspace:
{
"subscription_id": "22f7beb4-1b54-9ee7-5255-6dbcc8da000b",
"resource_group": "myresourcegroup",
"workspace_name": "myworkspace"
}
It can be downloaded from the workspace website:
The script uses the Azure SDK and the Python client for the Girder API to:
Log into Azure ML
Download training results
Generate JSON annotations for each slide
Search for the slide by slide ID in DSA by using full text search
Upload the annotations to the deployed DSA
For a full description of all the options, add --help
to the arguments.
Below is an example of an attention map overlaid on a slide from the PANDA dataset:
Upload into a folder
The girder.py
script can also upload the annotations into a folder in DSA. This is helpful if there are multiple
variants of the same slide in DSA, but located in different folders. Simple text search would then return multiple
results.
To use the folder functionality, you need to supply the name of the folder via `--folder`. For example, to upload the annotations to folder foo in collection Collection1, add `--folder Collection1/foo`.
When supplying a folder argument, searching for the slide where annotations will be added works differently (a sketch of this matching logic follows below):
- Firstly, all slides in the folder are retrieved.
- Then, the script tries to identify which of the slides in the folder contains the value from the slide_id field.
- If there is exactly one matching slide, the annotations are added to that slide. Annotations that match zero or more than one slide are ignored.
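The matching rule can be summarized in a short sketch. This is an illustration of the logic described above, not the actual girder.py code:

from typing import List, Optional


def find_unique_slide(slides_in_folder: List[str], slide_id: str) -> Optional[str]:
    """Return the single slide name that contains the given slide ID, or None if
    zero or more than one slide matches (those annotations are skipped)."""
    matches = [name for name in slides_in_folder if slide_id in name]
    return matches[0] if len(matches) == 1 else None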
Creating montages from whole slide images (WSIs)
For working with large amounts of histology data, it is often useful to create montages of the data.
A montage is a collection of images that are stitched together to form a single image.
Montages are useful for visualizing large amounts of data at once, in a single image that can then be used for analysis.
The hi-ml-cpath
toolbox contains scripts that help with the creation of montages from whole slide images (WSIs).
Creating montages can be very time-consuming. It can hence be helpful to run the process in the cloud. The montage creation code provided here can be run in AzureML very easily.
Types of data for montage creation
- Montages can be created from a folder of images, by specifying the name of the folder and a glob pattern, like **/foo_*.tiff.
- Montages can be created by first reading a file called dataset.csv located in a folder. dataset.csv is effectively a Pandas DataFrame, with each row corresponding to a single image (a sketch of how to inspect such a file is shown after this list). More details on the format of dataset.csv can be found below.
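As a quick sketch of what that means in practice, the file can be inspected with pandas (the column names below are taken from the example dataset file further down; the path is hypothetical):

import pandas as pd

# Each row of dataset.csv describes one image that will appear in the montage.
dataset = pd.read_csv("/data/dataset.csv")
print(dataset[["image", "slide_id", "label"]].head())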
Montage creation works for all WSI image formats that are supported by either of the two possible backends:
Setup
Check out the
hi-ml
repository viagit clone https://github.com/microsoft/hi-ml
Run the following commands:
cd hi-ml-cpath
make env
conda activate HimlHisto
make pip_local
All the commands listed below assume that
- you have activated the Conda environment HimlHisto
- your current working directory is <repo root>/hi-ml-cpath
Creating montages from a folder with files
The following command will create a montage from all files in the folder /data
that match the pattern
**/*.tiff
.
python src/health_cpath/scripts/create_montage.py --dataset /data --image_glob_pattern '**/*.tiff' --level 2 --width 1000 --output_path montage1
This will create a montage from all TIFF files in folder /data
. Each TIFF file is read as a multi-level image, and
level 2 is read for creating the montage.
The --width
argument determines the width in pixels of the output image. The height of the output image is determined
automatically. Note that the width given here, 1000, is suitable only for montages from a very small number of files
(say, 10). See below for more details on this commandline option.
This will create two images in the folder given by the --output_path montage1 argument: montage1/montage.jpg and montage1/montage.png.
Here’s an example of how this could look for a folder with 6 images, 0.tiff through 5.tiff:
Creating montages from a dataset.csv
file
If the montage creation script is only pointed to a folder, without providing a glob pattern,
it assumes that a file dataset.csv
is present. A montage will be created from only the images
listed in dataset.csv
. In addition, the value of an optional label column will be added to the text that is overlaid onto each image.
The dataset file should be a CSV file, with each row corresponding to a single image.
When working with a dataset.csv
file, the following columns are handled:
| Column name | Contents | Required? |
|---|---|---|
| image | The path of the image that should be loaded | Required |
| slide_id | A unique identifier for the slide | Required |
| label | An additional string that will be placed on the montage | Optional |
| | The path of an additional image that will be rendered next to the image given in the image column | Optional |
Consider this example dataset file:
image,slide_id,label
2.tiff,ID 2,Label 2
3.tiff,ID 3,Label 3
4.tiff,ID 4,Label 4
5.tiff,ID 5,Label 5
Run montage creation with the following command:
python src/health_cpath/scripts/create_montage.py --dataset /data --level 2 --width 1000 --output_path montage1
This would produce (assuming that the images 2.tiff
, 3.tiff
, 4.tiff
, and 5.tiff
are present in the folder
/data
) a montage similar to this one:
Using inclusion or exclusion lists
When creating montages from a dataset.csv
file, it is possible to create montages from only a specific subset
of rows, or all rows apart from those in a given list.
- Use the --exclude_by_slide_id exclude.txt argument to point to a file with a list of slide IDs that should be excluded from the montage.
- Use the --include_by_slide_id include.txt argument to point to a file with a list of slide IDs for which the montage should be created.
The files exclude.txt
and include.txt
should contain one slide ID per line.
Other commandline options
- Use --width=20000 to set the width of the output montage image. The height of the output image is determined automatically. Mind that the default value is 60_000, suitable for several hundred input images. If you want to try out montage creation on a small set of files (say, 10), ensure that you set the width to a reasonably small value, like --width=1000.
- Use --parallel=2 to specify the number of parallel processes that should be used for creating image thumbnails. Thumbnails are created in a first step, using multiple processes, and then the thumbnails are stitched into the final montage in the main process.
- Use --backend=cucim to switch the image reader backend to CuCIM. The default backend is openslide.
Running in Azure
The create_montage.py
script can be run in AzureML by adding 3 commandline arguments.
To set up Azure and AzureML:
- Follow the steps in the AzureML onboarding. At the end of the onboarding you will download a file config.json from your AzureML workspace to your repository root folder.
- To understand datasets, please read through the AzureML datasets documentation. Then create an AzureML datastore that points to your Azure Blob Storage account.
- Upload your WSIs to a folder in Azure Blob Storage. This can be done most efficiently via azcopy. azcopy can also copy directly across cloud providers, for example from AWS to Azure.
The following command will upload all files in the folder my_test_slides
to a container datasets
in your Azure Blob
Storage account called mystorage
, creating a folder my_test_slides
in the storage account in the process:
azcopy copy my_test_slides https://mystorage.blob.core.windows.net/datasets/ --recursive
The following command will then create a run in AzureML that executes montage creation from that folder:
python src/health_cpath/scripts/create_montage.py --dataset my_test_slides --level 2 --width 1000 --cluster <clustername> --conda_env environment.yml --datastore <datastorename>
In this command, replace the following:
- Replace my_test_slides with the name of the folder in blob storage where you uploaded your WSIs.
- clustername is the name of a compute cluster where your job will execute.
- datastorename is the name of an AzureML datastore, essentially a pointer to your blob storage account plus the credentials that are necessary to access it. For the above example, the datastore needs to point to storage account mystorage and container datasets.
The command above will only run for a minute or less - it will mostly create a snapshot of the code and send that off to the cloud for execution. At the end you will see a link printed out that takes you to the AzureML portal, where you can monitor the progress of the run.
Once the run is completed, you will find the two files montage.jpg and montage.png in the “Outputs” tab of the run, together with an option to download them to your machine.
Histopathology Scripts
Please see links below for details on the available arguments for each of the histopathology scripts.
create_montage
This script can create a high-resolution montage of all images in a folder or dataset.
Create an overview image with thumbnails of all slides in a dataset.
usage: python create_montage.py --dataset <dataset_folder> --image_glob_pattern '**/*.tiff' --width 1000
Named Arguments
- --cluster
The name of the GPU or CPU cluster inside the AzureML workspace that should execute the job. To run on your local machine, omit this argument.
Default: “”
- --datastore
The name of the AzureML datastore where the dataset is defined.
Default: “”
- --dataset
The name of the AzureML dataset to use for creating the montage. The dataset will be mounted automatically. Use an absolute path to a folder on the local machine to bypass mounting.
Default: “”
- --conda_env
The Conda environment file that should be used when submitting the present run to AzureML. If not specified, the hi-ml-cpath environment file will be used.
Default: hi-ml/hi-ml-cpath/environment.yml
- --wait_for_completion
If True, wait for AML Run to complete before proceeding. If False, submit the run to AML and exit
Default: False
- --docker_shm_size
The shared memory in the Docker image for the AzureML VMs.
Default: “100g”
- --workspace_config_path
The path to the AzureML workspace configuration file. If not specified, the configuration file in the current folder or one of its parents will be used.
- --display_name
The display name of the AzureML run. If not specified, a default name will be used.
Default: “”
- --level
Resolution downsample level, e.g. if lowest resolution is 40x and the available downsample levels are [1.0, 4.0, 16.0] then level = 1 corresponds to 10x magnification
Default: 1
- --exclude_by_slide_id
Provide a file that contains slide IDs that should be excluded. File format is CSV, the first column is used as the slide ID. If the file is empty, no slides will be excluded.
- --include_by_slide_id
Provide a file that contains slide IDs that should be included. File format is CSV, the first column is used as the slide ID. If the file is empty, no montage will be produced.
- --image_glob_pattern
When provided, use this pattern in rglob to find the files that should be included in the montage. Example: ‘**/*.tiff’ to find all TIFF files recursively. You may have to escape the pattern in your shell.
Default: “”
- --width
The width of the montage in pixels
Default: 60000
- --output_path
The folder where the montage will be saved
Default: outputs
- --parallel
The number of parallel processes to use when creating the montage.
Default: 8
- --backend
The backend to use for reading the slides. Can be ‘openslide’ or ‘cucim’
Default: “openslide”
HI-ML Multimodal Toolbox
This toolbox provides models for multimodal health data. The code is available on GitHub and Hugging Face 🤗.
Getting started
The best way to get started is by running the phrase grounding notebook and the examples. All the dependencies will be installed upon execution, so Python 3.9 and Jupyter are the only requirements to get started.
The notebook can also be run on Binder, without the need to download any code or install any libraries:
Installation
The latest version can be installed using pip
:
pip install --upgrade hi-ml-multimodal
Development
For development, it is recommended to clone the repository and set up the environment using conda
:
git clone https://github.com/microsoft/hi-ml.git
cd hi-ml-multimodal
make env
This will create a conda
environment named multimodal
and install all the dependencies to run and test the package.
You can visit the API documentation for a deeper understanding of our tools.
Examples
For zero-shot classification of images using text prompts, please refer to the example script that utilises a small subset of Open-Indiana CXR dataset for pneumonia detection in chest X-ray images. Please note that the examples and models are not intended for deployed use cases (commercial or otherwise), which is currently out-of-scope.
Hugging Face 🤗
While the GitHub repository provides examples and pipelines to use our models, the weights and model cards are hosted on Hugging Face 🤗.
Credit
If you use our code or models in your research, please cite our recent ECCV and CVPR papers:
Boecking, B., Usuyama, N. et al. (2022). Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_1
Bannur, S., Hyland, S., et al. (2023). Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. In: CVPR 2023.
BibTeX
@InProceedings{10.1007/978-3-031-20059-5_1,
author="Boecking, Benedikt and Usuyama, Naoto and Bannur, Shruthi and Castro, Daniel C. and Schwaighofer, Anton and Hyland, Stephanie and Wetscherek, Maria and Naumann, Tristan and Nori, Aditya and Alvarez-Valle, Javier and Poon, Hoifung and Oktay, Ozan",
editor="Avidan, Shai and Brostow, Gabriel and Ciss{\'e}, Moustapha and Farinella, Giovanni Maria and Hassner, Tal",
title="Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing",
booktitle="Computer Vision -- ECCV 2022",
year="2022",
publisher="Springer Nature Switzerland",
address="Cham",
pages="1--21",
isbn="978-3-031-20059-5"
}
@inproceedings{bannur2023learning,
title={Learning to Exploit Temporal Structure for Biomedical Vision{\textendash}Language Processing},
author={Shruthi Bannur and Stephanie Hyland and Qianchu Liu and Fernando P\'{e}rez-Garc\'{i}a and Maximilian Ilse and Daniel C. Castro and Benedikt Boecking and Harshita Sharma and Kenza Bouzid and Anja Thieme and Anton Schwaighofer and Maria Wetscherek and Matthew P. Lungren and Aditya Nori and Javier Alvarez-Valle and Ozan Oktay},
booktitle={Conference on Computer Vision and Pattern Recognition 2023},
year={2023},
url={https://openreview.net/forum?id=5jScn5xsbo}
}
Training of self-supervised models
The code present in the InnerEye/ML/SSL folder allows you to train self-supervised models using SimCLR or BYOL. This code runs as a “bring-your-own-model” self-contained module (see docs/bring_your_own_model.md).
Here, we provide implementations for four datasets to get you kickstarted with self-supervised models:
Medical Chest-Xray datasets: RSNA Pneumonia Detection Challenge data (30k scans, labels indicating presence of pneumonia), NIH Chest-Xray (112k Chest-Xray scans) or CheXpert (228k scans).
Multi-dataset support
During self-supervised training, a separate linear classifier is trained on top of learnt image embeddings. In this way,
users can continuously monitor the representativeness of learnt image embeddings for a given downstream classification
task. More importantly, the framework allows users to specify multiple datasets and data loaders for SimCLR/BYOL
training and evaluation. For instance, a BYOL encoder can be learnt using a dataset that does not contain any target
labels and embeddings can be evaluated throughout training on a separate dataset containing class labels. To enable this
functionality, our SSLContainer takes two dataset names parameters: ssl_training_dataset_name
to indicate which
dataset to use for SSL training and linear_head_dataset_name
to indicate which dataset to use for the classification task that monitors embedding quality during training.
Quick start guide
Here we describe how to quickly start a training job with our ready-made configs.
Example 1: training a SimCLR or BYOL model on CIFAR10
To kick off training of a SimCLR or BYOL model on CIFAR10, simply run
python ML/runner.py --model=CIFAR10BYOL
python ML/runner.py --model=CIFAR10SimCLR
For this dataset, the runner will automatically download the data to your machine prior to starting training.
Example 2: training a BYOL model on Chest-Xray data
Step 0: Get the data
If you run on your local machine:
Prior to starting training a model on this dataset, you will need to download it from Kaggle to your machine:
- To use the RSNA Pneumonia Detection Challenge data: please download from here. Make sure to download all images and the dataset.csv file to your data folder. Please note that the labels are here merely used for monitoring purposes.
- To get the NIH dataset: please download from here. Make sure you also include the csv files in your data folder; the code assumes the dataset files and all the images lie in your data folder as downloaded from Kaggle. In particular, do not modify the original csv filenames (e.g. the code expects to find the Data_Entry_2017.csv file within the data directory).
- To get the CheXpert data: please download from here; the code assumes the dataset files and all the images lie in your data folder as downloaded.
If you run on AML
In order to train models on AML you will need to upload the datasets listed above to your storage account and get their dataset_id to pass to your model config.
Step 1: Update your model config
We provide sample configs to train models on NIH data both for BYOL and SimCLR. You can use them as they are except for the dataset location fields:
- If you’re running locally, set the local_dataset parameter to point to your data folder.
- If you’re running on AML: you need to update the RSNA_AZURE_DATASET_ID and NIH_AZURE_DATASET_ID variables to point to the name of the NIH and Kaggle dataset in your own workspace.
Step 2: Launch the training job
Example to train an SSL model with BYOL on the NIH dataset and monitor the embeddings quality on the Kaggle RSNA Pneumonia Challenge classification task:
python ML/runner.py --model=NIH_RSNA_BYOL
Configuring your own SSL models:
About SSLContainer configuration
All SSL models are derived from the SSLContainer class. See the config classes in ML/configs/ssl for some examples of specific model configurations (all derived from this container).
If you wish to create your own model config for SSL training, you will need to create a child class and parametrize it with the following available arguments (a hypothetical sketch follows after the list):
- ssl_training_dataset_name: the name of the dataset to train the SSL encoder on, a member of the SSLDatasetName class (don’t forget to update this class if you’re adding a new dataset ;)),
- linear_head_dataset_name: the name of the dataset used to train the linear head on top of the encoder for monitoring purposes,
- azure_dataset_id: the id of the AML dataset to use for SSL training,
- extra_azure_dataset_ids: dataset_id to use for linear head training, expected to be provided as a list [data-id],
- ssl_encoder: name of the encoder to train, member of the EncoderName class; currently supported are resnet50, resnet101 and densenet121,
- ssl_training_type: which SSL algorithm to use, member of SSLType, choice between BYOL and SimCLR,
- ssl_training_batch_size: batch size of SSL training. This is the number of examples processed by a single GPU. Multiply this by the number of GPUs to get the effective batch size.
- linear_head_batch_size: batch size for linear head training (used to monitor the quality of the SSL embeddings). This is the number of examples processed by a single GPU. Multiply this by the number of GPUs to get the effective batch size.
- ssl_augmentation_config: path to the yaml config for the augmentations to use during SSL training. Only used for NIH/Kaggle datasets.
- linear_head_augmentation_config: path to the yaml config for the augmentations to use for linear head training. Only used for NIH/Kaggle datasets,
- use_balanced_binary_loss_for_linear_head: whether to use a balanced loss for linear head training,
- random_seed: seed for the run,
- num_epochs: number of epochs to train for.
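As a hypothetical sketch, a child config could look roughly like the following. The import locations and enum member names are assumptions and may differ in the actual codebase; the parameter names are the ones listed above.

# Hypothetical sketch only: the exact import locations of SSLContainer, SSLDatasetName,
# EncoderName and SSLType inside InnerEye/ML/SSL are assumptions here.
from InnerEye.ML.SSL.lightning_containers.ssl_container import (
    EncoderName, SSLContainer, SSLDatasetName, SSLType)


class MyCXRBYOL(SSLContainer):
    def __init__(self) -> None:
        super().__init__(
            ssl_training_dataset_name=SSLDatasetName.NIHCXR,        # dataset used to train the encoder
            linear_head_dataset_name=SSLDatasetName.RSNAKaggleCXR,  # dataset used to monitor embeddings
            azure_dataset_id="my-nih-dataset",                      # hypothetical AML dataset name
            ssl_encoder=EncoderName.resnet50,
            ssl_training_type=SSLType.BYOL,
            ssl_training_batch_size=256,
            linear_head_batch_size=64,
            random_seed=1,
            num_epochs=1000,
        )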
In case you wish to first test your model locally, here are some optional arguments that can be useful:
- local_dataset: path to a local dataset; if passed, the azure dataset will be ignored,
- is_debug_model: if True, it will only run on the first batch of each epoch,
- drop_last: if False (True by default), it will also keep the last batch if incomplete.
Creating your own datamodules:
To use this code with your own data, you will need to:
- Define your own Lightning container that inherits from SSLContainer, as described in the paragraph above.
- Create a dataset class that reads your new dataset, inheriting from both VisionDataset and DatasetWithReturnIndex; see for example how we constructed the RSNAKaggleCXR class (a hypothetical sketch is shown after this list). WARNING: the first positional argument of your dataset class constructor MUST be the data directory (“root”), as VisionDataModule expects this in the prepare_data step.
- In your own container, update the DatasetToClassMapping member of the class so that the code knows which data class to associate with your new dataset name.
- Create a yaml configuration file that contains the augmentations specific to your dataset. The yaml file will be consumed by the create_transforms_from_config function defined in the InnerEye.ML.augmentations.transform_pipeline module (see next paragraph for more details). Alternatively, overwrite the _get_transforms method. To simplify this step, we have defined a series of standard operations in SSL/transforms_utils.py. You could for example construct a transform pipeline similar to the one created inside create_transform_from_config inside your own method.
- Update all necessary parameters in the model config (cf. previous paragraph).
Once all these steps are updated, the code in the base SSLContainer class will take care of creating the corresponding datamodules for SSL training and linear head monitoring.
About the augmentation configuration yaml file
The augmentations used for SSL training of all Chest X-ray models are parametrized via a yaml config file. The path to this config has to be passed in the model config (cf. section above). We provide two default configs: cxr_ssl_encoder_augmentations.yaml is used to define the augmentations used for BYOL/SimCLR training; the cxr_linear_head_augmentations.yaml config defines the augmentations to be used for the training of the linear head (used for monitoring purposes). The meaning of each config argument is detailed in ssl_model_config.py.
WARNING: this file will be ignored for CIFAR examples, where we use the default pl-bolts augmentations.
Finetuning a linear head on top of a pretrained SSL model.
Alongside the modules to train your SSL models, we also provide example modules that allow you to build a classifier on top of a pretrained SSL model. The base class for these modules is SSLClassifierContainer. It builds on top of the SSLContainer with additional command line arguments allowing you to specify where to find the checkpoint for your pretrained model. For this you have two options:
- If you are running locally, you can provide the local path to your pretrained model checkpoint via --src_checkpoint.
- If you are running on AML, use the pretraining_run_recovery_id field. Providing this field means that AML will automatically download the checkpoints to the current node and pick the latest checkpoint to build the classifier on top of. Beware not to confuse pretraining_run_recovery_id with run_recovery_id, as the latter is used to continue training on the same model (which is not the case here).
The code will then automatically extract the encoder part of the loaded SSL model to initialize your classifier. You can
then also choose whether you want to freeze the weights of your encoder or not via the --freeze_encoder=True/False
argument. By default, this is set to True.
Example for CIFAR
We provide an example of such a classifier container for CIFAR named SSLClassifierCIFAR
. To launch a finetuning run
for this model on CIFAR10, just run
python ML/runner.py --model=SSLClassifierCIFAR --pretraining_run_recovery_id={THE_ID_TO_YOUR_SSL_TRAINING_JOB}
Example for CXR
Similarly, we provide a class that allows you to easily start a finetuning job for CXR models, CXRImageClassifier. By default, this will launch a finetuning job on the RSNA Pneumonia dataset. To start the run:
python ML/runner.py --model=CXRImageClassifier --pretraining_run_recovery_id={THE_ID_TO_YOUR_SSL_TRAINING_JOB}
or for a local run
python ML/runner.py --model=CXRImageClassifier --src_checkpoint={LOCAL_PATH_TO_YOUR_SSL_CHECKPOINT}
Contributing to this toolbox
We welcome all contributions that help us achieve our aim of speeding up Machine Learning (ML) / AI research in health and life sciences.
Examples of contributions are
Data loaders for specific health & life sciences data
Network architectures and components for deep learning models
Tools to analyze and/or visualize data
Bug reports
Documentation improvements, including fixing typos
Suggestions about codebase improvements
All contributions to the toolbox need to come with unit tests, and will be reviewed when a pull request is started.
If in doubt, reach out to the core hi-ml
team before starting your work.
Please look through the existing folder structure to find a good home for your contribution.
For a full set of design and guidelines, please check our coding guidelines documentation.
Design and Coding Guidelines
Design Phase
For non-trivial changes, please communicate your plans early on to reach maximum impact and to avoid duplicate work. Create an issue or a discussion that describes the problem that you are trying to solve, and describe why this adds a positive contribution to the project.
Tag one of the core hi-ml
team (@ant0nsc, @mebristo, @fepegar) to get input on your problem.
In particular, you should find out if this has already been worked on or thought about.
This will help you to get to know about existing code fragments that can simplify your task, and integrate better with existing code.
We recommend working with a core hi-ml team member to create a design for your work.
Depending on the size of the task at hand, a design can be anything from a short description of your plan in a GitHub discussion or issue to a shared document where you list your design options and invite feedback.
Your design partner will later also be well positioned to review your pull request (PR) because they will be already
familiar with your plan.
When working on a large contribution, we suggest breaking it down into a set of small PRs that are more manageable for you and for the reviewers (see the section “Scope of PRs” below).
Setting up your development environment
Please see our detailed instructions.
Coding Guidelines
Naming
Please follow the general PEP 8 Python rules for naming:
- Variables, function and method names should use snake_case (lower case, with underscores separating words).
- Class names should follow PascalCase (first letter of each word capitalized).
To improve readability, functions or methods that return boolean values should follow an is_..., has_..., use... pattern, like is_status_ok instead of ok (see the short example below).
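A short illustration of these conventions (the names are made up for the example):

class ImageLoader:                   # class names use PascalCase
    def is_status_ok(self) -> bool:  # boolean-returning methods use an is_/has_ prefix
        return True


max_retry_count = 3                  # variables use snake_case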
Static Analysis, Linting and Styling
We use flake8
as a linter, mypy
and pyright for static typechecking, and black
for styling. All these tools run as part of the PR workflow, and must run without errors for a contribution to be accepted.
mypy
requires that all functions and methods carry type annotations.
See the mypy
documentation for more information.
We highly recommend running all these tools before pushing the latest changes to a PR.
If you have make installed, you can run all of these tools in one go via make check (from the repository root folder).
Black styling
We use black
for styling all of our code. To pass the tests carried out in make check
(or make black
) you can run the following command to reformat your changed file as per our guidelines:
black <path_to_python_file>
You can also run black
on the whole repository by running the following command from the head of the repo:
black .
Black VS Code Integration
If you’re using VS Code as your IDE, you can avoid having to call black
every time you want to style some files by installing the black
formatter extension and adding the following to your vscode settings.json
file:
"[python]": {
"editor.formatOnSave": true,
"editor.defaultFormatter": "ms-python.black-formatter",
},
Now every time you save a file it will automatically be formatted by black
.
Black pre-commit hook
If you do not style your code yourself, then the pre-commit hook on your PR will attempt to do this automatically. Please be aware that this may cause conflicts if you’re working on an open pull request. We highly recommend styling your code before submitting!
Documentation
For general information around docstrings, please check PEP 257.
We use Sphinx for generating documentation.
We recommend using a VSCode extension for auto-generating documentation templates, for example njpwerner.autodocstring
.
This extension is already pre-configured in our VSCode workspace settings file.
Reminders about docstrings:
- There must be a separating line between a function description and the documentation for its parameters.
- In multi-line parameter descriptions, continuations on the next line must be indented.
- Sphinx will merge the class description and the arguments of the __init__ method. Hence, there is no need to write any text in the docstring for __init__ other than the arguments.
- Use >>> to include code snippets.
To generate the Sphinx documentation on your machine, run
cd docs
make html
Then open the results in build/html/index.html
.
Example:
class Foo:
    """This is the class description.

    The following block will be pretty-printed by Sphinx. Note the space between >>> and the code!

    Usage example:
        >>> from module import Foo
        >>> foo = Foo(bar=1.23)
    """

    ANY_ATTRIBUTE = "what_ever."
    """Document class attributes after the attribute. Should end with a period."""

    def __init__(self, bar: float = 0.5) -> None:
        """
        :param bar: This is a description for the constructor argument.
            Long descriptions should be indented.
        """
        self.bar = bar

    def method(self, arg: int) -> None:
        """Short method description, followed by an empty line. Sentences should end with a period.

        Longer descriptions can be added as well.
        Argument names like ``arg`` are rendered nicely if enclosed in double backticks.

        :param arg: This is a description for the method argument.
            Long descriptions should be indented.
        :raises ValueError: If something bad happens.
        """
        do_something()  # comments should start with " # "
For more information, check the Sphinx-RTD tutorial.
Testing
Why do we write unit tests?
If you think that there are sensitive parts in your code, where hidden errors can impact the project outcome, please take your time to protect parts of your code. As the team grows, so does the number of users of the repository, and we may not be able to monitor changes on a frequent basis, but your tests can ensure that the expected behavior is preserved. We are not making it strictly mandatory, however, contributors should take into account that their code may have negative downstream impact.
We do not need to test every part of the code depending on the project constraints, e.g., quality and test-coverage vs learning and hypothesis-testing. However, we expect a justification in each PR for this decision and it should be reviewed by team members. Please regularly monitor the code coverage results that are posted as comments on each pull request.
In particular, please do write unit tests for anything that affects training /results, such as data preprocessing, losses, and metrics.
Testing Do’s / Don’ts
- DO write unit tests for each new function or class that you add.
- DO extend unit tests for existing functions or classes if you change their core behaviour.
- DO try your best to write unit tests that are fast. Very often, this can be done by reducing data size to a minimum. Also, it is helpful to avoid long-running integration tests, but try to test at the level of the smallest involved function.
- DO ensure that your tests are designed in a way that they can pass on the local machine, even if they are relying on specific cloud features. If required, use unittest.mock to simulate the cloud features, and hence enable the tests to run successfully on your local machine (a sketch follows after this list).
- DO run all unit tests on your dev machine before submitting your changes. The test suite is designed to pass completely also on your development machine.
- DO NOT rely only on the test results in the cloud (i.e., run tests locally before submitting). Apart from the obvious delay in getting your results, CI runs trigger AzureML runs on GPU machines that have a far higher CO2 footprint than your dev machine.
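As a minimal sketch of the mocking approach (the function and the storage client below are hypothetical, not part of hi-ml):

from unittest import mock


def upload_results(storage_client) -> str:
    """Hypothetical function under test: uploads a metrics file and returns its URL."""
    return storage_client.upload("outputs/metrics.json")


def test_upload_results_without_cloud_access() -> None:
    # Replace the cloud client with a mock so the test runs entirely on the local machine.
    fake_client = mock.MagicMock()
    fake_client.upload.return_value = "https://example.blob.core.windows.net/outputs/metrics.json"
    url = upload_results(fake_client)
    assert url.endswith("metrics.json")
    fake_client.upload.assert_called_once_with("outputs/metrics.json")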
When fixing a bug, the suggested workflow is to first write a unit test that shows the invalid behaviour, and only then start to code up the fix.
Testing of Scripts
Depending on the project needs, we may write scripts for doing one-off operations (as an example, have a look at himl_download.py). How carefully should we test those?
Any scripts that perform operations that we anticipate others to need as well should be designed for re-usability and be tested. “Designed for re-use” here would mean, for example: The script should not contain any hard-coded setup that is specific to my machine or user account. If that’s not achievable, document carefully what others need to prepare so that they can run this script.
Scripts are inherently difficult to test. Testing becomes a lot easier if the script is mainly a front-end to functions that live somewhere else in the codebase. If the script is written as a thin wrapper around library functions, these library functions can be tested in isolation as part of the normal unit test suite.
Writing argument parsers is particularly error prone.
Consider using automatically generated parsers, like we do in hi-ml (a generic sketch of the idea follows below):
Starting point is a class that describes inputs to a function.
A parser can be generated automatically from these classes.
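As a generic illustration of that idea (this is not the hi-ml helper itself, just a sketch using the standard library):

import argparse
from dataclasses import dataclass, fields


@dataclass
class MontageOptions:
    """Describes the inputs of a hypothetical script."""
    dataset: str = ""
    width: int = 60000
    backend: str = "openslide"


def create_parser(options_class) -> argparse.ArgumentParser:
    # Generate one commandline argument per field, using the field's type and default.
    parser = argparse.ArgumentParser()
    for field in fields(options_class):
        parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    return parser


args = create_parser(MontageOptions).parse_args(["--width", "1000"])
print(args.width)  # prints 1000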
Lastly, if you are considering adding a script to your project, also consider the following: If the script should be called, for example, after an AzureML run to collect results, can this be automated further, and make the script obsolete? Reasons for this approach:
People tend to forget that there is a script to do X already, and may re-do the task in question manually.
Any script that requires input from the user also has a chance to be provided with the wrong input, leading to friction or incorrect results. In a programmatic scenario, where the script is called automatically, this chance of errors is greatly minimized.
What not to commit
DO NOT check in files taken from or derived from private datasets.
DO NOT check in any form of credentials, passwords or access tokens.
DO NOT check in any code that contains absolute paths (for example, paths that only work on your machine).
Avoid checking in large files (anything over 100 kB). If you need to commit large files, consider adding them via Git LFS.
Jupyter Notebooks
Notebooks can easily become obsolete over time and may not work with code changes. Also, testing them is generally difficult. Please follow these guidelines for notebooks:
If you are planning to use a notebook, avoid committing into the repository unless it is part of a project demo.
If the notebook is used to document results, then please render the notebook and place the results in a separate document outside the repository.
Bear in mind that notebooks are also an easy way of leaking sensitive data: They can for example contain images derived from private datasets. You should clear the notebook outputs before committing it.
Review / Pull Requests
Scope of Pull Requests
PRs should ideally implement a single change. If in doubt, err on the side of making the PR too small, rather than too big.
Small PRs help reduce the load on the reviewer.
Avoid adding unrelated changes to an existing PR.
PRs should be modular: we can iterate on PRs, and any positive delta is a contribution.
Please follow the guidance on Github flow.
Try gauging the value of your contribution for yourself by asking the following questions:
Will this change bring the team closer to achieving their project goals?
Will someone else understand my code?
Will they be able to use my code?
Will they be able to extend or build on top of my code?
Pull Request Process
When creating a PR, do add a summary of your contribution in the PR description.
The template PR description also contains a checklist for the PR author.
Link your PR to a GitHub issue that describes the problem/feature that you are working on.
For collecting early feedback on your work, please use a Draft Pull Request. These PRs are marked with a grey icon in the Github UI, and send a clear signal that the code there is not yet ready for review. When submitting a draft PR, all the checks will be run as for a normal PR.
Once your work is ready for review, click the “Ready for Review” button on the Github PR page, and, if you want, assign reviewers for your PR.
Pull Request Titles
To enable good auto-generated changelogs, we prefix all PR titles with a category string, like BUG: Fix out of bounds error when using small images
.
Those category prefixes must be in upper case, followed by a colon (:
).
Valid categories are
- ENH for enhancements, new capabilities
- BUG for bugfixes
- STY for stylistic changes (for example, refactoring) that do not impact the functionality
- DOC for changes to documentation only
- DEL for removing something from the codebase
- TEST for adding or modifying tests
- FIX to fix something that is not a BUG, such as a typo
- MNT for maintenance: upgrading package versions, packaging, linting, CI, etc.
- PERF for performance-related changes
Notes on Branching
We should treat the main branch as a collection of code that is
- of high quality
- readable and suitable for re-use
- well documented
Use a dedicated dev branch for development, debugging and analysis work. Once you are sure that the work in the dev branch adds enough value to reach our project objectives, then use a pull request to get your contribution into the main branch.
Setting up the Development Environment
Development environment
We suggest using Visual Studio Code (VSCode), available for multiple platforms here.
On Windows systems, we recommend using WSL, the Windows Subsystem for Linux, because some PyTorch features are not available on Windows.
Inside VSCode, please install the extensions that are recommended for this project - they are available in .vscode/extensions.json
in the
repository root.
Opening the repository
Once you have the repository on your computer, you can open either all projects at once or individual projects separately in VSCode.
- To open all projects at once, use VSCode’s “Open Workspace from File”, and select himl-projects.code-workspace.
- To open individual projects, use VSCode’s “Open Folder”, and select one of the folders hi-ml-azure, hi-ml, or hi-ml-cpath.
Creating a Conda environment
Different projects in this repository use different Conda environments:
- The himl Conda environment should be used for work on the hi-ml and hi-ml-azure projects.
- The HimlHisto Conda environment should be used for work on hi-ml-cpath.
Please select the right Python interpreter for your project (or all projects if using the himl-projects
workspace)
inside VSCode, by choosing “Python: Select Interpreter” from the command palette (Ctrl-Shift-P on VSCode for Windows)
To create the Conda environment himl
, please use either
conda env create --file hi-ml/environment.yml
or use make
in the repository root folder:
make env
Please see the project-specific README files for instructions how to set up the other Conda environments.
Installing pyright
We are using static typechecking for our code via mypy
and pyright
. The latter requires a separate installation
outside the Conda environment. For WSL, these are the required steps (see also
here):
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
Close your terminal and re-open it, then run:
nvm install node
npm install -g pyright
Using specific versions of hi-ml
in your Python environments
If you’d like to test specific changes to the hi-ml
package in your code, you can use two different routes:
You can clone the
hi-ml
repository on your machine, and usehi-ml
in your Python environment via a local package install:
pip install -e <your_git_folder>/hi-ml
You can consume an early version of the package from
test.pypi.org
viapip
:
pip install --extra-index-url https://test.pypi.org/simple/ hi-ml==0.1.0.post165
If you are using Conda, you can add an additional parameter for pip into the Conda environment.yml file like this:
name: foo
dependencies:
- pip=20.1.1
- python=3.7.3
- pip:
- --extra-index-url https://test.pypi.org/simple/
- hi-ml==0.1.0.post165
Common things to do
The repository contains a makefile with definitions for common operations.
- make check: Run flake8, mypy and black on the repository.
- make test: Run flake8 and mypy on the repository, then all tests via pytest.
- make pip: Install all packages for running and testing in the current interpreter.
- make conda: Update the hi-ml Conda environment and activate it.
Building documentation
To build the sphinx documentation, you must have sphinx and related packages installed
(see build_requirements.txt
in the repository root). Then run:
cd docs
make html
This will build all your documentation in docs/build/html
.
Setting up your AzureML workspace
In the browser, navigate to the AzureML workspace that you want to use for running your tests.
In the top right section, there will be a dropdown menu showing the name of your AzureML workspace. Expand that.
In the panel, there is a link “Download config file”. Click that.
This will download a file config.json. Move that file to both of the folders hi-ml/testhiml and hi-ml/testazure. The file config.json is already present in .gitignore, and will hence not be checked in.
Creating and Deleting Docker Environments in AzureML
- Passing a docker_base_image into submit_to_azure_if_needed causes a new image to be built and registered in your workspace (see docs for more information).
- To remove an environment, use the az ml environment delete function in the AzureML CLI (note that all the parameters need to be set, none are optional).
Testing
For all of the tests to work locally you will need to cache your AzureML credentials. One simple way to do this is to
run the example in src/health/azure/examples
(i.e. run python elevate_this.py --message='Hello World' --azureml
or
make example
) after editing elevate_this.py
to reference your compute cluster.
When running the tests locally, they can either be run against the source directly, or the source built into a package.
- To run the tests against the source directly in the local src folder, ensure that there is no wheel in the dist folder (for example by running make clean). If a wheel is not detected, then the local src folder will be copied into the temporary test folder as part of the test process.
- To run the tests against the source as a package, build it with make build. This will build the local src folder into a new wheel in the dist folder. This wheel will be detected and passed to AzureML as a private package as part of the test process.
Test discovery in VSCode
All tests in the repository should be picked up automatically by VSCode. In particular, this includes the tests in the hi-ml-cpath
folder, which
are not always necessary when working on the core hi-ml
projects.
Creating a New Release
To create a new package release, follow these steps:
- On the repository’s GitHub page, click on “Releases”, then “Draft a new release”.
- In the “Draft a new release” page, click “Choose a tag”. In the text box, enter a (new) tag name that has the desired version number, plus a “v” prefix. For example, to create package version 0.12.17, create a tag v0.12.17. Then choose “+ Create new tag” below the text box.
- Enter a “Release title” that highlights the main feature(s) of this new package version.
- Click “Auto-generate release notes” to pull in the titles of the Pull Requests since the last release.
- Before the auto-generated “What’s changed” section, add a few sentences that summarize what’s new.
- Click “Publish release”.
Troubleshooting
Debugging a test in VSCode fails on Windows
Symptom: Debugging just does not seem to do anything
Check: Debug Console shows error
from _sqlite3 import *: ImportError: DLL load failed: The specified module could not be found.
Fix: see here
Run conda info --envs to see where your Conda environment lives, then place sqlite3.dll into the DLLs folder inside of the environment.
Software Development Process
This document provides a high-level overview of the software development process that our team uses for their repositories (most importantly InnerEye-DeepLearning and hi-ml).
For detailed guidance for developers, please refer to the coding guidelines.
Version Control
Software code versioning is done via GitHub. The code with the highest level of quality control is in the “main” branch. Ongoing development happens in separate branches. Once development in the branch is finished, a Pull Request (PR) process is started to integrate the branch into “main”.
Development Process
The design and development of the software in this repository is roughly separated into an initiation, prototyping, and a finalization phase. The initiation phase can be skipped for minor changes, for example an update to documentation.
Initiation
During the initiation phase, the following steps are carried out:
Collection of a set of requirements.
Creating a suggested design for the change.
Review of the design with a member of the core team.
The deliverables of this phase are a detailed design of the proposed change in a GitHub Issue or a separate document.
Prototyping
The engineering owner of the proposed change will create a branch of the current codebase. This branch is separate from the released (main) branch to not affect any current functionality. In this branch, the engineer will carry out the changes as proposed in the design document.
The engineer will also add additional software tests at different levels (unit tests, integration tests) as appropriate for the design. These tests ensure that the proposed functionality will be maintained in the future.
The deliverable of this phase is a branch in the version control system that contains all proposed changes and a set of software tests.
Finalization
At this point, the engineering owner of the proposed change has carried out all necessary changes in a branch of the codebase. They will now initiate a Pull Request process that consists of the following steps:
The code will be reviewed by at least 2 engineers. Both need to provide their explicit approval before the proposed change can be integrated in the “main” branch.
All unit and integration tests will start.
All automatic code checks will start. These checks will verify the following:
Consistency with static typing rules.
Ensure that no passwords or other types of credentials are revealed.
Ensure that none of the used third-party libraries contains high severity software vulnerabilities that could affect the proposed functionality.
For code to be accepted into the “main” branch, the following criteria need to be satisfied:
All unit and integration tests pass.
The code has been reviewed by at least 2 engineers who are members of the core development team. This review will also assess non-quantifiable aspects of the proposed change, such as clarity and readability of the code and quality of the documentation.
Any comments that have been added throughout the review process need to be resolved.
All automated checks pass.
Once all the above criteria are satisfied, the branch will be merged into “main”.
Software Configuration Management
Third party libraries (sometimes called Software of Unknown Provenance, SOUP) are consumed via two package management systems:
Conda
PyPi
Both of those package management systems maintain strict versioning: once a version of a package is published, it cannot be modified in place. Rather, a new version needs to be released.
Our training and deployment code uses Conda environment files that specify an explicit version of a dependent package to
use (for example, lightning_bolts==0.4.0
). The Conda environment files are also version controlled in GitHub. Any
change to a version of a third party library will need to be carried out via the same change management process as a code
change, with Pull Request, review, and all tests passing.
The list of third party software is maintained in GitHub in the Conda configuration file, that is environment.yml
for
Linux environments and environment_win.yml
for Windows environments. For example, this is the latest version of the
environment file for the InnerEye-DeepLearning
repository.
Defect Handling
The handling of any bugs or defects discovered is done via GitHub Issues.
When an Issue is created, the team members will get notified. Open Issues are kept in a list sorted by priority. For example, this is the list for all of the InnerEye-related Issues. Priorities are re-assessed regularly, at least once a week.
The Issue(s) with highest priority are assigned to an engineer. The engineer will then analyze the problem, and possibly request more information to reproduce the problem. Requesting more information is also handled in the GitHub Issue. Once the problem is clearly reproduced, the engineer will start to write a fix for the Issue as well as a test that ensures that the fix does indeed solve the reported problem.
Writing the fix and test will follow the process outlined above in the Prototyping and Finalization sections. In particular, the fix and test will undergo the Pull Request review process before they are merged into the main branch.
Updates to the Development Process
The development process described here is subject to change. Changes will undergo the review process described in this very document, and will be published on GitHub upon completion.
Debugging and Profiling
While using the hi-ml toolbox modules, you might encounter some errors that require running the code step by step in order to track down the source of these errors. Here are some guidelines to help you debug and/or profile hi-ml deep learning training pipelines.
Debugging within VS Code
VS code has a great Debugging Support for many programming languages. Make sure to install and enable the Python Extension for VS Code to debug hi-ml toolbox modules built in Python.
Debugging configs
The hi-ml repository is organised into multi-root workspaces to account for environment differences among the modules and to offer the flexibility to configure each module separately.
We provide a set of example debugging configs for each hi-ml module:
VS Code restricts debugging to user-written code only by default. If you want to step through external code and
standard libraries functions, set "justMyCode": false
inside the debugging config block in the launch.json
file.
In particular, if you would like to debug the current file while breaking through external libraries, navigate to
himl-projects.code-workspace
in the repo root and edit the “launch” block as follows:
"launch": {
"configurations": [
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"justMyCode": false
},
],
}
PyTorch Lightning flags for debugging and quick runs
The hi-ml toolbox is built upon PyTorch Lightning (PL) to help you build scalable deep learning models for healthcare and life sciences. Refer to Running ML Experiments with hi-ml for detailed instructions on how to build scalable pipelines within hi-ml.
Whether you’re building a brand new model or extending an existing one, you might want to make sure that your code runs as expected locally before submitting a job to AzureML. hi-ml supports a set of debugging flags that map to PyTorch Lightning Trainer arguments, helping you detect potential errors or bugs at an early stage.
These are available as part of the TrainerParams and can be used as extra command line arguments with the hi-ml runner.
pl_fast_dev_run: If set to n, runs the pipeline for only n batch(es) of train, val and test data, and for only a single epoch. Additionally, this flag disables all callbacks and hyperparameter serialization, which makes the debugging process very quick. This must be used for debugging purposes only.
pl_limit_train_batches: Limits the training dataset to the given number of batches n.
pl_limit_val_batches: Limits the validation dataset to the given number of batches n.
pl_limit_test_batches: Limits the test dataset to the given number of batches n.
In general, it is very useful to run the following two steps as part of the development cycle:
Make sure all training, validation and test loops complete properly by running the pipeline with a smaller batch size and the pl_fast_dev_run argument. Add the following to the hi-ml runner command line:
--batch-size=2 --pl-fast-dev-run=4
Make sure the whole pipeline runs properly end to end, including checkpoint callbacks and hyperparameter serialization, by running it once more with a smaller batch size while limiting the train/val/test batches for a few epochs. Add the following arguments to the hi-ml runner command line:
--batch-size=2 --pl-limit-train-batches=4 --pl-limit-val-batches=4 --pl-limit-test-batches=4 --max_epochs=4
Note: Under the hood, setting pl-fast-dev-run=n overrides pl-limit-train-batches=n, pl-limit-val-batches=n and pl-limit-test-batches=n, sets max_epochs=1, and disables all callbacks. Please keep in mind that all of the above is intended for efficient and quick debugging only.
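For reference, the plain PyTorch Lightning equivalents of these flags look roughly as follows (a sketch only; the hi-ml runner builds the Trainer for you, so you normally only pass the command line flags above):
from pytorch_lightning import Trainer

# Roughly equivalent to --pl-fast-dev-run=4: run 4 batches of train/val/test in a single epoch,
# with callbacks and logging disabled.
trainer = Trainer(fast_dev_run=4)

# Roughly equivalent to --pl-limit-train-batches=4 --pl-limit-val-batches=4 --pl-limit-test-batches=4 --max_epochs=4
trainer = Trainer(limit_train_batches=4, limit_val_batches=4, limit_test_batches=4, max_epochs=4)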
Profiling Machine Learning Pipelines
PyTorch Lightning supports a set of built-in
profilers that help you identify
bottlenecks in your code during training, testing and inference. You can trigger code profiling through the command line
argument --pl_profiler
that you can set to either
simple
,
advanced
, or
pytorch
.
The profiler outputs will be saved in a subfolder profiler
inside the outputs folder of the run. Give it a try by
adding the following arguments to the hi-ml runner:
--max_epochs=4 --pl-profiler=pytorch
Interpret PyTorch Profiling outputs via Tensorboard
PyTorch Profiler outputs can be interpreted effectively via the TensorBoard dashboard interface that is integrated into VS Code as part of the Python extension. Once you have the outputs of the PyTorch Profiler in outputs/YYYY-MM-DDTHHmmssZ_YourCustomContainer/pytorch_profiler
, you can
open the TensorBoard Profiler plugin by launching the Command Palette using the keyboard shortcut CTRL + SHIFT + P (CMD
+ SHIFT + P on a Mac) and typing the “Launch TensorBoard” command.
Next, you will be asked to select the path where the profiler traces are saved. Select another folder and navigate to outputs/YYYY-MM-DDTHHmmssZ_YourCustomContainer/pytorch_profiler
.
You will see the Profiler plugin page as shown below. The overview shows a high-level summary of model performance.
You can change the view via the dropdown list on the left.
The operator view displays the performance of every PyTorch operator that is executed either on the host or on the device.
The GPU Kernel panel shows the GPU configuration, GPU usage and Tensor Cores usage, including the time each kernel spent on the GPU.
The trace view shows a timeline of the profiled operators and GPU kernels; select an element to see its details.
For more details on how to interpret and analyze these views, refer to the official PyTorch documentation.
Advanced profiling arguments
In some scenarios, you might be interested in profiling the memory usage by setting
profile_memory=True
or any of these additional arguments.
You can specify additional profiling arguments by overriding
get_trainer_arguments
in your LightningContainer as shown below. Please make sure to specify your custom profiler under the
profiler
key and properly set dirpath=self.outputs_folder/"profiler"
so that the profiler’s outputs are saved in the
right output folder.
class YourCustomContainer(LightningContainer):
    def __init__(self, custom_param: Any) -> None:
        super().__init__()
        self.custom_param = custom_param

    def get_trainer_arguments(self) -> Dict[str, Any]:
        # Save the profiler outputs into the "profiler" subfolder of the run's outputs folder
        return {"profiler": PyTorchProfiler(dirpath=self.outputs_folder / "profiler", profile_memory=True, with_stack=True)}
The profiler will record all memory allocation/release events and the allocator’s internal state during profiling. The memory view consists of three components, as shown below.
Finally, the plugin also supports a distributed view for profiling DDP jobs with NCCL/GLOO as the backend.
Loading Images as Torch Tensors
There are many libraries available that can load png images. Simple examples were written using most of them and their execution times were compared. The goal was to load a png file, either RGB or greyscale, into a torch.Tensor.
The tensor specification was:
shape [3, Height, Width] (for RGB images, in RGB order) or [1, Height, Width] (for greyscale images);
dtype float32;
scaled to between 0.0 and 1.0.
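As a quick sanity check, a small helper along these lines (not part of hi-ml, just a sketch) can verify that a loaded tensor matches this specification:
import torch

def check_image_tensor(t: torch.Tensor) -> None:
    # Shape must be (1, H, W) for greyscale or (3, H, W) for RGB
    assert t.dim() == 3 and t.shape[0] in (1, 3), f"unexpected shape {tuple(t.shape)}"
    # Values must be float32, scaled to [0.0, 1.0]
    assert t.dtype == torch.float32, f"unexpected dtype {t.dtype}"
    assert float(t.min()) >= 0.0 and float(t.max()) <= 1.0, "values not scaled to [0.0, 1.0]"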
matplotlib
Two methods using matplotlib were compared. The first manipulates the numpy array from the image read before creating the torch tensor, the second uses torchvision to do the transformation.
from pathlib import Path
import matplotlib.image as mpimg
import numpy as np
import torch
import torchvision.transforms.functional as TF
def read_image_matplotlib(input_filename: Path) -> torch.Tensor:
"""
Read an image file with matplotlib and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
# https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imread.html
# numpy_array is a numpy.array of shape: (H, W), (H, W, 3), or (H, W, 4)
# where H = height, W = width
numpy_array = mpimg.imread(input_filename)
if len(numpy_array.shape) == 2:
# if loaded a greyscale image, then it is of shape (H, W) so add in an extra axis
numpy_array = np.expand_dims(numpy_array, 2)
# transpose to shape (C, H, W)
numpy_array = np.transpose(numpy_array, (2, 0, 1))
torch_tensor = torch.from_numpy(numpy_array)
return torch_tensor
def read_image_matplotlib2(input_filename: Path) -> torch.Tensor:
"""
Read an image file with matplotlib and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
# https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imread.html
# numpy_array is a numpy.array of shape: (H, W), (H, W, 3), or (H, W, 4)
# where H = height, W = width
numpy_array = mpimg.imread(input_filename)
torch_tensor = TF.to_tensor(numpy_array)
return torch_tensor
OpenCV
Two methods using the Python interface to OpenCV were compared. The first manipulates the numpy array from the image read before creating the torch tensor, the second uses torchvision to do the transformation. Note that OpenCV loads images in BGR format, so they need to be transformed.
from pathlib import Path
import cv2
import numpy as np
import torch
import torchvision.transforms.functional as TF
def read_image_opencv(input_filename: Path) -> torch.Tensor:
"""
Read an image file with OpenCV and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
# https://docs.opencv.org/4.5.3/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56
# numpy_array is a numpy.ndarray, in BGR format.
numpy_array = cv2.imread(str(input_filename))
numpy_array = cv2.cvtColor(numpy_array, cv2.COLOR_BGR2RGB)
# Treat the image as greyscale only if all three colour channels are identical
is_greyscale = np.array_equal(numpy_array[:, :, 0], numpy_array[:, :, 1]) and \
np.array_equal(numpy_array[:, :, 1], numpy_array[:, :, 2])
if is_greyscale:
numpy_array = numpy_array[:, :, 0]
if len(numpy_array.shape) == 2:
# if loaded a greyscale image, then it is of shape (H, W) so add in an extra axis
numpy_array = np.expand_dims(numpy_array, 2)
numpy_array = np.float32(numpy_array) / 255.0
# transpose to shape (C, H, W)
numpy_array = np.transpose(numpy_array, (2, 0, 1))
torch_tensor = torch.from_numpy(numpy_array)
return torch_tensor
def read_image_opencv2(input_filename: Path) -> torch.Tensor:
"""
Read an image file with OpenCV and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
# https://docs.opencv.org/4.5.3/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56
# numpy_array is a numpy.ndarray, in BGR format.
numpy_array = cv2.imread(str(input_filename))
numpy_array = cv2.cvtColor(numpy_array, cv2.COLOR_BGR2RGB)
# Treat the image as greyscale only if all three colour channels are identical
is_greyscale = np.array_equal(numpy_array[:, :, 0], numpy_array[:, :, 1]) and \
np.array_equal(numpy_array[:, :, 1], numpy_array[:, :, 2])
if is_greyscale:
numpy_array = numpy_array[:, :, 0]
torch_tensor = TF.to_tensor(numpy_array)
return torch_tensor
Pillow
Pillow is one of the easiest libraries to use because torchvision has a function to convert directly from Pillow images.
from pathlib import Path
from PIL import Image
import torch
import torchvision.transforms.functional as TF
def read_image_pillow(input_filename: Path) -> torch.Tensor:
"""
Read an image file with pillow and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
pil_image = Image.open(input_filename)
torch_tensor = TF.to_tensor(pil_image)
return torch_tensor
SciPy
SciPy is also easy to use because it loads images into a numpy array of the expected shape so that it can easily be transformed into a torch tensor.
from pathlib import Path
import imageio
import torch
import torchvision.transforms.functional as TF
def read_image_scipy(input_filename: Path) -> torch.Tensor:
"""
Read an image file with scipy and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
numpy_array = imageio.imread(input_filename)
torch_tensor = TF.to_tensor(numpy_array)
return torch_tensor
SimpleITK
SimpleITK requires a two step process to load an image and extract the data as a numpy array, but it is then in the correct format.
from pathlib import Path
import SimpleITK as sitk
import torch
import torchvision.transforms.functional as TF
def read_image_sitk(input_filename: Path) -> torch.Tensor:
"""
Read an image file with SimpleITK and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
itk_image = sitk.ReadImage(str(input_filename))
numpy_array = sitk.GetArrayFromImage(itk_image)
torch_tensor = TF.to_tensor(numpy_array)
return torch_tensor
scikit-image
scikit-image is also very simple to use, since it loads the image as a numpy array in the correct format.
from pathlib import Path
from skimage import io
import torch
import torchvision.transforms.functional as TF
def read_image_skimage(input_filename: Path) -> torch.Tensor:
"""
Read an image file with scikit-image and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
numpy_array = io.imread(input_filename)
torch_tensor = TF.to_tensor(numpy_array)
return torch_tensor
numpy
For comparison, the png image data was saved in the numpy native data format and then reloaded.
from pathlib import Path
import numpy as np
import torch
import torchvision.transforms.functional as TF
def read_image_numpy(input_filename: Path) -> torch.Tensor:
"""
Read a numpy file and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
numpy_array = np.load(input_filename)
torch_tensor = torch.from_numpy(numpy_array)
return torch_tensor
torch
Again, for comparison, the png image data was saved in the torch tensor native data format and then reloaded.
from pathlib import Path
import torch
def read_image_torch2(input_filename: Path) -> torch.Tensor:
"""
Read a Torch file with Torch and return a torch.Tensor.
:param input_filename: Source image file path.
:return: torch.Tensor of shape (C, H, W).
"""
torch_tensor = torch.load(input_filename)
return torch_tensor
Results
All the above methods were run against 122 small test images, repeated 10 times, so in total there were 1220 calls to each function.
RGB Images
For 61 RGB images of size 224 x 224 pixels and 61 of size 180 x 224 pixels, repeated 10 times, the timings were as follows:
| Function | Total time (s) |
|---|---|
| read_image_matplotlib | 9.81336 |
| read_image_matplotlib2 | 9.96016 |
| read_image_opencv | 12.4301 |
| read_image_opencv2 | 12.6227 |
| read_image_pillow | 16.2288 |
| read_image_scipy | 17.9958 |
| read_image_sitk | 63.6669 |
| read_image_skimage | 18.273 |
| read_image_numpy | 7.29741 |
| read_image_torch2 | 7.07304 |
Greyscale Images
Similarly, with greyscale versions of the RGB images:
| Function | Total time (s) |
|---|---|
| read_image_matplotlib | 8.32523 |
| read_image_matplotlib2 | 8.26399 |
| read_image_opencv | 11.6838 |
| read_image_opencv2 | 11.7935 |
| read_image_pillow | 15.7406 |
| read_image_scipy | 17.9061 |
| read_image_sitk | 71.8732 |
| read_image_skimage | 18.0698 |
| read_image_numpy | 7.94197 |
| read_image_torch2 | 7.73153 |
The recommendation therefore is to use matplotlib mpimg.imread
to load the image and TF.to_tensor
to transform the numpy array to a torch tensor. This is almost as fast as loading the data directly in a native numpy or torch format.
Loading Images as Numpy Arrays
Alternatively, a numpy array may be required with an equivalent form to PIL:
shape [Height, Width, 3] (for RGB images), in RGB order or [Height, Width] (for greyscale images);
dtype float;
range between 0.0 and 255.0.
Pillow
Two methods using Pillow were compared. If the image is known to be a PNG, a shortcut can be taken via PngImageFile (the second method below), which is quicker:
from pathlib import Path
import numpy as np
from PIL import PngImagePlugin
from PIL import Image
def read_image_pillow2(input_filename: Path) -> np.array: # type: ignore
"""
Read an image file with pillow and return a numpy array.
:param input_filename: Source image file path.
:return: numpy array of shape (H, W), (H, W, 3).
"""
with Image.open(input_filename) as pil_png:
return np.asarray(pil_png, np.float64)
def read_image_pillow3(input_filename: Path) -> np.array: # type: ignore
"""
Read an image file with pillow and return a numpy array.
:param input_filename: Source image file path.
:return: numpy array of shape (H, W), (H, W, 3).
"""
with PngImagePlugin.PngImageFile(input_filename) as pil_png:
return np.asarray(pil_png, np.float64)
SciPy
Similarly, using SciPy:
from pathlib import Path
import imageio
import numpy as np
def read_image_scipy2(input_filename: Path) -> np.array: # type: ignore
"""
Read an image file with scipy and return a numpy array.
:param input_filename: Source image file path.
:return: numpy array of shape (H, W), (H, W, 3).
"""
numpy_array = imageio.imread(input_filename).astype(np.float64)
return numpy_array
Results
The three above methods were tested against the same images as above.
RGB Images
For 61 RGB images of size 224 x 224 pixels and 61 of size 180 x 224 pixels, repeated 10 times, the timings were as follows:
| Function | Total time (s) |
|---|---|
| read_image_pillow2 | 44.8641 |
| read_image_pillow3 | 18.1665 |
| read_image_scipy2 | 51.8801 |
Greyscale Images
Similarly, with greyscale versions of the RGB images:
| Function | Total time (s) |
|---|---|
| read_image_pillow2 | 38.3468 |
| read_image_pillow3 | 14.664 |
| read_image_scipy2 | 39.6123 |
Whole Slide Images
Computational Pathology works with image files that can be very large, up to many GB. These files may be too large to load entirely into memory at once, or at least too large to act as training data directly. Instead, they may be split into multiple tiles of a much smaller size, e.g. 224x224 pixels, before being used for training. There are two popular libraries for handling this type of image, OpenSlide and cuCIM, but they both come with trade-offs and complications.
There is also tifffile, for which support is in development, but this is untested.
OpenSlide
There is a Python interface for OpenSlide at openslide-python, but this first requires the installation of the OpenSlide library itself. This can be done on Ubuntu with:
apt-get install openslide-tools
On Windows follow the instructions here and make sure that the install directory is added to the system path.
Once the shared library/dlls are installed, install the Python interface with:
pip install openslide-python
cuCIM
cuCIM is much easier to install; it can be done entirely with the Python package cucim. However, there are the following caveats:
It requires a GPU, with NVIDIA driver 450.36+
It requires CUDA 11.0+
It supports only a subset of tiff image files.
The suitable AzureML base Docker images are therefore the ones containing cuda11
, and the compute instance must contain a GPU.
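Assuming the requirements above are met, the package itself can be installed with pip (check the cuCIM documentation for the release matching your CUDA version):
pip install cucim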
Performance
An exploratory set of scripts for comparing loading images with OpenSlide or cuCIM, and performing tiling using both libraries can be found at slide_image_loading
.
Loading and saving at lowest resolution
Four test tiff files are used:
a 44.5 MB file with level dimensions: ((27648, 29440), (6912, 7360), (1728, 1840))
a 19.9 MB file with level dimensions: ((5888, 25344), (1472, 6336), (368, 1584))
a 5.5 MB file with level dimensions: ((27648, 29440), (6912, 7360), (1728, 1840)), but acting as a mask
a 2.1 MB file with level dimensions: ((5888, 25344), (1472, 6336), (368, 1584)), but acting as a mask
For OpenSlide the following code:
with OpenSlide(str(input_file)) as img:
    count = img.level_count
    dimensions = img.level_dimensions
    print(f"level_count: {count}")
    print(f"dimensions: {dimensions}")
    for k, v in img.properties.items():
        print(k, v)
    region = img.read_region(location=(0, 0),
                             level=count-1,
                             size=dimensions[count-1])
    region.save(output_file)
took an average of 29ms to open the file, 88ms to read the region, and 243ms to save the region as a png.
For cuCIM the following code:
img = cucim.CuImage(str(input_file))
count = img.resolutions['level_count']
dimensions = img.resolutions['level_dimensions']
print(f"level_count: {count}")
print(f"level_dimensions: {dimensions}")
print(img.metadata)
region = img.read_region(location=(0, 0),
size=dimensions[count-1],
level=count-1)
np_img_arr = np.asarray(region)
img2 = Image.fromarray(np_img_arr)
img2.save(output_file)
took an average of 369ms to open the file, 7ms to read the region and 197ms to save the region as a png, but note that it failed to handle the mask images.
Loading and saving as tiles at the medium resolution
Test code created tiles of size 224x224 pixels, loaded the mask images, and used occupancy levels to decide which tiles to create and save from level 1 - the middle resolution. This was profiled against both images, as above.
For cuCIM the total time was 4.7s, or 2.48s when only retaining the tiles as a numpy stack without saving them as pngs. cuCIM has the option of caching images, but this actually made performance slightly worse, possibly because the natural tile sizes in the original tiffs were larger than the requested tile size.
For OpenSlide the comparable total times were 5.7s and 3.26s.
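The actual scripts live in slide_image_loading; the following is only a simplified sketch of the occupancy-based tile selection, with hypothetical function names and an arbitrary occupancy threshold:
import numpy as np

def select_tile_origins(mask: np.ndarray, tile_size: int = 224, min_occupancy: float = 0.1) -> list:
    # mask: binary foreground mask at the target resolution, shape (H, W)
    height, width = mask.shape[:2]
    selected = []
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            tile_mask = mask[y:y + tile_size, x:x + tile_size]
            occupancy = np.count_nonzero(tile_mask) / tile_mask.size
            if occupancy >= min_occupancy:
                # Keep the (x, y) origin; the corresponding image tile can then be read
                # with cuCIM's or OpenSlide's read_region at the chosen level.
                selected.append((x, y))
    return selected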
Changelog
Early versions of this toolbox used a manually created changelog. As of March 2022, we have switched to using GitHub’s auto-generated changelog. If you would like to view the changelog for a particular release, you can do so on the Releases page: each release contains a “Full Changelog” link.
Changelog for Versions before March 2022
0.1.14
Added
(#227) Add TransformerPooling.
(#179) Add GaussianBlur and RotationByMultiplesOf90 augmentations. Added torchvision and opencv to the environment file since it is necessary for the augmentations.
(#193) Add transformation adaptor to hi-ml-histopathology.
(#178) Add runner script for running ML experiments.
(#181) Add computational pathology tools in hi-ml-histopathology folder.
(#187) Add mean pooling layer for MIL.
(#186) Add inference to hi-ml runner.
(#198) Add cross-validation to hi-ml runner.
(#198) Improved editor setup for VSCode.
Changed
(#227) Pooling constructor is outside of DeepMIL and inside of BaseMIL now.
(#198) Model config loader is now more flexible, can accept fully qualified class name or just top-level module name and class (like histopathology.DeepSMILECrck)
(#198) Runner raises an error when Conda environment file contains a pip include (-r) statement
(#196) Show current workspace name in error message.
Fixed
([#267](https://github.com/microsoft/hi-ml/pull/267)) Correct PYTHONPATH for Windows in VS Code settings
([#266](https://github.com/microsoft/hi-ml/pull/266)) Pin jinja2 package to avoid ‘No attribute Markup’ bug in version 3.1.0
(#246) Added tolerance to test_attentionlayers.py.
(#198) Dependencies for the histopathology folder are no longer specified in test_requirements.txt, but correctly in the histopathology Conda environment.
(#188) Updated DeepSMILES models. Now they are up to date with innereye-dl.
(#179) HEDJitter was jittering the D channel as well. StainNormalization was relying on skimage.
(#195) Fix DeepMIL metrics bug whereby hard labels were used instead of probabilities.
Removed
Deprecated
0.1.13
Added
(#170) Add utils including bag sampling, bounding boxes, HEDJitter, StainNormalisation and add attention layers
Changed
(#173) Improve report tool: allow lists of tables, option for zipping report folder, option for base64 encoding images
Removed
Deprecated
0.1.12
Added
Changed
Fixed
(#161) Empty string as target folder for a dataset creates an invalid mounting path for the dataset in AzureML (fixes #160)
(#167) Fix bugs in logging hyperparameters: logging as name/value table, rather than one column per hyperparameter. Use string logging for all hyperparameters
(#174) Fix bugs in returned local_checkpoint_path when downloading checkpoints from AML run
Removed
Deprecated
0.1.10
Added
Changed
([#129](https://github.com/microsoft/hi-ml/pull/129)) Refactor command line tools’ arguments. Refactor health_azure.utils’ various get_run functions. Replace argparsing with parametrized classes.
Fixed
Removed
Deprecated
0.1.9 (2021-10-20)
Added
Changed
(#133) Made large breaking changes to module names, from health.azure to health_azure.
([#141](https://github.com/microsoft/hi-ml/pull/141)) Update changelog for release and increase scope of test_register_environment to ensure that by default environments are registered with a version number
0.1.8 (2021-10-06)
Added
0.1.7 (2021-10-04)
Changed
(#112) Update himl_tensorboard to work with files not in ‘logs’ directory
(#106) Split into two packages. Most of the existing package was renamed to hi-ml-azure; the remainder remains hi-ml.
(#113) Add helper function to download files from AML Run, tidied up some command line args, and moved some functions from himl.py to azure_util.py
(#122) Add helper functions to upload to and download from AML Datastores
0.1.4 (2021-09-15)
This is the baseline release.
health_azure Package
- class health_azure.AzureRunInfo(input_datasets, output_datasets, mount_contexts, run, is_running_in_azure_ml, output_folder, logs_folder)[source]
This class stores all information that a script needs to run inside and outside of AzureML. It is returned from submit_to_azure_if_needed, where the return value depends on whether the script is running inside or outside AzureML.
Please check the source code for detailed documentation for all fields.
- input_datasets: List[Optional[pathlib.Path]]
A list of folders that contain all the datasets that the script uses as inputs. Input datasets must be specified when calling submit_to_azure_if_needed. Here, they are made available as Path objects. If no input datasets are specified, the list is empty.
- is_running_in_azure_ml: bool
If True, the present script is executing inside AzureML. If False, outside AzureML.
- logs_folder: Optional[pathlib.Path]
The folder into which all log files (for example, tensorboard) should be written. All files written to this folder will be uploaded to blob storage regularly during the script run.
- mount_contexts: List[azureml.dataprep.fuse.daemon.MountContext]
A list of mount contexts for input datasets when running outside AzureML. There will be a mount context for each input dataset where there is no local_folder, there is a workspace, and use_mounting is set. This list is maintained only to prevent exit from these contexts until the RunInfo object is deleted.
- output_datasets: List[Optional[pathlib.Path]]
A list of folders that contain all the datasets that the script uses as outputs. Output datasets must be specified when calling submit_to_azure_if_needed. Here, they are made available as Path objects. If no output datasets are specified, the list is empty.
- output_folder: Optional[pathlib.Path]
The output folder into which all script outputs should be written, if they should be later available in the AzureML portal. Files written to this folder will be uploaded to blob storage at the end of the script run.
- run: Optional[azureml.core.run.Run]
An AzureML Run object if the present script is executing inside AzureML, or None if outside of AzureML. The Run object has methods to log metrics, upload files, etc.
- class health_azure.DatasetConfig(name, datastore='', overwrite_existing=True, version=None, use_mounting=None, target_folder=None, local_folder=None)[source]
Contains information to use AzureML datasets as inputs or outputs.
- Parameters
name (str) – The name of the dataset, as it was registered in the AzureML workspace. For output datasets, this will be the name given to the newly created dataset.
datastore (str) – The name of the AzureML datastore that holds the dataset. This can be empty if the AzureML workspace has only a single datastore, or if the default datastore should be used.
overwrite_existing (bool) – Only applies to uploading datasets. If True, the dataset will be overwritten if it already exists. If False, the dataset creation will fail if the dataset already exists.
version (Optional[int]) – The version of the dataset that should be used. This is only used for input datasets. If the version is not specified, the latest version will be used.
use_mounting (Optional[bool]) – If True, the dataset will be “mounted”, that is, individual files will be read or written on-demand over the network. If False, the dataset will be fully downloaded before the job starts, respectively fully uploaded at job end for output datasets. Defaults: False (downloading) for datasets that are script inputs, True (mounting) for datasets that are script outputs.
target_folder (Union[Path, str, None]) – The folder into which the dataset should be downloaded or mounted. If left empty, a random folder on /tmp will be chosen. Do NOT use “.” as the target_folder.
local_folder (Union[Path, str, None]) – The folder on the local machine at which the dataset is available. This is used only for runs outside of AzureML. If this is empty then the target_folder will be used to mount or download the dataset.
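For illustration, dataset configurations might be constructed like this (a sketch; the dataset and datastore names are hypothetical) and then passed to the input_datasets / output_datasets arguments of submit_to_azure_if_needed or create_run_configuration:
from health_azure import DatasetConfig

# Hypothetical dataset and datastore names
input_dataset = DatasetConfig(name="my_input_images", datastore="my_datastore", use_mounting=True)
output_dataset = DatasetConfig(name="my_processed_images", datastore="my_datastore")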
- to_input_dataset(dataset_index, workspace, strictly_aml_v1, ml_client=None)[source]
Creates a configuration for using an AzureML dataset inside of an AzureML run. This will make the AzureML dataset with given name available as a named input, using INPUT_0 as the key for dataset index 0.
- Parameters
workspace (Workspace) – The AzureML workspace to read from.
dataset_index (int) – Suffix for using datasets as named inputs, the dataset will be marked INPUT_{index}
strictly_aml_v1 (bool) – If True, use Azure ML SDK v1. Otherwise, attempt to use Azure ML SDK v2.
ml_client (Optional[MLClient]) – An Azure MLClient object for interacting with Azure resources.
- Return type
Optional[DatasetConsumptionConfig]
- to_input_dataset_local(workspace)[source]
Return a local path to the dataset when outside of an AzureML run. If local_folder is supplied, then this is assumed to be a local dataset, and this is returned. Otherwise the dataset is mounted or downloaded to either the target folder or a temporary folder and that is returned. If self.name refers to a v2 dataset, it is not possible to mount the data here, therefore a tuple of Nones will be returned.
- Parameters
workspace (Workspace) – The AzureML workspace to read from.
- Return type
Tuple[Path, Optional[MountContext]]
- Returns
Tuple of (path to dataset, optional mountcontext)
- to_output_dataset(workspace, dataset_index)[source]
Creates a configuration to write a script output to an AzureML dataset. The name and datastore of this new dataset will be taken from the present object.
- Parameters
workspace (Workspace) – The AzureML workspace to read from.
dataset_index (int) – Suffix for using datasets as named inputs, the dataset will be marked OUTPUT_{index}
- Return type
OutputFileDatasetConfig
- Returns
An AzureML OutputFileDatasetConfig object, representing the output dataset.
- health_azure.aggregate_hyperdrive_metrics(child_run_arg_name, run_id=None, run=None, keep_metrics=None, aml_workspace=None, workspace_config_path=None)[source]
For a given HyperDriveRun object, or id of a HyperDriveRun, retrieves the metrics from each of its children and then aggregates it. Optionally filters the metrics logged in the Run, by providing a list of metrics to keep. Returns a DataFrame where each column is one child run, and each row is a metric logged by that child run. For example, for a HyperDrive run with 2 children, where each logs epoch, accuracy and loss, the result would look like:
|          | 0               | 1                  |
|----------|-----------------|--------------------|
| epoch    | [1, 2, 3]       | [1, 2, 3]          |
| accuracy | [0.7, 0.8, 0.9] | [0.71, 0.82, 0.91] |
| loss     | [0.5, 0.4, 0.3] | [0.45, 0.37, 0.29] |
here each column is one of the splits/ child runs, and each row is one of the metrics you have logged to the run.
It is possible to log rows and tables in Azure ML by calling run.log_table and run.log_row respectively. In this case, the DataFrame will contain a Dictionary entry instead of a list, where the keys are the table columns (or keywords provided to log_row), and the values are the table values. E.g.:
|                | 0                                         | 1                                         |
|----------------|-------------------------------------------|-------------------------------------------|
| accuracy_table | {'epoch': [1, 2], 'accuracy': [0.7, 0.8]} | {'epoch': [1, 2], 'accuracy': [0.8, 0.9]} |
It is also possible to log plots in Azure ML by calling run.log_image and passing in a matplotlib plot. In this case, the DataFrame will contain a string representing the path to the artifact that is generated by AML (the saved plot in the Logs & Outputs pane of your run on the AML portal). E.g.:
|               | 0                                       | 1                                      |
|---------------|-----------------------------------------|----------------------------------------|
| accuracy_plot | aml://artifactId/ExperimentRun/dcid.... | aml://artifactId/ExperimentRun/dcid... |
- Parameters
child_run_arg_name (str) – the name of the argument given to each child run to denote its position relative to other child runs (e.g. this arg could equal ‘child_run_index’ - then each of your child runs should expect to receive the arg ‘--child_run_index’ with a value <= the total number of child runs)
run (Optional[Run]) – An Azure ML HyperDriveRun object to aggregate the metrics from. Either this or run_id must be provided
run_id (Optional[str]) – The id (type: str) of a parent/ HyperDrive run. Either this or run must be provided.
keep_metrics (Optional[List[str]]) – An optional list of metric names to filter the returned metrics by
aml_workspace (Optional[Workspace]) – If run_id is provided, this is an optional AML Workspace object to retrieve the Run from
workspace_config_path (Optional[Path]) – If run_id is provided, this is an optional path to a config containing details of the AML Workspace object to retrieve the Run from.
- Return type
DataFrame
- Returns
A Pandas DataFrame containing the aggregated metrics from each child run
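For illustration, a call might look like this (a sketch; the run id and metric names are hypothetical):
from health_azure import aggregate_hyperdrive_metrics

metrics_df = aggregate_hyperdrive_metrics(child_run_arg_name="crossval_index",
                                          run_id="HD_aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
                                          keep_metrics=["epoch", "val/loss"])
print(metrics_df)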
- health_azure.create_aml_run_object(experiment_name, run_name=None, workspace=None, workspace_config_path=None, snapshot_directory=None)[source]
Creates an AzureML Run object in the given workspace, or in the workspace given by the AzureML config file. This Run object can be used to write metrics to AzureML, upload files, etc, when the code is not running in AzureML. After finishing all operations, use run.flush() to write metrics to the cloud, and run.complete() or run.fail().
Example:
>>> run = create_aml_run_object(experiment_name="run_on_my_vm", run_name="try1")
>>> run.log("foo", 1.23)
>>> run.flush()
>>> run.complete()
- Parameters
experiment_name (str) – The AzureML experiment that should hold the run that will be created.
run_name (Optional[str]) – An optional name for the run (this will be used as the display name in the AzureML UI)
workspace (Optional[Workspace]) – If provided, use this workspace to create the run in. If not provided, use the workspace specified by the config.json file in the folder or its parent folder(s).
workspace_config_path (Optional[Path]) – If not provided with an AzureML workspace, then load one given the information in this config file.
snapshot_directory (Union[Path, str, None]) – The folder that should be included as the code snapshot. By default, no snapshot is created (snapshot_directory=None or snapshot_directory=””). Set this to the folder that contains all the code your experiment uses. You can use a file .amlignore to skip specific files or folders, akin to .gitignore
- Return type
Run
- Returns
An AzureML Run object.
- health_azure.create_crossval_hyperdrive_config(num_splits, cross_val_index_arg_name='crossval_index', metric_name='val/loss')[source]
Creates an Azure ML HyperDriveConfig object for running cross validation. Note: this config expects a metric named <metric_name> to be logged in your training script (see https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#log-metrics-for-hyperparameter-tuning).
- Parameters
num_splits (int) – The number of splits for k-fold cross validation
cross_val_index_arg_name (str) – The name of the commandline argument that each of the child runs gets, to indicate which split they should work on.
metric_name (str) – The name of the metric that the HyperDriveConfig will compare runs by. Please note that it is your responsibility to make sure a metric with this name is logged to the Run in your training script
- Return type
HyperDriveConfig
- Returns
an Azure ML HyperDriveConfig object
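For illustration, a 5-fold cross validation config might be created like this (a sketch) and then passed to submit_to_azure_if_needed via its hyperdrive_config argument; the training script must accept the --crossval_index argument and log the chosen metric:
from health_azure import create_crossval_hyperdrive_config

hyperdrive_config = create_crossval_hyperdrive_config(num_splits=5,
                                                      cross_val_index_arg_name="crossval_index",
                                                      metric_name="val/loss")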
- health_azure.create_run_configuration(workspace, compute_cluster_name, conda_environment_file=None, aml_environment_name='', environment_variables=None, pip_extra_index_url='', private_pip_wheel_path=None, docker_base_image='', docker_shm_size='', num_nodes=1, max_run_duration='', input_datasets=None, output_datasets=None)[source]
Creates an AzureML run configuration, that contains information about environment, multi node execution, and Docker.
- Parameters
workspace (Workspace) – The AzureML Workspace to use.
aml_environment_name (str) – The name of an AzureML environment that should be used to submit the script. If not provided, an environment will be created from the arguments to this function (conda_environment_file, pip_extra_index_url, environment_variables, docker_base_image)
max_run_duration (str) – The maximum runtime that is allowed for this job in AzureML. This is given as a floating point number with a string suffix s, m, h, d for seconds, minutes, hours, day. Examples: ‘3.5h’, ‘2d’
compute_cluster_name (str) – The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines.
conda_environment_file (Optional[Path]) – The conda configuration file that describes which packages are necessary for your script to run.
environment_variables (Optional[Dict[str, str]]) – The environment variables that should be set when running in AzureML.
docker_base_image (str) – The Docker base image that should be used when creating a new Docker image.
docker_shm_size (str) – The Docker shared memory size that should be used when creating a new Docker image.
pip_extra_index_url (str) – If provided, use this PIP package index to find additional packages when building the Docker image.
private_pip_wheel_path (Optional[Path]) – If provided, add this wheel as a private package to the AzureML workspace.
conda_environment_file – The file that contains the Conda environment definition.
input_datasets (Optional[List[DatasetConfig]]) – The script will consume all data in folder in blob storage as the input. The folder must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in this folder as an AzureML dataset.
output_datasets (Optional[List[DatasetConfig]]) – The script will create a temporary folder when running in AzureML, and while the job writes data to that folder, upload it to blob storage, in the data store.
num_nodes (int) – The number of nodes to use in distributed training on AzureML.
- Return type
RunConfiguration
- Returns
- health_azure.create_script_run(script_params, snapshot_root_directory=None, entry_script=None)[source]
Creates an AzureML ScriptRunConfig object, that holds the information about the snapshot, the entry script, and its arguments.
- Parameters
script_params (List[str]) – A list of parameter to pass on to the script as it runs in AzureML. Required arg. Script parameters can be generated using the _get_script_params() function.
snapshot_root_directory (Optional[Path]) – The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over.
entry_script (Union[Path, str, None]) – The script that should be run in AzureML. If None, the current main Python file will be executed.
- Return type
ScriptRunConfig
- Returns
- health_azure.download_checkpoints_from_run_id(run_id, checkpoint_path_or_folder, output_folder, aml_workspace=None, workspace_config_path=None)[source]
Given an Azure ML run id, download all files from a given checkpoint directory within that run, to the path specified by output_path. If running in AML, will take the current workspace. Otherwise, if neither aml_workspace nor workspace_config_path are provided, will try to locate a config.json file in any of the parent folders of the current working directory.
- Parameters
run_id (str) – The id of the run to download checkpoints from
checkpoint_path_or_folder (str) – The path to the either a single checkpoint file, or a directory of checkpoints within the run files. If a folder is provided, all files within it will be downloaded.
output_folder (Path) – The path to which the checkpoints should be stored
aml_workspace (Optional[Workspace]) – Optional AML workspace object
workspace_config_path (Optional[Path]) – Optional workspace config file
- Return type
None
- health_azure.download_files_from_run_id(run_id, output_folder, prefix='', workspace=None, workspace_config_path=None, validate_checksum=False)[source]
For a given Azure ML run id, first retrieve the Run, and then download all files, which optionally start with a given prefix. E.g. if the Run creates a folder called “outputs”, which you wish to download all files from, specify prefix=”outputs”. To download all files associated with the run, leave prefix empty.
If not running inside AML and neither a workspace nor the config file are provided, the code will try to locate a config.json file in any of the parent folders of the current working directory. If that succeeds, that config.json file will be used to instantiate the workspace.
If function is called in a distributed PyTorch training script, the files will only be downloaded once per node (i.e, all process where is_local_rank_zero() == True). All processes will exit this function once all downloads are completed.
- Parameters
run_id (str) – The id of the Azure ML Run
output_folder (Path) – Local directory to which the Run files should be downloaded.
prefix (str) – Optional prefix to filter Run files by
workspace (Optional[Workspace]) – Optional Azure ML Workspace object
workspace_config_path (Optional[Path]) – Optional path to settings for Azure ML Workspace
validate_checksum (bool) – Whether to validate the content from HTTP response
- Return type
None
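For illustration (the run id is hypothetical), downloading everything under a run's outputs folder might look like this:
from pathlib import Path
from health_azure import download_files_from_run_id

download_files_from_run_id(run_id="my_experiment_1234567_abcd1234",
                           output_folder=Path("downloaded_files"),
                           prefix="outputs")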
- health_azure.download_from_datastore(datastore_name, file_prefix, output_folder, aml_workspace=None, workspace_config_path=None, overwrite=False, show_progress=False)[source]
Download file(s) from an Azure ML Datastore that are registered within a given Workspace. The path to the file(s) to be downloaded, relative to the datastore <datastore_name>, is specified by the parameter “prefix”. Azure will search for files within the Datastore whose paths begin with this string. If you wish to download multiple files from the same folder, set <prefix> equal to that folder’s path within the Datastore. If you wish to download a single file, include both the path to the folder it resides in, as well as the filename itself. If the relevant file(s) are found, they will be downloaded to the folder specified by <output_folder>. If this directory does not already exist, it will be created. E.g. if your datastore contains the paths [“foo/bar/1.txt”, “foo/bar/2.txt”] and you call this function with file_prefix=”foo/bar” and output_folder=”outputs”, you would end up with the files [“outputs/foo/bar/1.txt”, “outputs/foo/bar/2.txt”]
If not running inside AML and neither a workspace nor the config file are provided, the code will try to locate a config.json file in any of the parent folders of the current working directory. If that succeeds, that config.json file will be used to instantiate the workspace.
- Parameters
datastore_name (str) – The name of the Datastore containing the blob to be downloaded. This Datastore itself must be an instance of an AzureBlobDatastore.
file_prefix (str) – The prefix to the blob to be downloaded
output_folder (Path) – The directory into which the blob should be downloaded
aml_workspace (Optional[Workspace]) – Optional Azure ML Workspace object
workspace_config_path (Optional[Path]) – Optional path to settings for Azure ML Workspace
overwrite (bool) – If True, will overwrite any existing file at the same remote path. If False, will skip any duplicate file.
show_progress (bool) – If True, will show the progress of the file download
- Return type
None
- health_azure.fetch_run(workspace, run_recovery_id)[source]
Finds an existing run in an experiment, based on a recovery ID that contains the experiment ID and the actual RunId. The run can be specified either in the experiment_name:run_id format, or just the run_id.
- Parameters
workspace (Workspace) – the configured AzureML workspace to search for the experiment.
run_recovery_id (str) – The Run to find. Either in the full recovery ID format, experiment_name:run_id or just the run_id
- Return type
Run
- Returns
The AzureML run.
- health_azure.get_most_recent_run(run_recovery_file, workspace)[source]
Gets the name of the most recently executed AzureML run, instantiates that Run object and returns it.
- Parameters
run_recovery_file (Path) – The path of the run recovery file
workspace (Workspace) – Azure ML Workspace
- Return type
Run
- Returns
The Run
- health_azure.get_workspace(aml_workspace=None, workspace_config_path=None)[source]
Retrieve an Azure ML Workspace by going through the following steps:
If the function has been called from inside a run in AzureML, it returns the current AzureML workspace.
If a Workspace object has been provided in the aml_workspace argument, return that.
If a path to a Workspace config file has been provided, load the workspace according to that config file.
If a Workspace config file is present in the current working directory or one of its parents, load the
workspace according to that config file.
If 3 environment variables are found, use them to identify the workspace (HIML_RESOURCE_GROUP,
HIML_SUBSCRIPTION_ID, HIML_WORKSPACE_NAME)
If none of the above succeeds, an exception is raised.
- Parameters
aml_workspace (Optional[Workspace]) – If provided this is returned as the AzureML Workspace.
workspace_config_path (Optional[Path]) – If not provided with an AzureML Workspace, then load one given the information in this config
- Return type
Workspace
- Returns
An AzureML workspace.
- Raises
ValueError – If none of the available options for accessing the workspace succeeds.
FileNotFoundError – If the workspace config file is given in workspace_config_path, but is not present.
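For illustration, a script running outside AzureML might obtain the workspace like this (a sketch, assuming a config.json file or the HIML_* environment variables are available):
from health_azure import get_workspace

workspace = get_workspace()  # falls back to config.json or the HIML_* environment variables
print(workspace.name)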
- health_azure.health_azure_package_setup()[source]
Set up the Python packages where needed. In particular, reduce the logging level for some of the used libraries, which are particularly talkative in DEBUG mode. Usually when running in DEBUG mode, we want diagnostics about the model building itself, but not for the underlying libraries.
- Return type
None
- health_azure.is_running_in_azure_ml(aml_run=<azureml.core.run._OfflineRun object>)[source]
Returns True if the given run is inside of an AzureML machine, or False if it is on a machine outside AzureML. When called without arguments, this functions returns True if the present code is running in AzureML. Note that in runs with “compute_target=’local’” this function will also return True. Such runs execute outside of AzureML, but are able to log all their metrics, etc to an AzureML run.
- Parameters
aml_run (Run) – The run to check. If omitted, use the default run in RUN_CONTEXT
- Return type
bool
- Returns
True if the given run is inside of an AzureML machine, or False if it is a machine outside AzureML.
- health_azure.object_to_yaml(o)[source]
Converts an object to a YAML string representation. This is done by recursively traversing all attributes and writing them out to YAML if they are basic datatypes.
- Parameters
o (Any) – The object to inspect.
- Return type
str
- Returns
A string in YAML format.
- health_azure.set_environment_variables_for_multi_node()[source]
Sets the environment variables that PyTorch Lightning needs for multi-node training.
- Return type
None
- health_azure.set_logging_levels(levels)[source]
Sets the logging levels for the given module-level loggers.
- Parameters
levels (Dict[str, int]) – A mapping from module name to desired logging level.
- Return type
None
- health_azure.split_recovery_id(id_str)[source]
Splits a run ID into the experiment name and the actual run. The argument can be in the format ‘experiment_name:run_id’, or just a run ID like user_branch_abcde12_123. In the latter case, everything before the last two alphanumeric parts is assumed to be the experiment name.
- Parameters
id_str (str) – The string run ID.
- Return type
Tuple[str, str]
- Returns
experiment name and run name
- health_azure.submit_run(workspace, experiment_name, script_run_config, tags=None, wait_for_completion=False, wait_for_completion_show_output=False, display_name=None)[source]
Starts an AzureML run on a given workspace, via the script_run_config.
- Parameters
workspace (Workspace) – The AzureML workspace to use.
experiment_name (str) – The name of the experiment that will be used or created. If the experiment name contains characters that are not valid in Azure, those will be removed.
script_run_config (Union[ScriptRunConfig, HyperDriveConfig]) – The settings that describe which script should be run.
tags (Optional[Dict[str, str]]) – A dictionary of string key/value pairs, that will be added as metadata to the run. If set to None, a default metadata field will be added that only contains the commandline arguments that started the run.
wait_for_completion (bool) – If False (the default) return after the run is submitted to AzureML, otherwise wait for the completion of this run (if True).
wait_for_completion_show_output (bool) – If wait_for_completion is True this parameter indicates whether to show the run output on sys.stdout.
display_name (Optional[str]) – The name for the run that will be displayed in the AML UI. If not provided, a random display name will be generated by AzureML.
- Return type
Run
- Returns
An AzureML Run object.
- health_azure.submit_to_azure_if_needed(compute_cluster_name='', entry_script=None, aml_workspace=None, workspace_config_file=None, ml_client=None, snapshot_root_directory=None, script_params=None, conda_environment_file=None, aml_environment_name='', experiment_name=None, environment_variables=None, pip_extra_index_url='', private_pip_wheel_path=None, docker_base_image='mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:20230509.v1', docker_shm_size='100g', ignored_folders=None, default_datastore='', input_datasets=None, output_datasets=None, num_nodes=1, wait_for_completion=False, wait_for_completion_show_output=False, max_run_duration='', submit_to_azureml=None, tags=None, after_submission=None, hyperdrive_config=None, hyperparam_args=None, strictly_aml_v1=False, identity_based_auth=False, pytorch_processes_per_node_v2=None, use_mpi_run_for_single_node_jobs=True, display_name=None)[source]
Submit a folder to Azure, if needed and run it. Use the commandline flag –azureml to submit to AzureML, and leave it out to run locally.
- Parameters
after_submission (Union[Callable[[Run], None], Callable[[Job, MLClient], None], None]) – A function that will be called directly after submitting the job to AzureML. Use this to, for example, add additional tags or print information about the run. When using AzureML SDK V1, the only argument to this function is the Run object that was just submitted. When using AzureML SDK V2, the arguments are (Job, MLClient).
tags (Optional[Dict[str, str]]) – A dictionary of string key/value pairs, that will be added as metadata to the run. If set to None, a default metadata field will be added that only contains the commandline arguments that started the run.
aml_environment_name (str) – The name of an AzureML environment that should be used to submit the script. If not provided, an environment will be created from the arguments to this function.
max_run_duration (str) – The maximum runtime that is allowed for this job in AzureML. This is given as a floating point number with a string suffix s, m, h, d for seconds, minutes, hours, day. Examples: ‘3.5h’, ‘2d’
experiment_name (Optional[str]) – The name of the AzureML experiment in which the run should be submitted. If omitted, this is created based on the name of the current script.
entry_script (Union[Path, str, None]) – The script that should be run in AzureML
compute_cluster_name (str) – The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines.
conda_environment_file (Union[Path, str, None]) – The conda configuration file that describes which packages are necessary for your script to run.
aml_workspace (Optional[Workspace]) – There are two optional parameters used to glean an existing AzureML Workspace. The simplest is to pass it in as a parameter.
workspace_config_file (Union[Path, str, None]) – The 2nd option is to specify the path to the config.json file downloaded from the Azure portal from which we can retrieve the existing Workspace.
ml_client (Optional[MLClient]) – An Azure MLClient object for interacting with Azure resources.
snapshot_root_directory (Union[Path, str, None]) – The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over.
ignored_folders (Optional[List[Union[Path, str]]]) – A list of folders to exclude from the snapshot when copying it to AzureML.
script_params (Optional[List[str]]) – A list of parameters to pass on to the script as it runs in AzureML. If None (the default), these will be copied over from sys.argv (excluding the --azureml flag, if found).
environment_variables (Optional[Dict[str, str]]) – The environment variables that should be set when running in AzureML.
docker_base_image (str) – The Docker base image that should be used when creating a new Docker image. The list of available images can be found here: https://github.com/Azure/AzureML-Containers The default image is mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04
docker_shm_size (str) – The Docker shared memory size that should be used when creating a new Docker image. Default value is ‘100g’.
pip_extra_index_url (str) – If provided, use this PIP package index to find additional packages when building the Docker image.
private_pip_wheel_path (Union[Path, str, None]) – If provided, add this wheel as a private package to the AzureML workspace.
default_datastore (str) – The data store in your AzureML workspace, that points to your training data in blob storage. This is described in more detail in the README.
input_datasets (Optional[List[Union[str, DatasetConfig]]]) – The script will consume all data in folder in blob storage as the input. The folder must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in this folder as an AzureML dataset.
output_datasets (Optional[List[Union[str, DatasetConfig]]]) – The script will create a temporary folder when running in AzureML, and while the job writes data to that folder, upload it to blob storage, in the data store.
num_nodes (int) – The number of nodes to use in distributed training on AzureML. When using a value > 1, multiple nodes in AzureML will be started. If pytorch_processes_per_node_v2=None, the job will be submitted as a multi-node MPI job, with 1 process per node. This is suitable for PyTorch Lightning jobs. If pytorch_processes_per_node_v2 is not None, a job with framework “PyTorch” and communication backend “nccl” will be started. pytorch_processes_per_node_v2 will guide the number of processes per node. This is suitable for plain PyTorch training jobs without the use of frameworks like PyTorch Lightning.
wait_for_completion (bool) – If False (the default) return after the run is submitted to AzureML, otherwise wait for the completion of this run (if True).
wait_for_completion_show_output (bool) – If wait_for_completion is True this parameter indicates whether to show the run output on sys.stdout.
submit_to_azureml (Optional[bool]) – If True, the codepath to create an AzureML run will be executed. If False, the codepath for local execution (i.e., return immediately) will be executed. If not provided (None), submission to AzureML will be triggered if the commandline flag ‘--azureml’ is present in sys.argv
hyperdrive_config (Optional[HyperDriveConfig]) – A configuration object for Hyperdrive (hyperparameter search).
strictly_aml_v1 (bool) – If True, use Azure ML SDK v1. Otherwise, attempt to use Azure ML SDK v2.
pytorch_processes_per_node_v2 (Optional[int]) – For plain PyTorch multi-GPU processing: The number of processes per node. This is only supported with AML SDK v2, and ignored in v1. If supplied, the job will be submitted as using the “pytorch” framework (rather than “Python”), and using “nccl” as the communication backend.
use_mpi_run_for_single_node_jobs (bool) – If True, even single node jobs with SDK v2 will be run as distributed MPI jobs. This is required for Kubernetes compute. If False, single node jobs will not be run as distributed jobs. This setting only affects jobs submitted with SDK v2 (when strictly_aml_v1=False)
display_name (Optional[str]) – The name for the run that will be displayed in the AML UI. If not provided, a random display name will be generated by AzureML.
- Return type
- Returns
If the script is submitted to AzureML then we terminate python as the script should be executed in AzureML, otherwise we return a AzureRunInfo object.
- health_azure.torch_barrier()[source]
This is a barrier to use in distributed jobs. Use it to make all processes that participate in a distributed pytorch job to wait for each other. When torch.distributed is not set up or not found, the function exits immediately.
- Return type
None
- health_azure.upload_to_datastore(datastore_name, local_data_folder, remote_path, aml_workspace=None, workspace_config_path=None, overwrite=False, show_progress=False)[source]
Upload a folder to an Azure ML Datastore that is registered within a given Workspace. Note that this will upload all files within the folder, but will not copy the folder itself. E.g. if you specify the local_data_dir=”foo/bar” and that contains the files [“1.txt”, “2.txt”], and you specify the remote_path=”baz”, you would see the following paths uploaded to your Datastore: [“baz/1.txt”, “baz/2.txt”]
If not running inside AML and neither a workspace nor the config file are provided, the code will try to locate a config.json file in any of the parent folders of the current working directory. If that succeeds, that config.json file will be used to instantiate the workspace.
- Parameters
datastore_name (str) – The name of the Datastore to which the blob should be uploaded. This Datastore itself must be an instance of an AzureBlobDatastore
local_data_folder (Path) – The path to the local directory containing the data to be uploaded
remote_path (Path) – The path to which the blob should be uploaded
aml_workspace (Optional[Workspace]) – Optional Azure ML Workspace object
workspace_config_path (Optional[Path]) – Optional path to settings for Azure ML Workspace
overwrite (bool) – If True, will overwrite any existing file at the same remote path. If False, will skip any duplicate files and continue to the next.
show_progress (bool) – If True, will show the progress of the file download
- Return type
None
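A hedged usage sketch; the datastore name is a placeholder for an AzureBlobDatastore registered in your workspace, and the workspace itself is located via config.json as described above (or pass aml_workspace / workspace_config_path explicitly):
from pathlib import Path

from health_azure import upload_to_datastore

upload_to_datastore(
    datastore_name="my_blob_datastore",              # placeholder datastore name
    local_data_folder=Path("/tmp/my_output"),        # contents of this folder are uploaded
    remote_path=Path("datasets/images123_resized"),  # files land under this prefix
    overwrite=False,       # skip files that already exist at the same remote path
    show_progress=True,
)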
- health_azure.write_yaml_to_object(o, yaml_string, strict=False)[source]
Writes a serialized object in YAML format back into an object, assuming that the attributes of the object and the YAML field names are in sync.
- Parameters
o (Any) – The object to write to.
yaml_string (str) – A YAML formatted string with attribute names and values.
strict (bool) – If True, any mismatch of field names will raise a ValueError. If False, only a warning will be printed. Note that the object may have been modified even if an error is raised.
- Return type
None
Functions
- Creates an AzureML Run object in the given workspace, or in the workspace given by the AzureML config file.
- Creates an AzureML run configuration that contains information about the environment, multi-node execution, and Docker.
- Creates an AzureML ScriptRunConfig object that holds the information about the snapshot, the entry script, and its arguments.
- For a given Azure ML run id, first retrieve the Run, and then download all files, which optionally start with a given prefix.
- Given an Azure ML run id, download all files from a given checkpoint directory within that run, to the path specified by output_path.
- Download file(s) from an Azure ML Datastore that is registered within a given Workspace.
- Finds an existing run in an experiment, based on a recovery ID that contains the experiment ID and the actual RunId.
- Gets the name of the most recently executed AzureML run, instantiates that Run object and returns it.
- Retrieve an Azure ML Workspace by going through a fixed sequence of steps.
- Returns True if the given run is inside of an AzureML machine, or False if it is on a machine outside AzureML.
- Sets the environment variables that PyTorch Lightning needs for multi-node training.
- Splits a run ID into the experiment name and the actual run.
- Starts an AzureML run on a given workspace, via the script_run_config.
- Submit a folder to Azure, if needed, and run it.
- This is a barrier to use in distributed jobs.
- Upload a folder to an Azure ML Datastore that is registered within a given Workspace.
- Creates an Azure ML HyperDriveConfig object for running cross validation.
- For a given HyperDriveRun object, or the id of a HyperDriveRun, retrieves the metrics from each of its children and then aggregates them. Optionally filters the metrics logged in the Run, by providing a list of metrics to keep. Returns a DataFrame where each column is one child run, and each row is a metric logged by that child run. For example, for a HyperDrive run with 2 children, where each logs epoch, accuracy and loss, the result has one column per child run and one row per metric.
- Converts an object to a YAML string representation.
- Writes a serialized object in YAML format back into an object, assuming that the attributes of the object and the YAML field names are in sync.
- Set up the Python packages where needed.
- Sets the logging levels for the given module-level loggers.
Classes
- This class stores all information that a script needs to run inside and outside of AzureML.
- Contains information to use AzureML datasets as inputs or outputs.
health_ml Package
- class health_ml.Runner(project_root)[source]
This class contains the high-level logic to start a training run: choose a model configuration by name, submit to AzureML if needed, or otherwise start the actual training and test loop.
- Parameters
project_root (Path) – The root folder that contains all of the source code that should be executed.
- additional_run_tags(script_params)[source]
Gets the set of tags that will be added to the AzureML run as metadata.
- Parameters
script_params (List[str]) – The commandline arguments used to invoke the present script.
- Return type
Dict[str, str]
- parse_and_load_model()[source]
Parses the command line arguments, and creates configuration objects for the model itself, and for the Azure-related parameters. Sets self.experiment_config to its proper values. Returns the parser output from parsing the model commandline arguments.
- Return type
ParserResult
- Returns
ParserResult object containing args, overrides and settings
- run()[source]
The main entry point for training and testing models from the commandline. This chooses a model to train via a commandline argument, runs training or testing, and writes all required info to disk and logs.
- Return type
Tuple[LightningContainer, AzureRunInfo]
- Returns
a tuple of the LightningContainer object and an AzureRunInfo containing all information about the present run (whether running in AzureML or not)
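A hedged sketch of driving a run programmatically. The model to train and any AzureML options are assumed to come from the commandline (for example a --model argument), which run() parses internally:
from pathlib import Path

from health_ml import Runner

if __name__ == '__main__':
    # Sketch only: the model to run and the AzureML options are read from sys.argv,
    # as described for run() above.
    runner = Runner(project_root=Path(__file__).resolve().parent)
    container, azure_run_info = runner.run()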
- run_in_situ(azure_run_info)[source]
Actually run the AzureML job; this method will typically run on an Azure VM.
- Parameters
azure_run_info (AzureRunInfo) – Contains all information about the present run in AzureML, in particular where the datasets are mounted.
- Return type
None
- submit_to_azureml_if_needed()[source]
Submit a job to AzureML, returning the resulting Run object, or exiting if we were asked to wait for completion and the Run did not succeed.
- Return type
AzureRunInfo
- Returns
an AzureRunInfo object containing all of the details of the present run. If AzureML is not specified, the attribute 'run' will be None, but the object still contains helpful information about datasets etc.
- class health_ml.TrainingRunner(experiment_config, container, project_root=None)[source]
Driver class to run an ML experiment. Note that the project root argument MUST be supplied when using hi-ml as a package!
- Parameters
experiment_config (ExperimentConfig) – The ExperimentConfig object to use for training.
container (LightningContainer) – The LightningContainer object to use for training.
project_root (Optional[Path]) – Project root. This should only be omitted if calling run_ml from the test suite. Supplying it is crucial when using hi-ml as a package or submodule!
- after_ddp_cleanup(environ_before_training)[source]
Run process cleanup after the DDP context, to prepare for single-device inference. Kills all DDP processes besides rank 0.
- Return type
None
- end_training(environ_before_training)[source]
Cleanup after training is done. This is called after the trainer has finished fitting the data; it updates the checkpoint handler state and removes redundant checkpoint files. If running inference on a single device, it also kills all processes besides rank 0.
- Return type
None
- get_data_module()[source]
Reads the datamodule that should be used for training or validation from the container. This must be overridden in subclasses.
- Return type
LightningDataModule
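For illustration, a minimal sketch of such an override. It assumes that the LightningContainer passed to the constructor is stored on the runner as self.container and itself exposes a get_data_module() method:
from pytorch_lightning import LightningDataModule

from health_ml import TrainingRunner

class MyTrainingRunner(TrainingRunner):
    def get_data_module(self) -> LightningDataModule:
        # Assumption: the container supplied at construction time provides the
        # datamodule for training and validation.
        return self.container.get_data_module()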
- init_inference()[source]
Prepare the trainer for running inference on the validation and test set. This chooses a checkpoint, initializes the PL Trainer object, and chooses the right data module. The hook for running an extra validation epoch (LightningContainer.on_run_extra_validation_epoch) is called first, to reflect any changes to the model or datamodule states before running inference.
- Return type
None
- init_training()[source]
Execute some bookkeeping tasks only once if running distributed and initialize the runner’s trainer object.
- Return type
None
- is_crossval_disabled_or_child_0()[source]
Returns True if the present run is a non-cross-validation run, or child run 0 of a cross-validation run.
- Return type
bool
- run_training()[source]
The main training loop. It creates the PyTorch model based on the configuration options passed in, creates a PyTorch Lightning trainer, and trains the model. If a checkpoint was specified, it loads the checkpoint before resuming training. The cwd is changed to the outputs folder so that the model can write to the current working directory, and everything still ends up in the right place in AzureML (only the contents of the "outputs" folder are treated as result files).
- Return type
None
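Putting these methods together, a hedged sketch of a typical call order. The ExperimentConfig and LightningContainer instances are assumed to be created elsewhere, and environ_before_training is assumed to be a snapshot of os.environ taken before training starts:
import os
from pathlib import Path

from health_ml import TrainingRunner

def run_experiment(experiment_config, container) -> None:
    # Sketch only, not the library's own driver code.
    runner = TrainingRunner(experiment_config=experiment_config,
                            container=container,
                            project_root=Path.cwd())
    runner.init_training()                        # one-off bookkeeping, build the trainer
    environ_before_training = dict(os.environ)    # assumption: environment snapshot before training
    runner.run_training()                         # fit the model inside the outputs folder
    runner.end_training(environ_before_training)  # tidy checkpoints, stop non-rank-0 DDP processes
    runner.init_inference()                       # pick a checkpoint, rebuild the trainer for validation/test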
Classes
- Driver class to run an ML experiment.
- This class contains the high-level logic to start a training run: choose a model configuration by name, submit to AzureML if needed, or otherwise start the actual training and test loop.