Documentation for the Health Intelligence Machine Learning toolbox hi-ml
First steps: How to get started with hi-ml
Setting up AzureML
You need to have an AzureML workspace in your Azure subscription.
Download the config file from your AzureML workspace, as described
here. Put this file (it
should be called config.json
) into the folder where your script lives, or one of its parent folders. You can use
parent folders up to the last parent that is still included in the PYTHONPATH
environment variable: hi-ml
will
try to be smart and search through all folders that it thinks belong to your current project.
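For reference, the downloaded config.json typically has this shape; the values shown are placeholders for your own workspace details:
{
  "subscription_id": "<your Azure subscription ID>",
  "resource_group": "<your resource group>",
  "workspace_name": "<your workspace name>"
}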
Using the AzureML integration layer
Consider a simple use case where you have a Python script that does something - this could be training a model, or pre-processing some data. The hi-ml package makes it easy to run that script on Azure Machine Learning (AML) services.
Here is an example script that reads images from a folder, resizes and saves them to an output folder:
from pathlib import Path

if __name__ == '__main__':
    input_folder = Path("/tmp/my_dataset")
    output_folder = Path("/tmp/my_output")
    for file in input_folder.glob("*.jpg"):
        # read_image and write_image are placeholders for your own image I/O helpers
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_folder / file.name)
Doing that at scale can take a long time. We’d like to run that script in AzureML, consume the data from a folder in blob storage, and write the results back to blob storage, so that we can later use it as an input for model training.
You can achieve that by adding a call to submit_to_azure_if_needed from the hi-ml package:
from pathlib import Path
from health.azure import submit_to_azure_if_needed

if __name__ == '__main__':
    current_file = Path(__file__)
    run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12",
                                         input_datasets=["images123"],
                                         # Omit this line if you don't create an output dataset (for example, in
                                         # model training scripts)
                                         output_datasets=["images123_resized"],
                                         default_datastore="my_datastore")
    # When running in AzureML, run_info.input_datasets and run_info.output_datasets will be populated,
    # and point to the data coming from blob storage. For runs outside AML, the paths will be None.
    # Replace the None with a meaningful path, so that we can still run the script easily outside AML.
    input_dataset = run_info.input_datasets[0] or Path("/tmp/my_dataset")
    output_dataset = run_info.output_datasets[0] or Path("/tmp/my_output")
    files_processed = []
    for file in input_dataset.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_dataset / file.name)
        files_processed.append(file.name)
    # Any other files that you would not consider an "output dataset", like metrics, etc, should be written to
    # a folder "./outputs". Any files written into that folder will later be visible in the AzureML UI.
    # run_info.output_folder already points to the correct folder.
    stats_file = run_info.output_folder / "processed_files.txt"
    stats_file.write_text("\n".join(files_processed))
Once these changes are in place, you can submit the script to AzureML by supplying an additional --azureml flag on the commandline, like python myscript.py --azureml.
Note that you do not need to modify the argument parser of your script to recognize the --azureml flag.
Essential arguments to submit_to_azure_if_needed
When calling submit_to_azure_if_needed, you can supply the following parameters:
- compute_cluster_name (mandatory): The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines. See here for documentation.
- entry_script: The script that should be run. If omitted, the hi-ml package will assume that you would like to submit the script that is presently running, given in sys.argv[0].
- snapshot_root_directory: The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over. This defaults to the current working directory, but can be one of its parents. If you would like to explicitly skip some folders inside the snapshot_root_directory, then use ignored_folders to specify those.
- conda_environment_file: The conda configuration file that describes which packages are necessary for your script to run. If omitted, the hi-ml package searches for a file called environment.yml in the current folder or its parents.
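To make these arguments concrete, here is a minimal sketch of a call that sets all of them explicitly. The cluster name, script name, and folder names are placeholders for your own setup:
from pathlib import Path
from health.azure import submit_to_azure_if_needed

run_info = submit_to_azure_if_needed(compute_cluster_name="my-cluster",
                                     entry_script=Path("myscript.py"),
                                     snapshot_root_directory=Path.cwd(),
                                     ignored_folders=["outputs", "logs"],
                                     conda_environment_file=Path("environment.yml"))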
You can also supply an input dataset. For data pre-processing scripts, you can add an output dataset (omit this for ML training scripts).
To use datasets, you need to provision a data store in your AML workspace that points to your training data in blob storage. This is described here.
- input_datasets=["images123"] in the code above means that the script will consume all data in folder images123 in blob storage as the input. The folder must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in this folder as an AML dataset.
- output_datasets=["images123_resized"] means that the script will create a temporary folder when running in AML; while the job writes data to that folder, the contents are uploaded to blob storage, in the given data store.
For more examples, please see examples.md. For more details about datasets, see here.
Additional arguments you should know about
submit_to_azure_if_needed has a large number of arguments; please check the API documentation for an exhaustive list. The particularly helpful ones are listed below.
- experiment_name: All runs in AzureML are grouped in "experiments". By default, the experiment name is determined by the name of the script you submit, but you can specify a name explicitly with this argument.
- environment_variables: A dictionary with the contents of all environment variables that should be set inside the AzureML run, before the script is started.
- docker_base_image: This specifies the name of the Docker base image to use for creating the Python environment for your script. The amount of memory to allocate for Docker is given by docker_shm_size.
- num_nodes: The number of nodes on which your script should run. This is essential for distributed training.
- tags: A dictionary mapping from string to string, with additional tags that will be stored on the AzureML run. This is helpful to add metadata about the run for later use.
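As an illustration, the following sketch supplies these optional arguments together. All values (cluster name, Docker image, tags) are placeholders, not recommendations:
from health.azure import submit_to_azure_if_needed

run_info = submit_to_azure_if_needed(compute_cluster_name="my-cluster",
                                     experiment_name="image_preprocessing",
                                     environment_variables={"MY_SETTING": "1"},
                                     docker_base_image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
                                     docker_shm_size="16g",
                                     num_nodes=1,
                                     tags={"dataset": "images123", "purpose": "demo"})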
Conda environments, Alternate pips, Private wheels
The function submit_to_azure_if_needed
tries to locate a Conda environment file in the current folder,
or in the Python path, with the name environment.yml
. The actual Conda environment file to use can be specified
directly with:
run_info = submit_to_azure_if_needed(
    ...,
    conda_environment_file=conda_environment_file)
where conda_environment_file
is a pathlib.Path
or a string identifying the Conda environment file to use.
The basic use of Conda assumes that packages listed are published Conda packages or published Python packages on PyPI. However, during development, the Python package may be on Test.PyPI, or in some other location, in which case the alternative package location can be specified directly with:
run_info = submit_to_azure_if_needed(
    ...,
    pip_extra_index_url="https://test.pypi.org/simple/")
Finally, if the package is only available locally, it is possible to use a private wheel:
run_info = submit_to_azure_if_needed(
    ...,
    private_pip_wheel_path=private_pip_wheel_path)
where private_pip_wheel_path
is a pathlib.Path
or a string identifying the wheel package to use. In this case,
this wheel will be copied to the AzureML environment as a private wheel.
Connecting to Azure
Authentication
The hi-ml package supports two ways of authenticating with Azure.
The default is what is called “Interactive Authentication”. When you submit a job to Azure via hi-ml
, this will
use the credentials you used in the browser when last logging into Azure. If there are no credentials yet, you should
see instructions printed out to the console about how to log in using your browser.
We recommend using Interactive Authentication.
Alternatively, you can use a so-called Service Principal, for example within build pipelines.
Service Principal Authentication
A Service Principal is a form of generic identity or machine account. This is essential if you would like to submit training runs from code, for example from within an Azure pipeline. You can find more information about application registrations and service principal objects here.
If you would like to use a Service Principal, you will need to create it in Azure first, and then store 3 pieces of information in 3 environment variables; please see the instructions below. When all 3 environment variables are in place, your Azure submissions will automatically use the Service Principal to authenticate.
Creating the Service Principal
1. Navigate back to aka.ms/portal.
2. Navigate to App registrations (use the top search bar to find it).
3. Click on + New registration on the top left of the page.
4. Choose a name for your application, e.g. MyServicePrincipal, and click Register.
5. Once it is created, you will see your application in the list appearing under App registrations. This step might take a few minutes.
6. Click on the resource to access its properties. In particular, you will need the application ID. You can find this ID in the Overview tab (accessible from the list on the left of the page).
7. Create an environment variable called HIML_SERVICE_PRINCIPAL_ID, and set its value to the application ID you just saw.
8. You need to create an application secret to access the resources managed by this service principal. On the pane on the left, find Certificates & Secrets. Click on + New client secret (bottom of the page), and note down your token. Warning: the token is displayed only once, at creation; you will not be able to display it again later.
9. Create an environment variable called HIML_SERVICE_PRINCIPAL_PASSWORD, and set its value to the token you just added.
Providing permissions to the Service Principal
Now that your service principal is created, you need to give permission for it to access and manage your AzureML workspace. To do so:
1. Go to your AzureML workspace. To find it, you can type the name of your workspace in the search bar above.
2. On the Overview page, there is a link to the Resource Group that contains the workspace. Click on that.
3. When on the Resource Group, navigate to Access control. Then click on + Add > Add role assignment. A pane will appear on the right. Select Role > Contributor. In the Select field, type the name of your Service Principal and select it. Finish by clicking Save at the bottom of the pane.
Azure Tenant ID
The last remaining piece is the Azure tenant ID, which also needs to be available in an environment variable. To get that ID:
1. Log into Azure.
2. Via the search bar, find "Azure Active Directory" and open it.
3. In the Overview of that, you will see a field "Tenant ID".
4. Create an environment variable called HIML_TENANT_ID, and set that to the tenant ID you just saw.
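As a sketch, the three variables could be set from Python before submission; the values are placeholders for your own Service Principal details:
import os

# Hypothetical values - replace with the IDs and secret created above.
os.environ["HIML_SERVICE_PRINCIPAL_ID"] = "<application ID>"
os.environ["HIML_SERVICE_PRINCIPAL_PASSWORD"] = "<client secret>"
os.environ["HIML_TENANT_ID"] = "<tenant ID>"
In practice, you would more likely set these in your shell profile or in your build pipeline's variable settings, so that the secret does not live in code.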
Datasets
Key concepts
We’ll first outline a few concepts that are helpful for understanding datasets.
Blob Storage
Firstly, there is Azure Blob Storage.
Each blob storage account has multiple containers - you can think of containers as big disks that store files.
The hi-ml
package assumes that your datasets live in one of those containers, and each top level folder corresponds
to one dataset.
AzureML Data Stores
Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described here. Data stores provide access to one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around in the code - rather, the credentials are stored in the data store once and for all.
You can view all data stores in your AzureML workspace by clicking on one of the bottom icons in the left-hand navigation bar of the AzureML studio.
One of these data stores is designated as the default data store.
AzureML Datasets
Thirdly, there are datasets. Again, this is a concept coming from Azure Machine Learning. A dataset is defined by:
- A data store
- A set of files accessed through that data store
You can view all datasets in your AzureML workspace by clicking on one of the icons in the left-hand navigation bar of the AzureML studio.
Preparing data
To simplify usage, the hi-ml package creates AzureML datasets for you. All you need to do is:
1. Create a blob storage account for your data, and within it, a container for your data.
2. Create a data store that points to that storage account, and store the credentials for the blob storage account in it.
From that point on, you can drop a folder of files in the container that holds your data. Within the hi-ml package, just reference the name of the folder, and the package will create a dataset for you, if it does not yet exist.
Using the datasets
The simplest way of specifying that your script uses a folder of data from blob storage is as follows: Add the
input_datasets
argument to your call of submit_to_azure_if_needed
like this:
from health.azure import submit_to_azure_if_needed

run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore")
input_folder = run_info.input_datasets[0]
What will happen under the hood?
The toolbox will check if there is already an AzureML dataset called “my_folder”. If so, it will use that. If there is no dataset of that name, it will create one from all the files in blob storage in folder “my_folder”. The dataset will be created using the data store provided, “my_datastore”.
Once the script runs in AzureML, it will download the dataset "my_folder" to a temporary folder. You can access this temporary location via run_info.input_datasets[0], and read the files from it.
More complicated setups are described below.
Input and output datasets
Any run in AzureML can consume a number of input datasets. In addition, an AzureML run can also produce an output dataset (or even more than one).
Output datasets are helpful if you would like to run, for example, a script that transforms one dataset into another.
You can use that via the output_datasets
argument:
from health.azure import submit_to_azure_if_needed

run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     output_datasets=["new_dataset"],
                                     default_datastore="my_datastore")
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
Your script can now read files from input_folder
, transform them, and write them to output_folder
. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.
Mounting and downloading
An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted, the files are accessed via the network once needed - this is very helpful for large datasets where downloads would create a long waiting time before the job start.
Similarly, an output dataset can be uploaded at the end of the script, or it can be mounted. Mounting here means that all files will be written to blob storage already while the script runs (rather than at the end).
Note: If you are using mounted output datasets, you should NOT rename files in the output folder.
Mounting and downloading can be triggered by passing in DatasetConfig
objects for the input_datasets
argument,
like this:
from health.azure import DatasetConfig, submit_to_azure_if_needed

input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mounting=True)
output_dataset = DatasetConfig(name="new_dataset", datastore="my_datastore", use_mounting=True)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
                                     output_datasets=[output_dataset])
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
Local execution
For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML. Clearly, your script needs to be able to access data in those runs too.
There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the DatasetConfig objects:
from pathlib import Path
from health.azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              local_folder=Path("/datasets/my_folder_local"))
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset])
input_folder = run_info.input_datasets[0]
Secondly, you can check the returned path in run_info, and replace it with something for local execution. run_info.input_datasets[0] will be None if the script runs outside of AzureML and no local_folder is available.
from pathlib import Path
from health.azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore")
input_folder = run_info.input_datasets[0] or Path("/datasets/my_folder_local")
Making a dataset available at a fixed folder location
Occasionally, scripts expect the input dataset at a fixed location, for example, data is always read from /tmp/mnist
.
AzureML has the capability to download/mount a dataset to such a fixed location. With the hi-ml
package, you can
trigger that behaviour via an additional option in the DatasetConfig
objects:
from health.azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              use_mounting=True,
                              target_folder="/tmp/mnist")
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset])
# input_folder will now be "/tmp/mnist"
input_folder = run_info.input_datasets[0]
Dataset versions
AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML workspace. The hi-ml toolbox always uses the latest version of a dataset, unless you specify otherwise.
If you do need a specific version, use the version
argument in the DatasetConfig
objects:
from health.azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              version=7)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset])
input_folder = run_info.input_datasets[0]
Hyperparameter Search via Hyperdrive
HyperDrive runs
can start multiple AzureML jobs in parallel. This can be used for tuning hyperparameters, or executing multiple
training runs for cross validation. To use that with the hi-ml
package, simply supply a HyperDrive configuration
object as an additional argument. Note that this object needs to be created with an empty run_config argument (this will later be replaced with the correct run_config that submits your script).
The example below shows a hyperparameter search that aims to minimize the validation loss val_loss
, by choosing
one of three possible values for the learning rate commandline argument learning_rate
.
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from health.azure.himl import submit_to_azure_if_needed
hyperdrive_config = HyperDriveConfig(
    run_config=ScriptRunConfig(source_directory=""),
    hyperparameter_sampling=GridParameterSampling(
        parameter_space={
            "learning_rate": choice([0.1, 0.01, 0.001])
        }),
    primary_metric_name="val_loss",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=5)
submit_to_azure_if_needed(..., hyperdrive_config=hyperdrive_config)
For further examples, please check the example scripts here, and the HyperDrive documentation.
Commandline tools
Run TensorBoard
From the command line, run the command
himl-tb
specifying one of
[--experiment_name] [--latest_run_file] [--run_recovery_ids]
This will start a TensorBoard session, by default running on port 6006. To use an alternative port, specify this with --port
.
If --experiment_name
is provided, the most recent Run from this experiment will be visualised.
If --latest_run_file
is provided, the script will expect to find a RunId in this file.
Alternatively you can specify the Runs to visualise via --run_recovery_ids
or --run_ids
.
You can specify the location where TensorBoard logs will be stored, using the --run_logs_dir
argument.
If you choose to specify --experiment_name
, you can also specify --num_runs
to view and/or --tags
to filter by.
If your AML config path is not ROOT_DIR/config.json, you must also specify --config_path
.
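For example, to visualise the most recent run of a given experiment on a non-default port (the experiment name is a placeholder):
himl-tb --experiment_name my_experiment --port 6007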
Download files from AML Runs
From the command line, run the command
himl-download
specifying one of
[--experiment_name] [--latest_run_file] [--run_recovery_ids] [--run_ids]
If --experiment_name
is provided, the most recent Run from this experiment will be downloaded.
If --latest_run_file
is provided, the script will expect to find a RunId in this file.
Alternatively you can specify the Runs to download via --run_recovery_ids
or --run_ids
.
The files associated with your Run(s) will be downloaded to the location specified with --output_dir (by default ROOT_DIR/outputs).
If you choose to specify --experiment_name
, you can also specify --num_runs
to view and/or --tags
to filter by.
If your AML config path is not ROOT_DIR/config.json
, you must also specify --config_path
.
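For example, to download the files from the run whose identifier was written to most_recent_run.txt by an earlier submission (see below):
himl-download --latest_run_file most_recent_run.txt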
Examples
Note: All examples below contain links to sample scripts that are also included in the repository.
The experience is optimized for use on readthedocs. When navigating to the sample scripts on the github UI,
you will only see the .rst
file that links to the .py
file. To access the .py
file, go to the folder that
contains the respective .rst
file.
Basic integration
The sample examples/1/sample.py is a script that takes an optional command line argument of a target value and prints all the prime numbers up to (but not including) this target. It is simply intended to demonstrate a long-running operation that we want to run in Azure. Run it using e.g.
cd examples/1
python sample.py -n 103
The sample examples/2/sample.py shows the minimal modifications to run this in AzureML. Firstly, create an AzureML workspace and download the config file, as explained here. The config file should be placed in the same folder as the sample script. A sample Conda environment file is supplied. Install the hi-ml package into the current environment. Finally, add the following to the sample script:
from health.azure.himl import submit_to_azure_if_needed
...
def main() -> None:
    _ = submit_to_azure_if_needed(
        compute_cluster_name="lite-testing-ds2",
        wait_for_completion=True,
        wait_for_completion_show_output=True)
Replace lite-testing-ds2
with the name of a compute cluster created within the AzureML workspace.
If this script is invoked as the first sample, e.g.
cd examples/2
python sample.py -n 103
then the output will be exactly the same. But if the script is invoked as follows:
cd examples/2
python sample.py -n 103 --azureml
then the function submit_to_azure_if_needed
will perform all the required actions to run this script in AzureML and exit. Note that:
- code after submit_to_azure_if_needed is not run locally, but it is run in AzureML.
- the print statement prints to the AzureML console output and is available in the Output + logs tab of the experiment, in the 70_driver_log.txt file, and can be downloaded from there.
- the command line arguments are passed through (apart from --azureml) when running in AzureML.
- a new file, most_recent_run.txt, will be created containing an identifier of this AzureML run.
A sample script examples/2/results.py demonstrates how to programmatically download the driver log file.
Output files
The sample examples/3/sample.py demonstrates output file handling when running on AzureML. Because each run is performed in a separate VM or cluster, any file output is not generally preserved. In order to keep the output, it should be written to the outputs folder when running in AzureML. The AzureML infrastructure will preserve this, and it will be available for download from the outputs folder in the Output + logs tab.
Make the following additions:
run_info = submit_to_azure_if_needed(
...
parser.add_argument("-o", "--output", type=str, default="primes.txt", required=False, help="Output file name")
...
output = run_info.output_folder / args.output
output.write_text("\n".join(map(str, primes)))
When running locally submit_to_azure_if_needed
will create a subfolder called outputs
and then the output can be written to the file args.output
there. When running in AzureML the output will be available in the file args.output
in the Experiment.
A sample script examples/3/results.py demonstrates how to programmatically download the output file.
Output datasets
The sample examples/4/sample.py demonstrates output dataset handling when running on AzureML.
In this case, the following parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
    ...,
    default_datastore="himldatasets",
    output_datasets=["himl_sample4_output"])
The default_datastore
is required if using the simplest configuration for an output dataset, to just use the blob container name. There is an alternative that doesn’t require the default_datastore
and allows a different datastore for each dataset:
from health.azure.datasets import DatasetConfig
...
run_info = submit_to_azure_if_needed(
    ...,
    output_datasets=[DatasetConfig(name="himl_sample4_output", datastore="himldatasets")])
Now the output folder is constructed as follows:
output_folder = run_info.output_datasets[0] or Path("outputs") / "himl_sample4_output"
output_folder.mkdir(parents=True, exist_ok=True)
output = output_folder / args.output
When running in AzureML, run_info.output_datasets[0] will be populated using the new parameter, and the output will be written to that blob storage. When running locally, run_info.output_datasets[0] will be None, and a local folder will be created and used.
A sample script examples/4/results.py demonstrates how to programmatically download the output dataset file.
For more details about datasets, see here.
Input datasets
This example trains a simple classifier on a toy dataset, first creating the dataset files and then in a second script training the classifier.
The script examples/5/inputs.py is provided to prepare the CSV files. Run the script to download the Iris dataset and create two CSV files:
cd examples/5
python inputs.py
The training script examples/5/sample.py is modified from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train_iris.py to work with input csv files. Start it to train the actual classifier, based on the data files that were just written:
cd examples/5
python sample.py
Including input files in the snapshot
When using very small data files (on the order of a few MB), the easiest way to get the input data to Azure is to include them in the set of (source) files that are uploaded to Azure. You can run the dataset creation script on your local machine, writing the resulting two files to the same folder where your training script is located, and then submit the training script to AzureML. Because the dataset files are in the same folder, they will automatically be uploaded to AzureML.
However, it is not ideal to have the input files in the snapshot: The size of the snapshot is limited to 25 MB. It is better to put the data files into blob storage and use input datasets.
Creating the dataset in AzureML
The suggested way of creating a dataset is to run a script in AzureML that writes an output dataset. This is particularly important for large datasets, to avoid the usually low bandwidth from a local machine to the cloud.
This is shown in examples/6/inputs.py:
This script prepares the CSV files in an AzureML run, and writes them to an output dataset called himl_sample6_input
.
The relevant code parts are:
run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    default_datastore="himldatasets",
    output_datasets=["himl_sample6_input"])
# The dataset files should be written into this folder:
dataset = run_info.output_datasets[0] or Path("dataset")
Run the script:
cd examples/6
python inputs.py --azureml
You can now modify the training script examples/6/sample.py to use the newly created dataset
himl_sample6_input
as an input. To do that, the following parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    default_datastore="himldatasets",
    input_datasets=["himl_sample6_input"])
When running in AzureML, the dataset will be downloaded before running the job. You can access the temporary folder where the dataset is available like this:
input_folder = run_info.input_datasets[0] or Path("dataset")
The part behind the or operator is only necessary to keep a reasonable behaviour when running outside of AzureML: When running in AzureML, run_info.input_datasets[0] will be populated using the input dataset specified in the call to submit_to_azure_if_needed, and the input will be downloaded from blob storage. When running locally, run_info.input_datasets[0] will be None, and a local folder should be populated and used.
The default_datastore is required if using the simplest configuration for an input dataset. There are alternatives that do not require the default_datastore and allow a different datastore for each dataset, for example:
from health.azure.datasets import DatasetConfig
...
run_info = submit_to_azure_if_needed(
    ...,
    input_datasets=[DatasetConfig(name="himl_sample7_input", datastore="himldatasets")])
For more details about datasets, see here.
Uploading the input files manually
An alternative to writing the dataset in AzureML (as suggested above) is to create the files on the local machine, and upload them directly to Azure blob storage.
This is shown in examples/7/inputs.py: This script prepares the CSV files
and uploads them to blob storage, in a folder called himl_sample7_input
. Run the script:
cd examples/7
python inputs_via_upload.py
As in the above example, you can now modify the training script examples/7/sample.py to use
an input dataset that has the same name as the folder where the files just got uploaded. In this case, the following
parameters are added to submit_to_azure_if_needed
:
run_info = submit_to_azure_if_needed(
    ...,
    default_datastore="himldatasets",
    input_datasets=["himl_sample7_input"])
Hyperdrive
The sample examples/8/sample.py demonstrates adding hyperparameter tuning. This shows the same hyperparameter search as in the AzureML sample.
Make the following additions:
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.train.hyperdrive.sampling import RandomParameterSampling
...
def main() -> None:
    param_sampling = RandomParameterSampling({
        "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
        "--penalty": choice(0.5, 1, 1.5)
    })
    hyperdrive_config = HyperDriveConfig(
        run_config=ScriptRunConfig(source_directory=""),
        hyperparameter_sampling=param_sampling,
        primary_metric_name='Accuracy',
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=12,
        max_concurrent_runs=4)
    run_info = submit_to_azure_if_needed(
        ...,
        hyperdrive_config=hyperdrive_config)
Note that this example does not make sense to run locally; it should always be run in AzureML. When invoked with:
cd examples/8
python sample.py --azureml
this will perform a Hyperdrive run in AzureML, i.e. there will be 12 child runs, each randomly drawing from the parameter sample space. AzureML can plot the metrics from the child runs, but to do that, some small modifications are required.
Add in:
run = run_info.run
...
args = parser.parse_args()
# Use the plain built-ins str and float here; the deprecated np.str and np.float
# aliases have been removed from recent NumPy versions.
run.log('Kernel type', str(args.kernel))
run.log('Penalty', float(args.penalty))
...
print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
run.log('Accuracy', float(accuracy))
and these metrics will be displayed on the child runs tab in the Experiment page on AzureML.
Controlling when to submit to AzureML and when not
By default, the hi-ml
package assumes that you supply a commandline argument --azureml
(that can be anywhere on
the commandline) to trigger a submission of the present script to AzureML. If you wish to control it via a different
flag, coming out of your own argument parser, use the submit_to_azureml
argument of the function
health.azure.himl.submit_to_azure_if_needed
.
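As a sketch, submission could be tied to a flag from your own parser; the flag name --cloud and the cluster name are illustrative, not part of the hi-ml API:
from argparse import ArgumentParser
from health.azure.himl import submit_to_azure_if_needed

parser = ArgumentParser()
parser.add_argument("--cloud", action="store_true", help="Submit this script to AzureML")
args, _ = parser.parse_known_args()
run_info = submit_to_azure_if_needed(compute_cluster_name="my-cluster",
                                     submit_to_azureml=args.cloud)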
Notes for developers
Creating a Conda environment
To create a separate Conda environment with all packages that hi-ml
requires for running and testing,
use the provided environment.yml
file. Create a Conda environment called himl
from that via
conda env create --file environment.yml
conda activate himl
Using specific versions of hi-ml in your Python environments
If you’d like to test specific changes to the hi-ml
package in your code, you can use two different routes:
1. You can clone the hi-ml repository on your machine, and use hi-ml in your Python environment via a local package install:
pip install -e <your_git_folder>/hi-ml
2. You can consume an early version of the package from test.pypi.org via pip:
pip install --extra-index-url https://test.pypi.org/simple/ hi-ml==0.1.0.post165
3. If you are using Conda, you can add an additional parameter for pip into the Conda environment.yml file like this:
name: foo
dependencies:
  - pip=20.1.1
  - python=3.7.3
  - pip:
    - --extra-index-url https://test.pypi.org/simple/
    - hi-ml==0.1.0.post165
Common things to do
The repository contains a makefile with definitions for common operations.
- make check: Run flake8 and mypy on the repository.
- make test: Run flake8 and mypy on the repository, then all tests via pytest.
- make pip: Install all packages for running and testing in the current interpreter.
- make conda: Update the hi-ml Conda environment and activate it.
Building documentation
To build the sphinx documentation, you must have sphinx and related packages installed
(see build_requirements.txt
in the repository root). Then run:
cd docs
make html
This will build all your documentation in docs/build/html
.
Setting up your AzureML workspace
1. In the browser, navigate to the AzureML workspace that you want to use for running your tests.
2. In the top right section, there will be a dropdown menu showing the name of your AzureML workspace. Expand that.
3. In the panel, there is a link "Download config file". Click that.
4. This will download a file config.json. Move that file to the root folder of your hi-ml repository. The file name is already present in .gitignore, and will hence not be checked in.
Creating and Deleting Docker Environments in AzureML
- Passing a docker_base_image into submit_to_azure_if_needed causes a new image to be built and registered in your workspace (see docs for more information).
- To remove an environment, use the az ml environment delete function in the AzureML CLI (note that all the parameters need to be set, none are optional).
Testing
For all of the tests to work locally you will need to cache your AzureML credentials. One simple way to do this is to
run the example in src/health/azure/examples
(i.e. run python elevate_this.py --message='Hello World' --azureml
or
make example
) after editing elevate_this.py
to reference your compute cluster.
When running the tests locally, they can either be run against the source directly, or the source built into a package.
- To run the tests against the source directly in the local src folder, ensure that there is no wheel in the dist folder (for example by running make clean). If a wheel is not detected, then the local src folder will be copied into the temporary test folder as part of the test process.
- To run the tests against the source as a package, build it with make build. This will build the local src folder into a new wheel in the dist folder. This wheel will be detected and passed to AzureML as a private package as part of the test process.
Contributing to this toolbox
We welcome all contributions that help us achieve our aim of speeding up ML/AI research in health and life sciences. Examples of contributions are
Data loaders for specific health & life sciences data
Network architectures and components for deep learning models
Tools to analyze and/or visualize data
All contributions to the toolbox need to come with unit tests, and will be reviewed when a Pull Request (PR) is started.
If in doubt, reach out to the core hi-ml
team before starting your work.
Please look through the existing folder structure to find a good home for your contribution.
Submitting a Pull Request
If you’d like to submit a PR to the codebase, please ensure you:
Include a brief description
Link to an issue, if relevant
Write unit tests for the code - see below for details.
Add appropriate documentation for any new code that you introduce
Code style
- We use flake8 as a linter, and mypy for static typechecking. Both tools run as part of the PR build, and must run without errors for a contribution to be accepted. mypy requires that all functions and methods carry type annotations; see the mypy documentation.
- We highly recommend running both tools before pushing the latest changes to a PR. If you have make installed, you can run both tools in one go via make check (from the repository root folder).
- Code should use sphinx-style comments like this:
from typing import List, Optional
def foo(bar: int) -> Optional[List]:
    """
    Creates a list. Or not. Note that there must be a blank line after the summary.

    :param bar: The length of the list. If 0, returns None. If there is a very long doc string for an argument, the
        next line must be indented.
    :return: A list with `bar` elements.
    """
Note the blank line after the summary, and the indentation of multi-line documentation for parameters.
Unit testing
DO write unit tests for each new function or class that you add.
DO extend unit tests for existing functions or classes if you change their core behaviour.
DO try your best to write unit tests that are fast. Very often, this can be done by reducing data size to a minimum. Also, it is helpful to avoid long-running integration tests, but try to test at the level of the smallest involved function.
DO ensure that your tests are designed in a way that they can pass on the local machine, even if they are relying on specific cloud features. If required, use unittest.mock to simulate the cloud features, and hence enable the tests to run successfully on your local machine (see the sketch after this list).
DO run all unit tests on your dev machine before submitting your changes. The test suite is designed to pass completely also outside of cloud builds.
DO NOT rely only on the test builds in the cloud (i.e., run test locally before submitting). Cloud builds trigger AzureML runs on GPU machines that have a far higher CO2 footprint than your dev machine.
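As an illustration of mocking a cloud feature, the following minimal sketch patches out the AzureML submission so the test never contacts the cloud. The test name and the mocked return value are illustrative:
from pathlib import Path
from unittest import mock

import health.azure.himl as himl

def test_script_runs_without_cloud() -> None:
    # Replace the real submission function so the test stays local.
    with mock.patch.object(himl, "submit_to_azure_if_needed") as mock_submit:
        mock_submit.return_value = mock.MagicMock(output_folder=Path("outputs"))
        run_info = himl.submit_to_azure_if_needed(compute_cluster_name="does-not-exist")
        assert run_info.output_folder == Path("outputs")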
API
health.azure Package
Functions
- Finds an existing run in an experiment, based on a recovery ID that contains the experiment ID and the actual RunId.
- Sets the environment variables that PyTorch Lightning needs for multi-node training.
- Splits a run ID into the experiment name and the actual run.
- Creates an AzureML run configuration that contains information about environment, multi-node execution, and Docker.
- Creates an AzureML ScriptRunConfig object that holds the information about the snapshot, the entry script, and its arguments.
- Obtains the AzureML workspace from either the passed-in value or the passed-in path.
- Starts an AzureML run on a given workspace, via the script_run_config.
- Submits a folder to Azure, if needed, and runs it.
Classes
- Contains information to use AzureML datasets as inputs or outputs.
- This class stores all information that a script needs to run inside and outside of AzureML.