Examples

Note: All examples below contain links to sample scripts that are also included in the repository. The experience is optimized for use on readthedocs. When navigating to the sample scripts in the GitHub UI, you will only see the .rst file that links to the .py file. To access the .py file, go to the folder that contains the respective .rst file.

Basic integration

The sample examples/1/sample.py is a script that takes an optional command line argument of a target value and prints all the prime numbers up to (but not including) this target. It is simply intended to demonstrate a long-running operation that we want to run in Azure. Run it, for example, with:

cd examples/1
python sample.py -n 103
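
For reference, a minimal sketch of what such a script might look like (the actual sample in the repository may differ in detail):

from argparse import ArgumentParser


def main() -> None:
    parser = ArgumentParser()
    parser.add_argument("-n", "--target", type=int, default=100, required=False,
                        help="Print all primes up to (but not including) this value")
    args = parser.parse_args()
    primes = []
    for candidate in range(2, args.target):
        # Trial division against the primes found so far
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
            print(candidate)


if __name__ == "__main__":
    main()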

The sample examples/2/sample.py shows the minimal modifications needed to run this in AzureML. First, create an AzureML workspace and download the config file, as explained here. The config file should be placed in the same folder as the sample script. A sample Conda environment file is supplied. Install the hi-ml package into the current environment. Finally, add the following to the sample script:

from health.azure.himl import submit_to_azure_if_needed
...
def main() -> None:
    _ = submit_to_azure_if_needed(
        compute_cluster_name="lite-testing-ds2",
        wait_for_completion=True,
        wait_for_completion_show_output=True)

Replace lite-testing-ds2 with the name of a compute cluster created within the AzureML workspace. If this script is invoked in the same way as the first sample, e.g.

cd examples/2
python sample.py -n 103

then the output will be exactly the same. But if the script is invoked as follows:

cd examples/2
python sample.py -n 103 --azureml

then the function submit_to_azure_if_needed will perform all the required actions to run this script in AzureML and exit. Note that:

  • code after submit_to_azure_if_needed is not run locally, but it is run in AzureML.

  • the print statement writes to the AzureML console output; its output is available in the 70_driver_log.txt file in the Output + logs tab of the experiment, and can be downloaded from there.

  • the command line arguments are passed through (apart from --azureml) when running in AzureML.

  • a new file, most_recent_run.txt, will be created, containing an identifier of this AzureML run.

A sample script examples/2/results.py demonstrates how to programmatically download the driver log file.
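
For illustration, the download can be done with the plain AzureML SDK, using the run identifier stored in most_recent_run.txt (a sketch assuming that file contains a plain run ID; not necessarily how results.py does it):

from pathlib import Path

from azureml.core import Workspace

run_id = Path("most_recent_run.txt").read_text().strip()
run = Workspace.from_config().get_run(run_id)
# The driver log is stored under azureml-logs on the run
run.download_file(name="azureml-logs/70_driver_log.txt", output_file_path="70_driver_log.txt")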

Output files

The sample examples/3/sample.py demonstrates output file handling when running on AzureML. Because each run is performed on a separate VM or cluster, file output is not generally preserved. To keep the output, it should be written to the outputs folder when running in AzureML. The AzureML infrastructure will preserve files written there, and they will be available for download from the outputs folder in the Output + logs tab.

Make the following additions:

    run_info = submit_to_azure_if_needed(
    ...
    parser.add_argument("-o", "--output", type=str, default="primes.txt", required=False, help="Output file name")
    ...
    output = run_info.output_folder / args.output
    output.write_text("\n".join(map(str, primes)))

When running locally, submit_to_azure_if_needed will create a subfolder called outputs, and the output can be written to the file args.output there. When running in AzureML, the output will be available in the file args.output in the Experiment.

A sample script examples/3/results.py demonstrates how to programmatically download the output file.
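
A possible shape for that download, again with the plain AzureML SDK (a sketch, not necessarily what results.py does):

from pathlib import Path

from azureml.core import Workspace

run = Workspace.from_config().get_run(Path("most_recent_run.txt").read_text().strip())
# Files written to the "outputs" folder are stored under that prefix on the run
run.download_file(name="outputs/primes.txt", output_file_path="primes.txt")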

Output datasets

The sample examples/4/sample.py demonstrates output dataset handling when running on AzureML.

In this case, the following parameters are added to submit_to_azure_if_needed:

    run_info = submit_to_azure_if_needed(
        ...
        default_datastore="himldatasets",
        output_datasets=["himl_sample4_output"],

The default_datastore is required if using the simplest configuration for an output dataset, which just uses the blob container name. There is an alternative that does not require the default_datastore and allows a different datastore for each dataset:

from health.azure.datasets import DatasetConfig
    ...
    run_info = submit_to_azure_if_needed(
        ...
        output_datasets=[DatasetConfig(name="himl_sample4_output", datastore="himldatasets")]

Now the output folder is constructed as follows:

    output_folder = run_info.output_datasets[0] or Path("outputs") / "himl_sample4_output"
    output_folder.mkdir(parents=True, exist_ok=True)
    output = output_folder / args.output

When running in AzureML, run_info.output_datasets[0] will be populated using the new parameter, and the output will be written to that blob storage. When running locally, run_info.output_datasets[0] will be None, and a local folder will be created and used.

A sample script examples/4/results.py demonstrates how to programmatically download the output dataset file.
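
One possible way to fetch the dataset files is with the plain AzureML SDK (a sketch assuming SDK v1, reading the output dataset back as a FileDataset from the datastore folder of the same name):

from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "himldatasets")
# The output dataset was written to a folder of the same name in the datastore
dataset = Dataset.File.from_files(path=(datastore, "himl_sample4_output"))
dataset.download(target_path="himl_sample4_output", overwrite=True)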

For more details about datasets, see here.

Input datasets

This example trains a simple classifier on a toy dataset, first creating the dataset files, and then training the classifier in a second script.

The script examples/5/inputs.py is provided to prepare the CSV files. Run the script to download the Iris dataset and create two CSV files:

cd examples/5
python inputs.py
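
In outline, such a preparation script might look like this (a sketch with hypothetical file names, assuming scikit-learn and pandas are available):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# Write features and labels to two separate CSV files
pd.DataFrame(iris.data, columns=iris.feature_names).to_csv("inputs.csv", index=False)
pd.DataFrame(iris.target, columns=["target"]).to_csv("targets.csv", index=False)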

The training script examples/5/sample.py is adapted from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train_iris.py to work with input CSV files. Run it to train the classifier on the data files that were just written:

cd examples/5
python sample.py
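
The core of such a training script might look as follows (a sketch with hypothetical file names, not the full sample):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = pd.read_csv("inputs.csv").values
y = pd.read_csv("targets.csv").values.ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# The kernel and the penalty C are the hyperparameters tuned in the Hyperdrive example below
model = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(model.score(X_test, y_test)))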

Including input files in the snapshot

When using very small data files (on the order of a few MB), the easiest way to get the input data to Azure is to include them in the set of (source) files that are uploaded to Azure. You can run the dataset creation script on your local machine, writing the resulting two files to the same folder where your training script is located, and then submit the training script to AzureML. Because the dataset files are in the same folder, they will automatically be uploaded to AzureML.

However, it is not ideal to have the input files in the snapshot: the size of the snapshot is limited to 25 MB. It is better to put the data files into blob storage and use input datasets.

Creating the dataset in AzureML

The suggested way of creating a dataset is to run a script in AzureML that writes an output dataset. This is particularly important for large datasets, to avoid the usually low bandwidth from a local machine to the cloud.

This is shown in examples/6/inputs.py: this script prepares the CSV files in an AzureML run and writes them to an output dataset called himl_sample6_input. The relevant code parts are:

run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    default_datastore="himldatasets",
    output_datasets=["himl_sample6_input"])
# The dataset files should be written into this folder:
dataset = run_info.output_datasets[0] or Path("dataset")
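
For illustration, the dataset files could then be written into that folder along these lines (the file names are hypothetical, using pandas and scikit-learn as in the earlier sketch):

dataset.mkdir(parents=True, exist_ok=True)
pd.DataFrame(iris.data, columns=iris.feature_names).to_csv(dataset / "inputs.csv", index=False)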

Run the script:

cd examples/6
python inputs.py --azureml

You can now modify the training script examples/6/sample.py to use the newly created dataset himl_sample6_input as an input. To do that, the following parameters are added to submit_to_azure_if_needed:

run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    default_datastore="himldatasets",
    input_datasets=["himl_sample6_input"])

When running in AzureML, the dataset will be downloaded before running the job. You can access the temporary folder where the dataset is available like this:

input_folder = run_info.input_datasets[0] or Path("dataset")

The part after the or operator is only necessary to keep reasonable behaviour when running outside of AzureML: when running in AzureML, run_info.input_datasets[0] will be populated using the input dataset specified in the call to submit_to_azure_if_needed, and the input will be downloaded from blob storage. When running locally, run_info.input_datasets[0] will be None, and a local folder should be populated and used.

The default_datastore is required if using the simplest configuration for an input dataset. There are alternatives that do not require the default_datastore and allow a different datastore for each dataset, for example:

from health.azure.datasets import DatasetConfig
    ...
    run_info = submit_to_azure_if_needed(
        ...
input_datasets=[DatasetConfig(name="himl_sample7_input", datastore="himldatasets")],

For more details about datasets, see here.

Uploading the input files manually

An alternative to writing the dataset in AzureML (as suggested above) is to create the files on the local machine and upload them directly to Azure blob storage.

This is shown in examples/7/inputs_via_upload.py: this script prepares the CSV files and uploads them to blob storage, into a folder called himl_sample7_input. Run the script:

cd examples/7
python inputs_via_upload.py
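
At its core, such an upload can be done with the AzureML datastore API, for example (a sketch assuming AzureML SDK v1; the file names are hypothetical):

from azureml.core import Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "himldatasets")
# Upload the local CSV files into the folder that will back the input dataset
datastore.upload_files(files=["inputs.csv", "targets.csv"],
                       target_path="himl_sample7_input",
                       overwrite=True)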

As in the above example, you can now modify the training script examples/7/sample.py to use an input dataset that has the same name as the folder to which the files were just uploaded. In this case, the following parameters are added to submit_to_azure_if_needed:

    run_info = submit_to_azure_if_needed(
        ...
        default_datastore="himldatasets",
        input_datasets=["himl_sample7_input"],

Hyperdrive

The sample examples/8/sample.py demonstrates adding hyperparameter tuning. This shows the same hyperparameter search as in the AzureML sample.

Make the following additions:

from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.train.hyperdrive.sampling import RandomParameterSampling
    ...
def main() -> None:
    param_sampling = RandomParameterSampling({
        "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
        "--penalty": choice(0.5, 1, 1.5)
    })

    hyperdrive_config = HyperDriveConfig(
        run_config=ScriptRunConfig(source_directory=""),
        hyperparameter_sampling=param_sampling,
        primary_metric_name='Accuracy',
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=12,
        max_concurrent_runs=4)

    run_info = submit_to_azure_if_needed(
        ...
        hyperdrive_config=hyperdrive_config)

Note that this sample does not make sense to run locally; it should always be run in AzureML. When invoked with:

cd examples/8
python sample.py --azureml

this will perform a Hyperdrive run in AzureML, i.e. there will be 12 child runs, each randomly drawing from the parameter sample space. AzureML can plot the metrics from the child runs, but to do that, some small modifications are required.

Add in:

    run = run_info.run
    ...
    args = parser.parse_args()
    # Log with the built-in str / float; the np.str / np.float aliases used in the
    # original AzureML sample are deprecated and removed in recent NumPy versions.
    run.log('Kernel type', str(args.kernel))
    run.log('Penalty', float(args.penalty))
    ...
    print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
    run.log('Accuracy', float(accuracy))

and these metrics will be displayed on the child runs tab in the Experiment page on AzureML.

Controlling when to submit to AzureML and when not

By default, the hi-ml package assumes that you supply a command line argument --azureml (which can appear anywhere on the command line) to trigger submission of the present script to AzureML. If you wish to control submission via a different flag from your own argument parser, use the submit_to_azureml argument of the function health.azure.himl.submit_to_azure_if_needed.
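
For example, a minimal sketch using a custom flag (the flag name --submit is illustrative):

from argparse import ArgumentParser

from health.azure.himl import submit_to_azure_if_needed

parser = ArgumentParser()
parser.add_argument("--submit", action="store_true", help="Submit this script to AzureML")
args, _ = parser.parse_known_args()
run_info = submit_to_azure_if_needed(
    compute_cluster_name="lite-testing-ds2",
    submit_to_azureml=args.submit)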