health_azure Package

class health_azure.AzureRunInfo(input_datasets, output_datasets, mount_contexts, run, is_running_in_azure_ml, output_folder, logs_folder)[source]

This class stores all information that a script needs to run inside and outside of AzureML. It is returned from submit_to_azure_if_needed; the return value depends on whether the script is running inside or outside AzureML.

Please check the source code for detailed documentation for all fields.

input_datasets: List[Optional[pathlib.Path]]

A list of folders that contain all the datasets that the script uses as inputs. Input datasets must be specified when calling submit_to_azure_if_needed. Here, they are made available as Path objects. If no input datasets are specified, the list is empty.

is_running_in_azure_ml: bool

If True, the present script is executing inside AzureML. If False, outside AzureML.

logs_folder: Optional[pathlib.Path]

The folder into which all log files (for example, tensorboard) should be written. All files written to this folder will be uploaded to blob storage regularly during the script run.

mount_contexts: List[azureml.dataprep.fuse.daemon.MountContext]

A list of mount contexts for input datasets when running outside AzureML. There will be a mount context for each input dataset that has no local_folder, has a workspace, and has use_mounting set. This list is maintained only to prevent the mount contexts from being exited before the RunInfo object is deleted.

output_datasets: List[Optional[pathlib.Path]]

A list of folders that contain all the datasets that the script uses as outputs. Output datasets must be specified when calling submit_to_azure_if_needed. Here, they are made available as Path objects. If no output datasets are specified, the list is empty.

output_folder: Optional[pathlib.Path]

The output folder into which all script outputs should be written, if they should be later available in the AzureML portal. Files written to this folder will be uploaded to blob storage at the end of the script run.

run: Optional[azureml.core.run.Run]

An AzureML Run object if the present script is executing inside AzureML, or None if outside of AzureML. The Run object has methods to log metrics, upload files, etc.
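
A typical pattern is to consume these fields right after calling submit_to_azure_if_needed. A minimal sketch (the cluster and dataset names are placeholders):

```python
from health_azure import submit_to_azure_if_needed

run_info = submit_to_azure_if_needed(
    compute_cluster_name="my-cluster",  # placeholder cluster name
    input_datasets=["my_dataset"],      # placeholder dataset name
)
input_folder = run_info.input_datasets[0]  # Path to the first input dataset
if run_info.is_running_in_azure_ml:
    run_info.run.log("job_started", 1.0)  # the Run object exists only in AzureML
```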

class health_azure.DatasetConfig(name, datastore='', overwrite_existing=True, version=None, use_mounting=None, target_folder=None, local_folder=None)[source]

Contains information to use AzureML datasets as inputs or outputs.

Parameters
  • name (str) – The name of the dataset, as it was registered in the AzureML workspace. For output datasets, this will be the name given to the newly created dataset.

  • datastore (str) – The name of the AzureML datastore that holds the dataset. This can be empty if the AzureML workspace has only a single datastore, or if the default datastore should be used.

  • overwrite_existing (bool) – Only applies to uploading datasets. If True, the dataset will be overwritten if it already exists. If False, the dataset creation will fail if the dataset already exists.

  • version (Optional[int]) – The version of the dataset that should be used. This is only used for input datasets. If the version is not specified, the latest version will be used.

  • use_mounting (Optional[bool]) – If True, the dataset will be "mounted": individual files will be read or written on demand over the network. If False, the dataset will be fully downloaded before the job starts (for input datasets), or fully uploaded at job end (for output datasets). Defaults: False (downloading) for datasets that are script inputs, True (mounting) for datasets that are script outputs.

  • target_folder (Union[Path, str, None]) – The folder into which the dataset should be downloaded or mounted. If left empty, a random folder on /tmp will be chosen. Do NOT use “.” as the target_folder.

  • local_folder (Union[Path, str, None]) – The folder on the local machine at which the dataset is available. This is used only for runs outside of AzureML. If this is empty then the target_folder will be used to mount or download the dataset.
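
As an illustration, the following sketch configures an input dataset that is mounted rather than downloaded (the dataset and datastore names are placeholders). Such objects can be passed via the input_datasets or output_datasets arguments of submit_to_azure_if_needed.

```python
from health_azure import DatasetConfig

input_dataset = DatasetConfig(
    name="dataset1",           # name under which the dataset is registered in AzureML
    datastore="my_datastore",  # placeholder datastore name
    use_mounting=True,         # read files on demand instead of downloading upfront
)
```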

to_input_dataset(dataset_index, workspace, strictly_aml_v1, ml_client=None)[source]

Creates a configuration for using an AzureML dataset inside of an AzureML run. This will make the AzureML dataset with the given name available as a named input, using INPUT_0 as the key for dataset index 0.

Parameters
  • workspace (Workspace) – The AzureML workspace to read from.

  • dataset_index (int) – Suffix for using datasets as named inputs; the dataset will be marked INPUT_{index}.

  • strictly_aml_v1 (bool) – If True, use Azure ML SDK v1. Otherwise, attempt to use Azure ML SDK v2.

  • ml_client (Optional[MLClient]) – An Azure MLClient object for interacting with Azure resources.

Return type

Optional[DatasetConsumptionConfig]

to_input_dataset_local(workspace)[source]

Return a local path to the dataset when outside of an AzureML run. If local_folder is supplied, then this is assumed to be a local dataset, and this is returned. Otherwise the dataset is mounted or downloaded to either the target folder or a temporary folder and that is returned. If self.name refers to a v2 dataset, it is not possible to mount the data here, therefore a tuple of Nones will be returned.

Parameters

workspace (Workspace) – The AzureML workspace to read from.

Return type

Tuple[Optional[Path], Optional[MountContext]]

Returns

Tuple of (path to dataset, optional mount context)

to_output_dataset(workspace, dataset_index)[source]

Creates a configuration to write a script output to an AzureML dataset. The name and datastore of this new dataset will be taken from the present object.

Parameters
  • workspace (Workspace) – The AzureML workspace to read from.

  • dataset_index (int) – Suffix for using datasets as named outputs; the dataset will be marked OUTPUT_{index}.

Return type

OutputFileDatasetConfig

Returns

An AzureML OutputFileDatasetConfig object, representing the output dataset.

health_azure.aggregate_hyperdrive_metrics(child_run_arg_name, run_id=None, run=None, keep_metrics=None, aml_workspace=None, workspace_config_path=None)[source]

For a given HyperDriveRun object, or the id of a HyperDriveRun, retrieves the metrics from each of its children and then aggregates them. Optionally filters the metrics logged in the Run by providing a list of metrics to keep. Returns a DataFrame where each column is one child run, and each row is a metric logged by that child run. For example, for a HyperDrive run with 2 children, where each logs epoch, accuracy and loss, the result would look like:

|              | 0               | 1                  |
|--------------|-----------------|--------------------|
| epoch        | [1, 2, 3]       | [1, 2, 3]          |
| accuracy     | [0.7, 0.8, 0.9] | [0.71, 0.82, 0.91] |
| loss         | [0.5, 0.4, 0.3] | [0.45, 0.37, 0.29] |

Here, each column is one of the splits/child runs, and each row is one of the metrics you have logged to the run.

It is possible to log rows and tables in Azure ML by calling run.log_row and run.log_table respectively. In this case, the DataFrame will contain a dictionary entry instead of a list, where the keys are the table columns (or keywords provided to log_row), and the values are the table values. E.g.:

|                | 0                                        | 1                                         |
|----------------|------------------------------------------|-------------------------------------------|
| accuracy_table |{'epoch': [1, 2], 'accuracy': [0.7, 0.8]} | {'epoch': [1, 2], 'accuracy': [0.8, 0.9]} |

It is also possible to log plots in Azure ML by calling run.log_image and passing in a matplotlib plot. In this case, the DataFrame will contain a string representing the path to the artifact that is generated by AML (the saved plot in the Logs & Outputs pane of your run on the AML portal). E.g.:

|                | 0                                       | 1                                     |
|----------------|-----------------------------------------|---------------------------------------|
| accuracy_plot  | aml://artifactId/ExperimentRun/dcid.... | aml://artifactId/ExperimentRun/dcid...|

Parameters
  • child_run_arg_name (str) – The name of the argument given to each child run to denote its position relative to other child runs (e.g. this arg could equal 'child_run_index'; each of your child runs should then expect to receive the arg '--child_run_index' with a value <= the total number of child runs)

  • run (Optional[Run]) – An Azure ML HyperDriveRun object to aggregate the metrics from. Either this or run_id must be provided

  • run_id (Optional[str]) – The id (type: str) of a parent/HyperDrive run. Either this or run must be provided.

  • keep_metrics (Optional[List[str]]) – An optional list of metric names to filter the returned metrics by

  • aml_workspace (Optional[Workspace]) – If run_id is provided, this is an optional AML Workspace object to retrieve the Run from

  • workspace_config_path (Optional[Path]) – If run_id is provided, this is an optional path to a config containing details of the AML Workspace object to retrieve the Run from.

Return type

DataFrame

Returns

A Pandas DataFrame containing the aggregated metrics from each child run
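
A sketch of a typical call, assuming a completed HyperDrive run (the run id and argument name are placeholders):

```python
from health_azure import aggregate_hyperdrive_metrics

metrics_df = aggregate_hyperdrive_metrics(
    child_run_arg_name="crossval_index",  # placeholder child run argument name
    run_id="placeholder_hyperdrive_run_id",
    keep_metrics=["epoch", "accuracy", "loss"],
)
print(metrics_df)  # one column per child run, one row per metric
```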

health_azure.create_aml_run_object(experiment_name, run_name=None, workspace=None, workspace_config_path=None, snapshot_directory=None)[source]

Creates an AzureML Run object in the given workspace, or in the workspace given by the AzureML config file. This Run object can be used to write metrics to AzureML, upload files, etc., when the code is not running in AzureML. After finishing all operations, use run.flush() to write metrics to the cloud, and run.complete() or run.fail() to mark the run as finished.

Example:

>>> run = create_aml_run_object(experiment_name="run_on_my_vm", run_name="try1")
>>> run.log("foo", 1.23)
>>> run.flush()
>>> run.complete()

Parameters
  • experiment_name (str) – The AzureML experiment that should hold the run that will be created.

  • run_name (Optional[str]) – An optional name for the run (this will be used as the display name in the AzureML UI)

  • workspace (Optional[Workspace]) – If provided, use this workspace to create the run in. If not provided, use the workspace specified by the config.json file in the folder or its parent folder(s).

  • workspace_config_path (Optional[Path]) – If not provided with an AzureML workspace, then load one given the information in this config file.

  • snapshot_directory (Union[Path, str, None]) – The folder that should be included as the code snapshot. By default, no snapshot is created (snapshot_directory=None or snapshot_directory=””). Set this to the folder that contains all the code your experiment uses. You can use a file .amlignore to skip specific files or folders, akin to .gitignore

Return type

Run

Returns

An AzureML Run object.

health_azure.create_crossval_hyperdrive_config(num_splits, cross_val_index_arg_name='crossval_index', metric_name='val/loss')[source]

Creates an Azure ML HyperDriveConfig object for running cross validation. Note: this config expects a metric named <metric_name> to be logged in your training script ([see here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#log-metrics-for-hyperparameter-tuning)).

Parameters
  • num_splits (int) – The number of splits for k-fold cross validation

  • cross_val_index_arg_name (str) – The name of the commandline argument that each of the child runs gets, to indicate which split they should work on.

  • metric_name (str) – The name of the metric that the HyperDriveConfig will compare runs by. Please note that it is your responsibility to make sure a metric with this name is logged to the Run in your training script

Return type

HyperDriveConfig

Returns

an Azure ML HyperDriveConfig object
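
For example, a 5-fold cross-validation config that can be passed to submit_to_azure_if_needed via its hyperdrive_config argument:

```python
from health_azure import create_crossval_hyperdrive_config

hyperdrive_config = create_crossval_hyperdrive_config(
    num_splits=5,
    cross_val_index_arg_name="crossval_index",
    metric_name="val/loss",  # your training script must log a metric with this name
)
```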

health_azure.create_run_configuration(workspace, compute_cluster_name, conda_environment_file=None, aml_environment_name='', environment_variables=None, pip_extra_index_url='', private_pip_wheel_path=None, docker_base_image='', docker_shm_size='', num_nodes=1, max_run_duration='', input_datasets=None, output_datasets=None)[source]

Creates an AzureML run configuration that contains information about the environment, multi-node execution, and Docker.

Parameters
  • workspace (Workspace) – The AzureML Workspace to use.

  • aml_environment_name (str) – The name of an AzureML environment that should be used to submit the script. If not provided, an environment will be created from the arguments to this function (conda_environment_file, pip_extra_index_url, environment_variables, docker_base_image)

  • max_run_duration (str) – The maximum runtime that is allowed for this job in AzureML. This is given as a floating point number with a string suffix s, m, h, d for seconds, minutes, hours, days. Examples: '3.5h', '2d'

  • compute_cluster_name (str) – The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines.

  • conda_environment_file (Optional[Path]) – The conda configuration file that describes which packages are necessary for your script to run.

  • environment_variables (Optional[Dict[str, str]]) – The environment variables that should be set when running in AzureML.

  • docker_base_image (str) – The Docker base image that should be used when creating a new Docker image.

  • docker_shm_size (str) – The Docker shared memory size that should be used when creating a new Docker image.

  • pip_extra_index_url (str) – If provided, use this PIP package index to find additional packages when building the Docker image.

  • private_pip_wheel_path (Optional[Path]) – If provided, add this wheel as a private package to the AzureML workspace.

  • input_datasets (Optional[List[DatasetConfig]]) – The script will consume all data in these folders in blob storage as input. The folders must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in these folders as AzureML datasets.

  • output_datasets (Optional[List[DatasetConfig]]) – The script will create a temporary folder when running in AzureML; the data that the job writes to that folder will be uploaded to blob storage, in the given datastore.

  • num_nodes (int) – The number of nodes to use in distributed training on AzureML.

Return type

RunConfiguration

Returns

A RunConfiguration object with the given settings.

health_azure.create_script_run(script_params, snapshot_root_directory=None, entry_script=None)[source]

Creates an AzureML ScriptRunConfig object, that holds the information about the snapshot, the entry script, and its arguments.

Parameters
  • script_params (List[str]) – A list of parameters to pass on to the script as it runs in AzureML. This is a required argument. Script parameters can be generated using the _get_script_params() function.

  • snapshot_root_directory (Optional[Path]) – The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over.

  • entry_script (Union[Path, str, None]) – The script that should be run in AzureML. If None, the current main Python file will be executed.

Return type

ScriptRunConfig

Returns

A ScriptRunConfig object with the given settings.

health_azure.download_checkpoints_from_run_id(run_id, checkpoint_path_or_folder, output_folder, aml_workspace=None, workspace_config_path=None)[source]

Given an Azure ML run id, download all files from a given checkpoint directory within that run, to the path specified by output_folder. If running in AML, the current workspace will be used. Otherwise, if neither aml_workspace nor workspace_config_path is provided, the function will try to locate a config.json file in any of the parent folders of the current working directory.

Parameters
  • run_id (str) – The id of the run to download checkpoints from

  • checkpoint_path_or_folder (str) – The path to either a single checkpoint file, or a directory of checkpoints within the run files. If a folder is provided, all files within it will be downloaded.

  • output_folder (Path) – The path to which the checkpoints should be stored

  • aml_workspace (Optional[Workspace]) – Optional AML workspace object

  • workspace_config_path (Optional[Path]) – Optional workspace config file

Return type

None

health_azure.download_files_from_run_id(run_id, output_folder, prefix='', workspace=None, workspace_config_path=None, validate_checksum=False)[source]

For a given Azure ML run id, first retrieve the Run, and then download all files, optionally only those that start with a given prefix. E.g. if the Run creates a folder called "outputs" from which you wish to download all files, specify prefix="outputs". To download all files associated with the run, leave prefix empty.

If not running inside AML and neither a workspace nor the config file are provided, the code will try to locate a config.json file in any of the parent folders of the current working directory. If that succeeds, that config.json file will be used to instantiate the workspace.

If this function is called in a distributed PyTorch training script, the files will only be downloaded once per node (i.e., by all processes where is_local_rank_zero() == True). All processes will exit this function once all downloads are completed.

Parameters
  • run_id (str) – The id of the Azure ML Run

  • output_folder (Path) – Local directory to which the Run files should be downloaded.

  • prefix (str) – Optional prefix to filter Run files by

  • workspace (Optional[Workspace]) – Optional Azure ML Workspace object

  • workspace_config_path (Optional[Path]) – Optional path to settings for Azure ML Workspace

  • validate_checksum (bool) – Whether to validate the content from HTTP response

Return type

None
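
A sketch of downloading everything under a run's "outputs" folder (the run id is a placeholder):

```python
from pathlib import Path
from health_azure import download_files_from_run_id

download_files_from_run_id(
    run_id="placeholder_run_id",
    output_folder=Path("downloads"),
    prefix="outputs",  # only files whose paths start with "outputs" are downloaded
)
```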

health_azure.download_from_datastore(datastore_name, file_prefix, output_folder, aml_workspace=None, workspace_config_path=None, overwrite=False, show_progress=False)[source]

Download file(s) from an Azure ML Datastore that is registered within a given Workspace. The path to the file(s) to be downloaded, relative to the datastore <datastore_name>, is specified by the parameter file_prefix. Azure will search for files within the Datastore whose paths begin with this string. If you wish to download multiple files from the same folder, set <file_prefix> equal to that folder's path within the Datastore. If you wish to download a single file, include both the path to the folder it resides in, as well as the filename itself. If the relevant file(s) are found, they will be downloaded to the folder specified by <output_folder>. If this directory does not already exist, it will be created. E.g. if your datastore contains the paths ["foo/bar/1.txt", "foo/bar/2.txt"] and you call this function with file_prefix="foo/bar" and output_folder="outputs", you would end up with the files ["outputs/foo/bar/1.txt", "outputs/foo/bar/2.txt"]

If not running inside AML and neither a workspace nor the config file are provided, the code will try to locate a config.json file in any of the parent folders of the current working directory. If that succeeds, that config.json file will be used to instantiate the workspace.

Parameters
  • datastore_name (str) – The name of the Datastore containing the blob to be downloaded. This Datastore itself must be an instance of an AzureBlobDatastore.

  • file_prefix (str) – The prefix to the blob to be downloaded

  • output_folder (Path) – The directory into which the blob should be downloaded

  • aml_workspace (Optional[Workspace]) – Optional Azure ML Workspace object

  • workspace_config_path (Optional[Path]) – Optional path to settings for Azure ML Workspace

  • overwrite (bool) – If True, will overwrite any existing file at the same remote path. If False, will skip any duplicate file.

  • show_progress (bool) – If True, will show the progress of the file download

Return type

None
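
The example from the description above, written out as code (the datastore name is a placeholder):

```python
from pathlib import Path
from health_azure import download_from_datastore

download_from_datastore(
    datastore_name="my_datastore",  # placeholder; must be an AzureBlobDatastore
    file_prefix="foo/bar",
    output_folder=Path("outputs"),
)
# Creates "outputs/foo/bar/1.txt" and "outputs/foo/bar/2.txt" locally.
```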

health_azure.fetch_run(workspace, run_recovery_id)[source]

Finds an existing run in an experiment, based on a recovery ID that contains the experiment ID and the actual RunId. The run can be specified either in the experiment_name:run_id format, or just the run_id.

Parameters
  • workspace (Workspace) – the configured AzureML workspace to search for the experiment.

  • run_recovery_id (str) – The Run to find. Either in the full recovery ID format, experiment_name:run_id or just the run_id

Return type

Run

Returns

The AzureML run.

health_azure.get_most_recent_run(run_recovery_file, workspace)[source]

Gets the name of the most recently executed AzureML run, instantiates that Run object and returns it.

Parameters
  • run_recovery_file (Path) – The path of the run recovery file

  • workspace (Workspace) – Azure ML Workspace

Return type

Run

Returns

The Run

health_azure.get_workspace(aml_workspace=None, workspace_config_path=None)[source]

Retrieve an Azure ML Workspace by going through the following steps:

  1. If the function has been called from inside a run in AzureML, it returns the current AzureML workspace.

  2. If a Workspace object has been provided in the aml_workspace argument, return that.

  3. If a path to a Workspace config file has been provided, load the workspace according to that config file.

  4. If a Workspace config file is present in the current working directory or one of its parents, load the workspace according to that config file.

  5. If the 3 environment variables HIML_RESOURCE_GROUP, HIML_SUBSCRIPTION_ID, and HIML_WORKSPACE_NAME are found, use them to identify the workspace.

If none of the above succeeds, an exception is raised.

Parameters
  • aml_workspace (Optional[Workspace]) – If provided, this is returned as the AzureML Workspace.

  • workspace_config_path (Optional[Path]) – If no AzureML Workspace is provided, then load one using the information in this config file.

Return type

Workspace

Returns

An AzureML workspace.

Raises
  • ValueError – If none of the available options for accessing the workspace succeeds.

  • FileNotFoundError – If the workspace config file is given in workspace_config_path, but is not present.
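
A sketch of the two most common ways of resolving the workspace:

```python
from pathlib import Path
from health_azure import get_workspace

# Search for a config.json in the current working directory or its parents:
workspace = get_workspace()

# Or point at a specific config file downloaded from the Azure portal:
workspace = get_workspace(workspace_config_path=Path("config.json"))
```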

health_azure.health_azure_package_setup()[source]

Set up the Python packages where needed. In particular, reduce the logging level for some of the used libraries, which are particularly talkative in DEBUG mode. Usually when running in DEBUG mode, we want diagnostics about the model building itself, but not for the underlying libraries.

Return type

None

health_azure.is_running_in_azure_ml(aml_run=<azureml.core.run._OfflineRun object>)[source]

Returns True if the given run is inside of an AzureML machine, or False if it is on a machine outside AzureML. When called without arguments, this function returns True if the present code is running in AzureML. Note that in runs with compute_target='local' this function will also return True. Such runs execute outside of AzureML, but are able to log all their metrics, etc. to an AzureML run.

Parameters

aml_run (Run) – The run to check. If omitted, use the default run in RUN_CONTEXT

Return type

bool

Returns

True if the given run is inside of an AzureML machine, or False if it is on a machine outside AzureML.

health_azure.object_to_yaml(o)[source]

Converts an object to a YAML string representation. This is done by recursively traversing all attributes and writing them out to YAML if they are basic datatypes.

Parameters

o (Any) – The object to inspect.

Return type

str

Returns

A string in YAML format.

health_azure.set_environment_variables_for_multi_node()[source]

Sets the environment variables that PyTorch Lightning needs for multi-node training.

Return type

None

health_azure.set_logging_levels(levels)[source]

Sets the logging levels for the given module-level loggers.

Parameters

levels (Dict[str, int]) – A mapping from module name to desired logging level.

Return type

None

health_azure.split_recovery_id(id_str)[source]

Splits a run ID into the experiment name and the actual run. The argument can be in the format ‘experiment_name:run_id’, or just a run ID like user_branch_abcde12_123. In the latter case, everything before the last two alphanumeric parts is assumed to be the experiment name.

Parameters

id_str (str) – The string run ID.

Return type

Tuple[str, str]

Returns

experiment name and run name
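
For example:

```python
from health_azure import split_recovery_id

experiment, run = split_recovery_id("my_experiment:my_run_123")
# experiment == "my_experiment", run == "my_run_123"
```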

health_azure.submit_run(workspace, experiment_name, script_run_config, tags=None, wait_for_completion=False, wait_for_completion_show_output=False, display_name=None)[source]

Starts an AzureML run on a given workspace, via the script_run_config.

Parameters
  • workspace (Workspace) – The AzureML workspace to use.

  • experiment_name (str) – The name of the experiment that will be used or created. If the experiment name contains characters that are not valid in Azure, those will be removed.

  • script_run_config (Union[ScriptRunConfig, HyperDriveConfig]) – The settings that describe which script should be run.

  • tags (Optional[Dict[str, str]]) – A dictionary of string key/value pairs, that will be added as metadata to the run. If set to None, a default metadata field will be added that only contains the commandline arguments that started the run.

  • wait_for_completion (bool) – If False (the default), return after the run is submitted to AzureML. If True, wait for the completion of this run.

  • wait_for_completion_show_output (bool) – If wait_for_completion is True, this parameter indicates whether to show the run output on sys.stdout.

  • display_name (Optional[str]) – The name for the run that will be displayed in the AML UI. If not provided, a random display name will be generated by AzureML.

Return type

Run

Returns

An AzureML Run object.
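
A sketch that combines get_workspace, create_script_run and submit_run (the experiment name and script arguments are placeholders):

```python
from health_azure import create_script_run, get_workspace, submit_run

workspace = get_workspace()  # resolved from config.json or environment variables
script_run_config = create_script_run(script_params=["--epochs", "10"])
run = submit_run(
    workspace=workspace,
    experiment_name="my_experiment",  # placeholder experiment name
    script_run_config=script_run_config,
    wait_for_completion=False,  # return immediately after submission
)
```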

health_azure.submit_to_azure_if_needed(compute_cluster_name='', entry_script=None, aml_workspace=None, workspace_config_file=None, ml_client=None, snapshot_root_directory=None, script_params=None, conda_environment_file=None, aml_environment_name='', experiment_name=None, environment_variables=None, pip_extra_index_url='', private_pip_wheel_path=None, docker_base_image='mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:20230509.v1', docker_shm_size='100g', ignored_folders=None, default_datastore='', input_datasets=None, output_datasets=None, num_nodes=1, wait_for_completion=False, wait_for_completion_show_output=False, max_run_duration='', submit_to_azureml=None, tags=None, after_submission=None, hyperdrive_config=None, hyperparam_args=None, strictly_aml_v1=False, identity_based_auth=False, pytorch_processes_per_node_v2=None, use_mpi_run_for_single_node_jobs=True, display_name=None)[source]

Submit a folder to Azure and run the entry script there, if needed. Use the commandline flag --azureml to submit to AzureML, and leave it out to run locally.

Parameters
  • after_submission (Union[Callable[[Run], None], Callable[[Job, MLClient], None], None]) – A function that will be called directly after submitting the job to AzureML. Use this to, for example, add additional tags or print information about the run. When using AzureML SDK V1, the only argument to this function is the Run object that was just submitted. When using AzureML SDK V2, the arguments are (Job, MLClient).

  • tags (Optional[Dict[str, str]]) – A dictionary of string key/value pairs, that will be added as metadata to the run. If set to None, a default metadata field will be added that only contains the commandline arguments that started the run.

  • aml_environment_name (str) – The name of an AzureML environment that should be used to submit the script. If not provided, an environment will be created from the arguments to this function.

  • max_run_duration (str) – The maximum runtime that is allowed for this job in AzureML. This is given as a floating point number with a string suffix s, m, h, d for seconds, minutes, hours, day. Examples: ‘3.5h’, ‘2d’

  • experiment_name (Optional[str]) – The name of the AzureML experiment in which the run should be submitted. If omitted, this is created based on the name of the current script.

  • entry_script (Union[Path, str, None]) – The script that should be run in AzureML

  • compute_cluster_name (str) – The name of the AzureML cluster that should run the job. This can be a cluster with CPU or GPU machines.

  • conda_environment_file (Union[Path, str, None]) – The conda configuration file that describes which packages are necessary for your script to run.

  • aml_workspace (Optional[Workspace]) – There are two optional parameters used to glean an existing AzureML Workspace. The simplest is to pass it in as a parameter.

  • workspace_config_file (Union[Path, str, None]) – The 2nd option is to specify the path to the config.json file downloaded from the Azure portal from which we can retrieve the existing Workspace.

  • ml_client (Optional[MLClient]) – An Azure MLClient object for interacting with Azure resources.

  • snapshot_root_directory (Union[Path, str, None]) – The directory that contains all code that should be packaged and sent to AzureML. All Python code that the script uses must be copied over.

  • ignored_folders (Optional[List[Union[Path, str]]]) – A list of folders to exclude from the snapshot when copying it to AzureML.

  • script_params (Optional[List[str]]) – A list of parameters to pass on to the script as it runs in AzureML. If None (the default), these will be copied over from sys.argv (excluding the --azureml flag, if found).

  • environment_variables (Optional[Dict[str, str]]) – The environment variables that should be set when running in AzureML.

  • docker_base_image (str) – The Docker base image that should be used when creating a new Docker image. The list of available images can be found here: https://github.com/Azure/AzureML-Containers. The default image is mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:20230509.v1.

  • docker_shm_size (str) – The Docker shared memory size that should be used when creating a new Docker image. Default value is ‘100g’.

  • pip_extra_index_url (str) – If provided, use this PIP package index to find additional packages when building the Docker image.

  • private_pip_wheel_path (Union[Path, str, None]) – If provided, add this wheel as a private package to the AzureML workspace.

  • default_datastore (str) – The data store in your AzureML workspace, that points to your training data in blob storage. This is described in more detail in the README.

  • input_datasets (Optional[List[Union[str, DatasetConfig]]]) – The script will consume all data in these folders in blob storage as input. The folders must exist in blob storage, in the location that you gave when creating the datastore. Once the script has run, it will also register the data in these folders as AzureML datasets.

  • output_datasets (Optional[List[Union[str, DatasetConfig]]]) – The script will create a temporary folder when running in AzureML; the data that the job writes to that folder will be uploaded to blob storage, in the given datastore.

  • num_nodes (int) – The number of nodes to use in distributed training on AzureML. When using a value > 1, multiple nodes in AzureML will be started. If pytorch_processes_per_node_v2=None, the job will be submitted as a multi-node MPI job, with 1 process per node. This is suitable for PyTorch Lightning jobs. If pytorch_processes_per_node_v2 is not None, a job with framework “PyTorch” and communication backend “nccl” will be started. pytorch_processes_per_node_v2 will guide the number of processes per node. This is suitable for plain PyTorch training jobs without the use of frameworks like PyTorch Lightning.

  • wait_for_completion (bool) – If False (the default), return after the run is submitted to AzureML. If True, wait for the completion of this run.

  • wait_for_completion_show_output (bool) – If wait_for_completion is True, this parameter indicates whether to show the run output on sys.stdout.

  • submit_to_azureml (Optional[bool]) – If True, the codepath to create an AzureML run will be executed. If False, the codepath for local execution (i.e., return immediately) will be executed. If not provided (None), submission to AzureML will be triggered if the commandline flag '--azureml' is present in sys.argv.

  • hyperdrive_config (Optional[HyperDriveConfig]) – A configuration object for Hyperdrive (hyperparameter search).

  • strictly_aml_v1 (bool) – If True, use Azure ML SDK v1. Otherwise, attempt to use Azure ML SDK v2.

  • pytorch_processes_per_node_v2 (Optional[int]) – For plain PyTorch multi-GPU processing: the number of processes per node. This is only supported with AML SDK v2, and ignored in v1. If supplied, the job will be submitted using the "PyTorch" framework (rather than "Python"), and using "nccl" as the communication backend.

  • use_mpi_run_for_single_node_jobs (bool) – If True, even single node jobs with SDK v2 will be run as distributed MPI jobs. This is required for Kubernetes compute. If False, single node jobs will not be run as distributed jobs. This setting only affects jobs submitted with SDK v2 (when strictly_aml_v1=False).

  • display_name (Optional[str]) – The name for the run that will be displayed in the AML UI. If not provided, a random display name will be generated by AzureML.

Return type

AzureRunInfo

Returns

If the script is submitted to AzureML, the Python process terminates, since the script will be executed in AzureML; otherwise an AzureRunInfo object is returned.
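
A typical entry point looks like the following sketch (the cluster, environment file, and dataset names are placeholders). Invoking the script with the --azureml flag submits it to AzureML; without the flag, it runs locally:

```python
from health_azure import submit_to_azure_if_needed

def main() -> None:
    run_info = submit_to_azure_if_needed(
        compute_cluster_name="my-gpu-cluster",     # placeholder cluster name
        conda_environment_file="environment.yml",  # placeholder environment file
        input_datasets=["my_dataset"],             # placeholder dataset name
    )
    # From here on, the code runs either locally or inside AzureML.
    data_folder = run_info.input_datasets[0]
    print(f"Reading data from {data_folder}")

if __name__ == "__main__":
    main()
```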

health_azure.torch_barrier()[source]

This is a barrier to use in distributed jobs. Use it to make all processes that participate in a distributed PyTorch job wait for each other. When torch.distributed is not set up or not found, the function exits immediately.

Return type

None

health_azure.upload_to_datastore(datastore_name, local_data_folder, remote_path, aml_workspace=None, workspace_config_path=None, overwrite=False, show_progress=False)[source]

Upload a folder to an Azure ML Datastore that is registered within a given Workspace. Note that this will upload all files within the folder, but will not copy the folder itself. E.g. if you specify local_data_folder="foo/bar" and that contains the files ["1.txt", "2.txt"], and you specify remote_path="baz", you would see the following paths uploaded to your Datastore: ["baz/1.txt", "baz/2.txt"]

If not running inside AML and neither a workspace nor the config file are provided, the code will try to locate a config.json file in any of the parent folders of the current working directory. If that succeeds, that config.json file will be used to instantiate the workspace.

Parameters
  • datastore_name (str) – The name of the Datastore to which the blob should be uploaded. This Datastore itself must be an instance of an AzureBlobDatastore

  • local_data_folder (Path) – The path to the local directory containing the data to be uploaded

  • remote_path (Path) – The path to which the blob should be uploaded

  • aml_workspace (Optional[Workspace]) – Optional Azure ML Workspace object

  • workspace_config_path (Optional[Path]) – Optional path to settings for Azure ML Workspace

  • overwrite (bool) – If True, will overwrite any existing file at the same remote path. If False, will skip any duplicate files and continue to the next.

  • show_progress (bool) – If True, will show the progress of the file upload

Return type

None
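
The example from the description above, written out as code (the datastore name is a placeholder):

```python
from pathlib import Path
from health_azure import upload_to_datastore

upload_to_datastore(
    datastore_name="my_datastore",      # placeholder; must be an AzureBlobDatastore
    local_data_folder=Path("foo/bar"),  # the folder's contents are uploaded
    remote_path=Path("baz"),            # files end up as "baz/1.txt", "baz/2.txt"
)
```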

health_azure.write_yaml_to_object(o, yaml_string, strict=False)[source]

Writes a serialized object in YAML format back into an object, assuming that the attributes of the object and the YAML field names are in sync.

Parameters
  • o (Any) – The object to write to.

  • yaml_string (str) – A YAML formatted string with attribute names and values.

  • strict (bool) – If True, any mismatch of field names will raise a ValueError. If False, only a warning will be printed. Note that the object may have been modified even if an error is raised.

Return type

None
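
object_to_yaml and write_yaml_to_object form a round trip. A minimal sketch with a hypothetical config object:

```python
from health_azure import object_to_yaml, write_yaml_to_object

class MyConfig:  # hypothetical object with basic-datatype attributes
    def __init__(self) -> None:
        self.learning_rate = 1e-3
        self.batch_size = 16

config = MyConfig()
yaml_string = object_to_yaml(config)  # serialize the attributes to YAML
write_yaml_to_object(config, yaml_string, strict=True)  # write the values back
```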

Functions

create_aml_run_object(experiment_name[, …])

Creates an AzureML Run object in the given workspace, or in the workspace given by the AzureML config file.

create_run_configuration(workspace, …[, …])

Creates an AzureML run configuration, that contains information about environment, multi node execution, and Docker.

create_script_run(script_params[, …])

Creates an AzureML ScriptRunConfig object, that holds the information about the snapshot, the entry script, and its arguments.

download_files_from_run_id(run_id, output_folder)

For a given Azure ML run id, first retrieve the Run, and then download all files, optionally only those that start with a given prefix.

download_checkpoints_from_run_id(run_id, …)

Given an Azure ML run id, download all files from a given checkpoint directory within that run, to the path specified by output_folder.

download_from_datastore(datastore_name, …)

Download file(s) from an Azure ML Datastore that are registered within a given Workspace.

fetch_run(workspace, run_recovery_id)

Finds an existing run in an experiment, based on a recovery ID that contains the experiment ID and the actual RunId.

get_most_recent_run(run_recovery_file, workspace)

Gets the name of the most recently executed AzureML run, instantiates that Run object and returns it.

get_workspace([aml_workspace, …])

Retrieve an Azure ML Workspace through a series of fallback options.

is_running_in_azure_ml([aml_run])

Returns True if the given run is inside of an AzureML machine, or False if it is on a machine outside AzureML.

set_environment_variables_for_multi_node()

Sets the environment variables that PyTorch Lightning needs for multi-node training.

split_recovery_id(id_str)

Splits a run ID into the experiment name and the actual run.

submit_run(workspace, experiment_name, …)

Starts an AzureML run on a given workspace, via the script_run_config.

submit_to_azure_if_needed([…])

Submit a folder to Azure and run the entry script there, if needed.

torch_barrier()

This is a barrier to use in distributed jobs.

upload_to_datastore(datastore_name, …[, …])

Upload a folder to an Azure ML Datastore that is registered within a given Workspace.

create_crossval_hyperdrive_config(num_splits)

Creates an Azure ML HyperDriveConfig object for running cross validation.

aggregate_hyperdrive_metrics(child_run_arg_name)

For a given HyperDriveRun object, or the id of a HyperDriveRun, retrieves the metrics from each of its children and then aggregates them.

object_to_yaml(o)

Converts an object to a YAML string representation.

write_yaml_to_object(o, yaml_string[, strict])

Writes a serialized object in YAML format back into an object, assuming that the attributes of the object and the YAML field names are in sync.

health_azure_package_setup()

Set up the Python packages where needed.

set_logging_levels(levels)

Sets the logging levels for the given module-level loggers.

Classes

AzureRunInfo(input_datasets, …)

This class stores all information that a script needs to run inside and outside of AzureML.

DatasetConfig(name[, datastore, …])

Contains information to use AzureML datasets as inputs or outputs.

health_ml Package

class health_ml.Runner(project_root)[source]

This class contains the high-level logic to start a training run: choose a model configuration by name, submit to AzureML if needed, or otherwise start the actual training and test loop.

Parameters

project_root (Path) – The root folder that contains all of the source code that should be executed.
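
A minimal sketch of invoking the runner programmatically (hi-ml scripts usually do this from their commandline entry point; the project root is a placeholder):

```python
from pathlib import Path
from health_ml import Runner

runner = Runner(project_root=Path.cwd())
container, run_info = runner.run()  # returns (LightningContainer, AzureRunInfo)
```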

additional_run_tags(script_params)[source]

Gets the set of tags that will be added to the AzureML run as metadata.

Parameters

script_params (List[str]) – The commandline arguments used to invoke the present script.

Return type

Dict[str, str]

parse_and_load_model()[source]

Parses the command line arguments, and creates configuration objects for the model itself, and for the Azure-related parameters. Sets self.experiment_config to its proper values. Returns the parser output from parsing the model commandline arguments.

Return type

ParserResult

Returns

ParserResult object containing args, overrides and settings

run()[source]

The main entry point for training and testing models from the commandline. This chooses a model to train via a commandline argument, runs training or testing, and writes all required info to disk and logs.

Return type

Tuple[LightningContainer, AzureRunInfo]

Returns

a tuple of the LightningContainer object and an AzureRunInfo containing all information about the present run (whether running in AzureML or not)

run_in_situ(azure_run_info)[source]

Actually run the AzureML job; this method will typically run on an Azure VM.

Parameters

azure_run_info (AzureRunInfo) – Contains all information about the present run in AzureML, in particular where the datasets are mounted.

Return type

None

submit_to_azureml_if_needed()[source]

Submit a job to AzureML, returning the resulting Run object, or exiting if we were asked to wait for completion and the Run did not succeed.

Return type

AzureRunInfo

Returns

an AzureRunInfo object containing all of the details of the present run. If AzureML is not specified, the attribute 'run' will be None, but the object still contains helpful information about datasets etc.

validate()[source]

Runs sanity checks on the whole experiment.

Return type

None

class health_ml.TrainingRunner(experiment_config, container, project_root=None)[source]

Driver class to run an ML experiment. Note that the project root argument MUST be supplied when using hi-ml as a package!

Parameters
  • experiment_config (ExperimentConfig) – The ExperimentConfig object to use for training.

  • container (LightningContainer) – The LightningContainer object to use for training.

  • project_root (Optional[Path]) – Project root. This should only be omitted if calling run_ml from the test suite. Supplying it is crucial when using hi-ml as a package or submodule!

after_ddp_cleanup(environ_before_training)[source]

Runs process cleanup after the DDP context to prepare for single-device inference. Kills all DDP processes besides rank 0.

Return type

None

end_training(environ_before_training)[source]

Cleanup after training is done. This is called after the trainer has finished fitting the data, to update the checkpoint handler state and remove redundant checkpoint files. If running inference on a single device, it also kills all processes besides rank 0.

Return type

None

get_data_module()[source]

Reads the datamodule that should be used for training or validation from the container. This must be overridden in subclasses.

Return type

LightningDataModule

init_inference()[source]

Prepare the trainer for running inference on the validation and test set. This chooses a checkpoint, initializes the PL Trainer object, and chooses the right data module. The hook for running an extra validation epoch (LightningContainer.on_run_extra_validation_epoch) is called first, to reflect any changes to the model or datamodule states before running inference.

Return type

None

init_training()[source]

Execute some bookkeeping tasks only once if running distributed and initialize the runner’s trainer object.

Return type

None

is_crossval_disabled_or_child_0()[source]

Returns True if the present run is a non-cross-validation run, or child run 0 of a cross-validation run.

Return type

bool

run()[source]

Driver function to run an ML experiment.

Return type

None

run_training()[source]

The main training loop. It creates the PyTorch model based on the configuration options passed in, creates a PyTorch Lightning trainer, and trains the model. If a checkpoint was specified, the checkpoint is loaded before training resumes. The cwd is changed to the outputs folder so that the model can write to the current working directory while everything still ends up in the right place in AzureML (only the contents of the "outputs" folder are treated as result files).

Return type

None

run_validation()[source]

Run validation on the validation set for all models, to save time/memory-consuming outputs. This is done in inference-only mode, or when the user has requested an extra validation epoch. The cwd is changed to the outputs folder.

Return type

None

Classes

TrainingRunner(experiment_config, container)

Driver class to run an ML experiment.

Runner(project_root)

This class contains the high-level logic to start a training run: choose a model configuration by name, submit to AzureML if needed, or otherwise start the actual training and test loop.