# HPCExecutor - Run REF on HPCs
High Performance Computing Centers (HPCs) generally do not let users run computationally intensive programs and apps on their login nodes. Users must submit a batch or interactive job into a queue that provisions resources to run user programs. Therefore, the workflow for running REF on HPCs differs from that on users' local computers or workstations.
Here, we introduce the HPCExecutor for running REF on HPC resources.
You could use the HPCExecutor if:
- The login nodes allow users to run a long-lived but lightweight program (e.g., for several hours using less than 25% of one CPU core and negligible memory).
- You want to run REF under the HPC workflow, i.e., by submitting batch jobs.
- The scheduler on your HPC is Slurm or PBS. We may support other schedulers in the future if needed; please open an issue describing your requirements.
## Pre-requirements
Since the compute nodes on HPCs generally do not have an external internet connection, please pre-install the conda virtual environments and pre-fetch and pre-ingest all data needed by the REF on the login node before running the REF on compute nodes with the HPCExecutor. Also, please clean (delete) everything under `${HOME}/.cache/mamba/proc/` (e.g., `rm -rf ${HOME}/.cache/mamba/proc/*`); these files are used by the micromamba installations required by ESMValTool and PMP.
## HPCExecutor
The HPCExecutor uses the Slurm provider and the `srun` launcher from Parsl to submit jobs and run REF on compute nodes. Blocks are the basic computational units of Parsl; a block can include a single compute node or several. Within a node, Parsl creates several workers that run Python apps across multiple cores. Each worker computes one or several diagnostics: if there are more diagnostics than workers, some workers run more than one diagnostic serially (for example, with 58 diagnostics and 16 workers, some workers run four diagnostics one after another). However, the more workers there are, the more overhead Parsl incurs. The steps to use the HPCExecutor are as follows:
- On the login nodes, if possible, open a `screen` or `tmux` session to keep the master process of the HPCExecutor alive. Only the master process runs on the login nodes, using less than 10% of one CPU core and negligible memory; the real computations run on compute nodes through job submissions. If the HPC's queue time is short and the connection to the login nodes is stable, you can use the login nodes directly without a `screen` or `tmux` session.
- Edit the `ref.toml` under the config directory, creating it if it does not exist.
- Change the REF executor to the HPCExecutor (see the configuration sketch after this list).
- Add the configuration or options for the HPCExecutor based on your system and your account, for example for the DOE Perlmutter HPC or NCI Gadi (illustrated in the sketch after this list).
- Run `ref solve` in the `screen` or `tmux` session, or directly on the login nodes. A master process runs on the login nodes to monitor and submit jobs that run the actual diagnostics on compute nodes.
- Check the logs under the `runinfo` directory and `${REF_CONFIGURATION}/log` to see if there are any errors. Once the master process is done, run `ref executions list-groups` to check the results or `ref executions inspect <id>` to see the error messages from the providers.
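The exact contents of `ref.toml` depend on your REF installation, so the following is only a minimal sketch: the executor import path, the `[executor.config]` table name, and the account, partition, QOS, walltime, and worker values are illustrative assumptions that you should replace with the values for your own system and allocation. The option names themselves (`scheduler`, `account`, `partition`, `qos`, `req_nodes`, and so on) are described in the lists below.

```toml
# Minimal sketch of ref.toml for the HPCExecutor.
# The executor path and the [executor.config] table name are assumptions;
# check your REF installation for the exact layout.
[executor]
executor = "climate_ref.executor.hpc.HPCExecutor"

# Illustrative values in the style of a Slurm system such as DOE/NERSC Perlmutter.
[executor.config]
scheduler = "slurm"
account = "m0000"         # placeholder account/project to be charged
username = "myuser"       # placeholder user name
partition = "regular"     # placeholder Slurm partition
qos = "regular"           # placeholder quality of service
req_nodes = 1             # number of nodes requested for the REF run
walltime = "02:00:00"
max_workers_per_node = 64
cores_per_worker = 1
validation = true         # validate the options above with pyslurm
```

On NCI Gadi, which uses the PBS scheduler, the same table would use `scheduler = "pbs"` together with the project and queue names of your Gadi allocation; note that, as described below, the options are currently validated for Slurm only.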
Note
If more `sbatch` directives are needed, they can be added to `scheduler_options` with `\n` as a separator.
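For example, assuming the `[executor.config]` layout sketched above, two extra `#SBATCH` directives could be passed as a single string (the constraint and mail settings here are placeholders):

```toml
# Hypothetical extra #SBATCH directives, joined with "\n"
scheduler_options = "#SBATCH --constraint=cpu\n#SBATCH --mail-type=END"
```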
There are other configurations available, for example:
- `log_dir`: to save the Parsl log and the stderr and stdout from the job
- `cores_per_worker`: its default is 1, but if the providers use dask or OpenMP, its value can be larger
- `mem_per_worker`: generally does not need to be set
- `max_workers_per_node`: to set the maximum number of workers on a node
- `overrides`: to add additional options for `srun`
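A sketch of how a few of these options fit together, again assuming the `[executor.config]` table from above (all values are placeholders):

```toml
log_dir = "runinfo"            # where Parsl writes its logs and the job stdout/stderr
cores_per_worker = 4           # useful when a provider uses dask or OpenMP internally
max_workers_per_node = 16      # cap the number of workers on each node
overrides = "--cpu-bind=cores" # extra options passed to srun
```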
The HPCExecutor configuration options are currently validated for the Slurm scheduler only. They include:
- `scheduler`: str, the HPC job scheduler name, either `slurm` or `pbs`
- `account`: str, the account name to be charged
- `username`: str, the name of the user running the REF
- `partition`: str, the Slurm partition name
- `qos`: str, the Slurm quality of service
- `req_nodes`: int, the number of nodes requested for the REF run
- `validation`: boolean, true to validate the above options with pyslurm
The following HPCExecutor options are passed directly to Parsl; please refer to the Parsl documentation for details:
- `log_dir`: str, default="run_info"
- `cores_per_worker`: int, default=1
- `mem_per_worker`: float
- `max_workers_per_node`: int, default=16
- `walltime`: str, default="00:30:00"
- `scheduler_options`: str
- `retries`: int, default=2
- `max_blocks`: int, default=1
- `worker_init`: str
- `overrides`: str
- `cmd_timeout`: int, default=120
- `cpu_affinity`: str, default="none"
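One option worth highlighting is `worker_init`, which holds shell commands that each block runs before starting its workers; it is a natural place to activate the pre-installed environment from the pre-requirements section. The module and environment names below are placeholders:

```toml
# Hypothetical worker_init: shell commands run on each block before the workers start
worker_init = "module load python; conda activate ref-env"
```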
## Performance benchmarking
Because the HPCExecutor distributes the computation across workers, overall runtime is generally determined by the slowest diagnostic running on a single core and by the overhead introduced by Parsl. However, the HPCExecutor should scale well as the number of diagnostics grows and as diagnostic providers implement fine-grained parallelism (dask, OpenMP, and MPI) in the future. As the table below shows, the overhead of Parsl is almost negligible as the number of workers increases.
### Single node performance (58 diagnostics)
Determining the optimal number of workers and cores per worker is crucial for performance. The following table shows the performance of the HPCExecutor on the NERSC Perlmutter system with 58 diagnostics, using different numbers of workers and cores per worker.
Please run your own benchmarks to find the optimal configuration for your system and diagnostics.
| HPC | Executor | CPU | No. of Workers | No. of Cores/Worker | Time (minutes) |
|---|---|---|---|---|---|
| NERSC | LocalExecutor | AMD EPYC 7763 | 58 | 1 | 18.2 |
| NERSC | HPCExecutor | AMD EPYC 7763 | 64 | 1 | 16.3 |
| NERSC | HPCExecutor | AMD EPYC 7763 | 32 | 1 | 18.1 |
| NERSC | HPCExecutor | AMD EPYC 7763 | 16 | 1 | 28.6 |