What?


Research computing is the collection of computing, software, and storage resources and services that allow for data analysis at scale.


In our case, we are interested in leveraging research computing to augment stock assessment workflows.

Run more/bigger models in less time

Why?

Improve efficiency by running 10s - 1000s of models ‘simultaneously’.

2021 Southwest Pacific Ocean swordfish stock assessment

9,300 model runs totalling ~46 months of computation time.

Why?


  • Efficiency
  • Knowledge acquisition
  • Automation, transparency, reproducibility & portability
  • Multi-model inference

Software containers

Better science

How?

High-throughput computing (HTC)

  • Set up to handle running many jobs simultaneously
  • Ideal for running short, small, independent (embarrassingly parallel) jobs.

High-performance computing (HPC)

  • Can handle HTC workflows (in theory)
  • Can also handle long running, large, multi-processor jobs (true parallel processing)

2024 North Pacific shortfin mako shark assessment: Used HTC resources to complete ~4 months of Bayesian simulation-estimation evaluations (18,000 model runs using RStan) in ~3 hours (1027 \(\times\) faster) during a working group meeting.

Example: Fitting a large spatiotemporal model in R using TMB required 128 CPUs and 1 TB of RAM.

How?

Available resources

High-throughput computing (HTC)

High-performance computing (HPC)

Photo credit: NOAA

OpenScienceGrid (OSG): OSPool

How?

OpenScienceGrid (OSG)

  • Uses HTCondor distributed computing network (no shared file system between compute nodes) to implement HTC workflows
  • Free to use for US-based researchers affiliated with an academic or government organization and using OSG for research/education efforts
  • Should not be used to analyze protected data

NOAA Hera

  • Uses Slurm to schedule HPC (or HTC) workflows
  • Shared file system between compute nodes
  • Allocation determines access
  • NOAA resource, so there are no restrictions on acceptable use or on analyzing protected data when working on mission-related tasks

Both use software containers

Software containers


Many users may already be working with containers in existing cloud-based workspaces such as GitHub Codespaces or Posit Workbench.

  • Application: set up identical, custom software environments on OSG and Hera
  • Application: “version” analyses by “freezing” packages/libraries (see the sketch below)
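As a minimal sketch of the “freezing” idea, a specific package version can be pinned inside a container definition’s %post section (the package and version shown here are purely illustrative, not part of the example container):

# Illustrative only: pin a specific CRAN package version in %post using
# remotes::install_version() so rebuilding the container reproduces the same library
R -e "remotes::install_version('data.table', version='1.15.4', repos='http://cran.rstudio.com/')"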

Software containers

Apptainer

  • Secure, portable, and reproducible software containers for Linux operating systems
  • Easy to use
  • Doesn’t require root privileges to build, making it ideal for HTC/HPC environments
  • Plays nicely with existing containers (e.g., Docker; see the example below)
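For example, an existing Docker image can be pulled and converted to a .sif directly (a sketch; the rocker/r-ver image and tag are just illustrative):

apptainer pull docker://rocker/r-ver:4.4.0      # converts the Docker image to r-ver_4.4.0.sif
apptainer exec r-ver_4.4.0.sif R --version      # run a command inside the pulled container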

Apptainer

Let’s look at an example (linux-r4ss-v4.def):

Bootstrap: docker
From: ubuntu:20.04

%post
    TZ=Etc/UTC && \
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone
    apt update -y
    apt install -y \
        tzdata \
        curl \
        dos2unix

    apt-get update -y
    apt-get install -y \
            build-essential \
            cmake \
            g++ \
            libssl-dev \
            libssh2-1-dev \
            libcurl4-openssl-dev \
            libfontconfig1-dev \
            libxml2-dev \
            libgit2-dev \
            wget \
            tar \
            coreutils \
            gzip \
            findutils \
            sed \
            gdebi-core \
            locales \
            nano
    
    locale-gen en_US.UTF-8

    export R_VERSION=4.4.0
    curl -O https://cdn.rstudio.com/r/ubuntu-2004/pkgs/r-${R_VERSION}_1_amd64.deb
    gdebi -n r-${R_VERSION}_1_amd64.deb

    ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R
    ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript

    R -e "install.packages('remotes', dependencies=TRUE, repos='http://cran.rstudio.com/')"
    R -e "install.packages('data.table', dependencies=TRUE, repos='http://cran.rstudio.com/')"
    R -e "install.packages('magrittr', dependencies=TRUE, repos='http://cran.rstudio.com/')"
    R -e "install.packages('mvtnorm', dependencies=TRUE, repos='http://cran.rstudio.com/')"
    R -e "remotes::install_github('r4ss/r4ss')"
    R -e "remotes::install_github('PIFSCstockassessments/ss3diags')"

    # record the build date in the runtime environment (double quotes so $NOW expands at build time)
    NOW=`date`
    echo "export build_date=\"${NOW}\"" >> $SINGULARITY_ENVIRONMENT

    mkdir -p /ss_exe
    curl -L -o /ss_exe/ss3_linux https://github.com/nmfs-ost/ss3-source-code/releases/download/v3.30.22.1/ss3_linux
    chmod 755 /ss_exe/ss3_linux

%environment
    export PATH=/ss_exe:$PATH
    
%labels
    Author nicholas.ducharme-barth@noaa.gov
    Version v0.0.4

%help
    This is a Linux (Ubuntu 20.04) container containing Stock Synthesis (version 3.30.22.1), R (version 4.4.0) and the R packages r4ss, ss3diags, data.table, magrittr, and mvtnorm.

Apptainer

Let’s look at an example (linux-r4ss-v4.def):

Build on a Linux system with Apptainer installed:

apptainer build linux-r4ss-v4.sif linux-r4ss-v4.def
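Once built, the container can be spot-checked before copying it to Hera or OSG (a quick sketch; all commands assume the .sif file is in the current directory):

apptainer run-help linux-r4ss-v4.sif              # print the %help text
apptainer exec linux-r4ss-v4.sif R --version      # confirm R 4.4.0 is available
apptainer exec linux-r4ss-v4.sif which ss3_linux  # confirm the SS3 executable is on the PATH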

Let’s walk through an example

In this case we will use NOAA Hera to conduct a quick retrospective analysis of all models in the Stock Synthesis (SS3) testing suite.

More complete documentation of this example can be found on our GitHub website.

Workflow

  1. Create container
  2. Create files/scripts
  3. Upload files
  4. Submit jobs
  5. Download files back to local machine

Workflow - Create files/scripts

[Workflow diagram: local machine (IDE; Terminal A: ssh; Terminal B: scp) connected to Hera (login node, compute nodes 1-3, shared scratch1/ file system).]

IDE: Develop and make job files/scripts

Important

Note that you will need to replace User.Name with your actual NOAA RDHPCS user name and project_name with your specific Hera project name in the following code.
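One way to make those substitutions in bulk is with sed (a hedged sketch; “First.Last” and “my_project” are placeholders for your own values, and the paths assume the example’s directory layout):

# substitute your RDHPCS user name and Hera project name into the template files
sed -i 's/User.Name/First.Last/g; s/project_name/my_project/g' \
    inputs/hera_job_directories.txt slurm_scripts/parallel-submit.sh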

Workflow - Create files/scripts


0. Text file specifying the directories in which to run jobs (one way to generate this list programmatically is sketched after the listing).

hera_job_directories.txt

/scratch1/NMFS/project_name/User.Name/examples/ss3/output/01-BigSkate_2019-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/02-Empirical_Wtatage_Age_Selex-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/03-growth_timevary-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/04-Hake_2018-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/05-Hake_2019_semi_parametric_selex-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/06-platoons-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/07-Sablefish2015-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/08-Simple-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/09-Simple_Lorenzen_tv_trend-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/10-Simple_NoCPUE-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/11-Simple_with_Discard-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/12-Simple_with_DM_sizefreq-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/13-Spinydogfish_2011-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/14-tagging_mirrored_sel-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/15-three_area_nomove-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/16-two_morph_seas_areas-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/17-vermillion_snapper-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/18-vermillion_snapper_F4-0/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/19-BigSkate_2019-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/20-Empirical_Wtatage_Age_Selex-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/21-growth_timevary-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/22-Hake_2018-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/23-Hake_2019_semi_parametric_selex-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/24-platoons-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/25-Sablefish2015-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/26-Simple-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/27-Simple_Lorenzen_tv_trend-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/28-Simple_NoCPUE-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/29-Simple_with_Discard-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/30-Simple_with_DM_sizefreq-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/31-Spinydogfish_2011-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/32-tagging_mirrored_sel-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/33-three_area_nomove-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/34-two_morph_seas_areas-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/35-vermillion_snapper-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/36-vermillion_snapper_F4-1/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/37-BigSkate_2019-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/38-Empirical_Wtatage_Age_Selex-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/39-growth_timevary-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/40-Hake_2018-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/41-Hake_2019_semi_parametric_selex-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/42-platoons-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/43-Sablefish2015-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/44-Simple-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/45-Simple_Lorenzen_tv_trend-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/46-Simple_NoCPUE-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/47-Simple_with_Discard-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/48-Simple_with_DM_sizefreq-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/49-Spinydogfish_2011-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/50-tagging_mirrored_sel-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/51-three_area_nomove-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/52-two_morph_seas_areas-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/53-vermillion_snapper-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/54-vermillion_snapper_F4-2/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/55-BigSkate_2019-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/56-Empirical_Wtatage_Age_Selex-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/57-growth_timevary-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/58-Hake_2018-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/59-Hake_2019_semi_parametric_selex-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/60-platoons-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/61-Sablefish2015-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/62-Simple-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/63-Simple_Lorenzen_tv_trend-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/64-Simple_NoCPUE-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/65-Simple_with_Discard-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/66-Simple_with_DM_sizefreq-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/67-Spinydogfish_2011-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/68-tagging_mirrored_sel-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/69-three_area_nomove-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/70-two_morph_seas_areas-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/71-vermillion_snapper-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/72-vermillion_snapper_F4-3/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/73-BigSkate_2019-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/74-Empirical_Wtatage_Age_Selex-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/75-growth_timevary-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/76-Hake_2018-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/77-Hake_2019_semi_parametric_selex-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/78-platoons-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/79-Sablefish2015-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/80-Simple-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/81-Simple_Lorenzen_tv_trend-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/82-Simple_NoCPUE-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/83-Simple_with_Discard-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/84-Simple_with_DM_sizefreq-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/85-Spinydogfish_2011-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/86-tagging_mirrored_sel-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/87-three_area_nomove-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/88-two_morph_seas_areas-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/89-vermillion_snapper-4/
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/90-vermillion_snapper_F4-4/
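Rather than typing the 90 paths by hand, the list could be generated with a short loop (a sketch, assuming the 18 model folders sit under inputs/models/ and list in the order shown above):

# write one output directory per model x retrospective peel combination
base=/scratch1/NMFS/project_name/User.Name/examples/ss3/output
i=1
for peel in 0 1 2 3 4; do
    for model in $(ls inputs/models/); do
        printf '%s/%02d-%s-%d/\n' "$base" "$i" "$model" "$peel"
        i=$((i+1))
    done
done > inputs/hera_job_directories.txt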

Workflow - Create files/scripts


1. Prepare files for Slurm job execution, specify job requirements and submit the parallel jobs.

parallel-submit.sh

#!/bin/bash

# prep files for slurm job execution
mkdir ./logs
dos2unix ./inputs/hera_job_directories.txt
dos2unix ./slurm_scripts/parallel-job-exec.sh
chmod 777 ./slurm_scripts/parallel-job-exec.sh

# make directory structure
# mkdir -p creates a directory (and any missing parents) for each line in hera_job_directories.txt
xargs -d '\n' mkdir -p -- < ./inputs/hera_job_directories.txt

# Slurm job submission variables
# -A project name
# -t time requested (minutes)
# -q queue type: batch (billed allocation) or windfall
# -N nodes requested (leave at 1; each sbatch line below requests its own node)
# -j number of jobs GNU parallel runs simultaneously per node (limited by the node's total CPUs and available RAM)
# seq job ids assigned to each node (can exceed -j, but only -j run at once, so the more job ids assigned to a node the longer they wait to be executed)
sbatch -A project_name -t 60 -q batch -N 1 --wrap 'set -x; parallel -j 30 -S `scontrol show hostnames "$SLURM_JOB_NODELIST"|paste -sd,` `pwd`/slurm_scripts/parallel-job-exec.sh `pwd` ::: `seq 0 29`; report-mem'
sbatch -A project_name -t 60 -q batch -N 1 --wrap 'set -x; parallel -j 30 -S `scontrol show hostnames "$SLURM_JOB_NODELIST"|paste -sd,` `pwd`/slurm_scripts/parallel-job-exec.sh `pwd` ::: `seq 30 59`; report-mem'
sbatch -A project_name -t 60 -q batch -N 1 --wrap 'set -x; parallel -j 30 -S `scontrol show hostnames "$SLURM_JOB_NODELIST"|paste -sd,` `pwd`/slurm_scripts/parallel-job-exec.sh `pwd` ::: `seq 60 89`; report-mem'
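For comparison only (a sketch, not how this example is configured), the same 90 tasks could instead be submitted as a single native Slurm job array; Slurm then sets SLURM_ARRAY_TASK_ID itself and the %30 suffix caps how many tasks run at once. The Hera-as-HTC benchmark described later appears to have used an array-style submission and ran into queue and job-array limits at larger job counts.

# hypothetical alternative: one job array instead of GNU parallel
# (assumes parallel-job-exec.sh is executable and paths are relative to the submission directory)
sbatch -A project_name -t 60 -q batch --array=0-89%30 \
    --wrap 'set -x; ./slurm_scripts/parallel-job-exec.sh "$(pwd)" "$SLURM_ARRAY_TASK_ID"'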

Workflow - Create files/scripts


2. Define variables to be passed to the software container and a bash wrapper script.

parallel-job-exec.sh

#!/bin/bash
# read directories from a file list

pwd; hostname; date

cd $1
export SLURM_ARRAY_TASK_ID=$2

echo $SLURM_ARRAY_TASK_ID

# define current directory
cwd=$(pwd)

# define paths for singularity container
singularity_container=${cwd}/linux-r4ss-v4.sif

# define variables and paths here to avoid hard coding inside the wrapper script
job_wrapper_script=${cwd}/slurm_scripts/wrapper-r.sh
dir_file=${cwd}/inputs/hera_job_directories.txt
r_script=${cwd}/slurm_scripts/ss3-example-calcs.r
input_data_path=${cwd}/inputs/models/
r_script_name=ss3-example-calcs.r

# change permissions on the scripts to allow them to run
chmod 777 $job_wrapper_script
dos2unix $job_wrapper_script
chmod 777 $r_script
dos2unix $r_script

# run bash wrapper script within singularity environment
singularity exec $singularity_container $job_wrapper_script $SLURM_ARRAY_TASK_ID $dir_file $r_script $input_data_path $r_script_name >& logs/out-parallel.$SLURM_ARRAY_TASK_ID

Workflow - Create files/scripts


3. Control file I/O to the R script, execute the R script, conduct job timing and package outputs.

wrapper-r.sh

#!/bin/bash
echo "Running on host `hostname`"

# rename variables passed into the script
slurm_array_task_id=$1
dir_file=$2

# create an array with all data directories
line_index=$(($slurm_array_task_id+1))
echo ${line_index}
echo $dir_file
rep_dir=$(sed -n ${line_index}p $dir_file) 
echo $rep_dir

# change to target directory
cd ${rep_dir}

# make working directory
mkdir -p working/
cd working/

# copy files to working/
cp $3 .

# define variables for R script
input_data_path=$4

# begin calcs
start=`date +%s`
Rscript $5 $rep_dir $input_data_path 

# end of calcs book-keeping
end=`date +%s`
runtime=$((end-start))
echo $runtime
echo Start $start >  runtime.txt
echo End $end >> runtime.txt
echo Runtime $runtime >> runtime.txt

# Create empty file so that it does not mess up when repacking tar
touch End.tar.gz
# only pack up certain items
tar -czf End.tar.gz ss_report.RData runtime.txt 
# move tar out of working/
cd ..
mv working/End.tar.gz .
# delete working/
rm -r working/

Workflow - Create files/scripts


4. The calculation script modifies the SS3 input files, executes the SS3 model run, and post-processes the output within R.

ss3-example-calcs.r

# where is the job executing
    print(getwd())

# load packages
    library(r4ss)

# get args from bash environment
    args = commandArgs(trailingOnly = TRUE)
    print(args)

# get scenario
    scenario = tail(strsplit(args[1],"/")[[1]],n=1)
    model = strsplit(scenario,"-")[[1]][2]
    peel = as.numeric(strsplit(scenario,"-")[[1]][3])

# copy model files
    model_files = list.files(paste0(args[2],model,"/"),full.names=TRUE)
    file.copy(from=model_files,to=getwd())

# modify starter
    tmp_starter = SS_readstarter()
    tmp_starter$retro_yr = -peel

# write files
    SS_writestarter(tmp_starter, overwrite = TRUE)

# run stock synthesis
    run(exe="ss3_linux")

# extract model output
    ss_report = try(SS_output(dir=getwd()),silent=TRUE) 

# save output
    save(ss_report,file="ss_report.RData")


Workflow - Upload files


Terminal A

ssh -m hmac-sha2-256-etm@openssh.com User.Name@hera-rsa.boulder.rdhpcs.noaa.gov -p22

Workflow - Upload files


Terminal A

# navigate to project directory
cd /scratch1/NMFS/project_name/
# create new directory
mkdir User.Name/
# navigate into new directory
cd User.Name/
# create directory for SLURM scripts and logs
mkdir -p examples/ss3/

Workflow - Upload files


Terminal B

scp -o MACs=hmac-sha2-256-etm@openssh.com examples/hera/ss3/upload.example-ss3.tar.gz User.Name@dtn-hera.fairmont.rdhpcs.noaa.gov:/scratch1/NMFS/project_name/User.Name/examples/ss3/
scp -o MACs=hmac-sha2-256-etm@openssh.com apptainer/linux-r4ss-v4.sif User.Name@dtn-hera.fairmont.rdhpcs.noaa.gov:/scratch1/NMFS/project_name/User.Name/examples/ss3/
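Before jobs can be submitted, the uploaded archive presumably needs to be unpacked on Hera (a sketch for Terminal A, assuming upload.example-ss3.tar.gz contains the inputs/ and slurm_scripts/ directories shown earlier):

cd /scratch1/NMFS/project_name/User.Name/examples/ss3/
tar -xzf upload.example-ss3.tar.gz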

Workflow - Submit jobs


Terminal A

chmod 777 slurm_scripts/parallel-submit.sh
dos2unix slurm_scripts/parallel-submit.sh
./slurm_scripts/parallel-submit.sh
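Submitted jobs can then be monitored from the login node (a sketch using standard Slurm utilities):

squeue -u User.Name                       # jobs still queued or running
sacct -X -o JobID,JobName,State,Elapsed   # state and elapsed time of recent jobs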

Workflow - Download results


Terminal B

scp -o MACs=hmac-sha2-256-etm@openssh.com -r User.Name@dtn-hera.fairmont.rdhpcs.noaa.gov:/scratch1/NMFS/project_name/User.Name/examples/ss3/output/ examples/hera/ss3/
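After downloading, each End.tar.gz can be unpacked in place so the ss_report.RData and runtime.txt files are available locally (a sketch, assuming the output/ directory structure shown earlier):

# extract every results archive next to where it was downloaded
find examples/hera/ss3/output -name End.tar.gz | while read -r f; do
    tar -xzf "$f" -C "$(dirname "$f")"
done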

Workflow - OSG

[Workflow diagram: local machine (IDE; Terminal A: ssh; Terminal B: scp) connected to an OSG access point with its own file storage; jobs are dispatched to OSPool compute nodes 1 through n, each paired with its own file storage rather than a shared file system.]

Documentation of this example using OSG can be found on our GitHub website.

Biggest difference: no shared file system between compute nodes

Example results

The example ran 90 jobs (18 test models \(\times\) 5 runs each; base + 4 peels), with only one job ‘timing out’ at the 1-hour limit.


Excluding the job that timed out, the remaining 89 jobs run on Hera completed 3.15 hours of calculations (2.12 minutes per job) in an elapsed time of 14.48 minutes, or \(\sim\) 13 \(\times\) faster.
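The speedup follows directly from those timings: \( (3.15 \times 60) / 14.48 \approx 13 \).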

Depletion estimates across retrospective peels from the SS3 testing model suite examples. Mohn’s \(\rho\) values are printed in each panel.

Start and stop time for jobs run on Hera, excluding the job that timed out.

Is this ‘time-savings’ worth it?

Benchmark testing


Run the same large job array on both OSG and Hera


Build Apptainer container to replicate an identical software environment on both OSG and Hera


Make sure we can run SS3 and R with non-default packages

Benchmark testing


Using a baseline SS3 file, run 500 alternative models with different fixed parameter values of natural mortality and steepness (uses SS3 and R; r4ss)

Use the delta-MVLN approach to generate uncertainty in predictions so that models can be combined in an ensemble (uses R; ss3diags)

Run 5 retrospective peels (\(t-1\) to \(t-5\)) for each of the 500 models and calculate Mohn’s \(\rho\)


Note: Retrospective peels were treated as separate jobs for the purpose of the benchmark testing, giving 3000 unique jobs.

Stock status plots from the model ensemble: (A) spawning biomass relative to the spawning biomass at MSY, and (B) fishing mortality relative to the fishing mortality at MSY. The median is shown as the solid line; the darker band shows the 50th percentile and the lighter band the 80th percentile.

Mohn’s \(\rho\) for alternative parameter combinations of natural mortality (M) and steepness (h). The solid black lines denote the original M (vertical line) and steepness (horizontal line).

Benchmark testing - OSG

All 3000 jobs hit the queue at the same time from a single array job submission and began executing almost immediately


A small fraction of jobs had failed file transfers and had to be relaunched


Total computation time was \(\sim\) 25 days, but elapsed time was \(\sim\) 3.5 hours (166 \(\times\) faster)

Start and stop time for jobs run on OpenScienceGrid (OSG).

Benchmark testing - Hera (as HTC)

Queue and job_array_id limits required multiple (staggered) job submissions

A large proportion of jobs suffered from resource competition during the memory/disk-intensive portion of the SS3 calculations (the SD calculations) and produced incomplete outputs.

Note that SS3 does not appear to crash or trigger an error when it runs out of memory/disk; instead, it writes out the available output, which can appear complete.

Start and stop time for jobs run on NOAA Hera.

Benchmark testing - Hera (parallel)

The Hera workflow was re-configured to run parallel jobs within 15 compute nodes using the GNU parallel utility.

All jobs hit the queue at the same time and executed as resources became available within each node.


Each node had 200 jobs allocated to it, spread over 40 CPUs, resulting in the \(\sim\) 5 distinct waves of job execution.

Total computation time for the 3000 jobs was \(\sim\) 25.5 days, but elapsed time was \(\sim\) 1.75 hours (\(\sim\) 350 \(\times\) faster)

2998 of 2999 (99.97%) completed models produced output identical to OSG

Start and stop time for jobs run on NOAA Hera using the GNU parallel utility.

Benchmark testing - Summary


We have working proof-of-concept workflows for both OSG and Hera that can result in substantial time savings


Analyses are portable between computing solutions with minimal modifications to workflows


OSG and Hera have different strengths, weaknesses, and constraints so analysts can choose which may be better suited for different tasks

Is this ‘time-savings’ worth it?

Getting started!

OpenScienceGrid (OSG)


  1. Reach out to OSPool staff and apply for access.

  2. Run some models!

NOAA Hera


  1. Apply for an RDHPCS account at the Account Information Management (AIM) website (CAC login required).

  2. Request access to an RDHPCS project. First-time users can request access to the htc4sa project to test out Hera before requesting their own project allocation.

  3. Run some models!

Closing thoughts


  • Limitations and bottlenecks
  • What have we not discussed?
  • What about the cloud?
  • Integrating with Open Science workflows

Acknowledgements

Thank you to Howard Townsend for helping get the RDHPCS htc4sa project off the ground.


Thank you also to Help Desk staff at both OSG OSPool and NOAA RDHPCS for helping troubleshoot and refine job array workflows using HTCondor and Slurm, respectively.


OSG workflow development and benchmark testing was conducted using services provided by the OSG Consortium (OSG 2006, 2015; Pordes et al. 2007; Sfiligoi et al. 2009), which is supported by the National Science Foundation awards #2030508 and #1836650.


Slide design influenced by Emil Hvitfeldt’s published examples.

Contact us

Nicholas Ducharme-Barth

nicholas.ducharme-barth at noaa.gov

Megumi Oshima

megumi.oshima at noaa.gov

References

OSG. 2006. “OSPool.” OSG. https://doi.org/10.21231/906P-4D78.
———. 2015. “Open Science Data Federation.” OSG. https://doi.org/10.21231/0KVZ-VE57.
Pordes, Ruth, Don Petravick, Bill Kramer, Doug Olson, Miron Livny, Alain Roy, Paul Avery, et al. 2007. “The Open Science Grid.” In J. Phys. Conf. Ser., 78:012057. 78th Series. https://doi.org/10.1088/1742-6596/78/1/012057.
Sfiligoi, Igor, Daniel C Bradley, Burt Holzman, Parag Mhashilkar, Sanjay Padhi, and Frank Wurthwein. 2009. “The Pilot Way to Grid Resources Using glideinWMS.” In 2009 WRI World Congress on Computer Science and Information Engineering, 2:428–32. 2nd Series. https://doi.org/10.1109/CSIE.2009.950.