Running an array of SS3 jobs on Hera
Similar to the array_lm example, this example also sets up an array job on Hera. As before, we will use a *.txt file to indicate which directories we want to run jobs in as part of our array.
There are a few main differences that serve to illustrate useful modifications to the workflow:
- in this case we will run Stock Synthesis (SS3) on all models in the SS3 testing suite to conduct a retrospective analysis;
- we will set up the job array to run using the GNU parallel utility in order to more effectively implement an HTC-type workflow on an HPC system;
- we will run our jobs within a software container;
- lastly, we will define variables to be passed between job submission and job execution scripts to ensure that the correct output is produced.
The hera/ss3 example can be set up either by cloning the repository (git clone https://github.com/MOshima-PIFSC/NSASS-HTC-HPC-Computing.git) or by stepping through the following code:
Note: throughout this tutorial we are using User.Name as a stand-in for your actual username and NMFS/project_name as a stand-in for your project. In all cases, replace User.Name with your actual user name and NMFS/project_name with your specific project name.
1 Build software container
Software containers enable portable, reproducible research by letting researchers set up a software environment to their exact specifications and run it on any Linux system. The Apptainer container system is widely used across HPC/HTC systems, and makes it easy to build a container from a definition file. Running a job within a container means that you are able to replicate an identical software environment in any location with Apptainer installed, no matter the native operating system, software, and installed packages. The Apptainer container can be built from any Linux machine with Apptainer installed, including the Open Science Grid (OSG) access points. Here we walk through the steps needed to build a Linux (Ubuntu 20.04) container containing Stock Synthesis (version 3.30.22.1), R (version 4.4.0), and the R packages r4ss, ss3diags, data.table, magrittr, and mvtnorm from a definition file, linux-r4ss-v4.def. In this case we will show the steps needed to build the container using the OSG access point as our Linux virtual machine (VM), though this may not be needed if working from an alternative Linux VM.
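A definition file is a plain-text recipe: a header that names the base image, followed by sections such as %post that contain the commands run at build time. The skeleton below illustrates that general structure only; it is not the contents of linux-r4ss-v4.def, which installs the specific R, SS3, and package versions listed above.
Bootstrap: docker
From: ubuntu:20.04

%post
    # build-time commands go here, e.g. installing system libraries,
    # R 4.4.0, the listed R packages, and the SS3 3.30.22.1 executable
    apt-get update -y
    apt-get install -y --no-install-recommends wget ca-certificates

%labels
    Description Skeleton Apptainer definition file (illustrative only)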
Note: you will have to change apXX to match your OSG access point (e.g., ap20 or ap21).
The first step is to log onto your OSG access point via ssh using a Terminal/PowerShell window and make a directory to build your container in. In this case, we are creating the directory singularity1.
ssh User.Name@apXX.uc.osg-htc.org
mkdir -p singularity/linux_r4ss
Using a second Terminal/PowerShell window, navigate to the directory that you cloned the NSASS-HTC-HPC-Computing repo into and upload the definition file (linux-r4ss-v4.def) to the directory you just created on OSG.
scp apptainer/linux-r4ss-v4.def User.Name@apXX.uc.osg-htc.org:/home/User.Name/singularity/linux_r4ss
Back in your first Terminal/PowerShell window, navigate into the directory and build the container2. The second line of code below builds the Singularity Image File (.sif) and takes two arguments: the name of the output .sif file and the input definition file (.def).
cd singularity/linux_r4ss
apptainer build linux-r4ss-v4.sif linux-r4ss-v4.def
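If you want to confirm the build before downloading it, you can execute a quick check inside the container from the same directory; for example:
# confirm the R version and that r4ss is installed inside the container
apptainer exec linux-r4ss-v4.sif R --version
apptainer exec linux-r4ss-v4.sif Rscript -e 'packageVersion("r4ss")'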
Using the second Terminal/PowerShell window, download the Singularity Image File (.sif) so that it can be uploaded for use on the NOAA Hera HPC system.
scp User.Name@apXX.uc.osg-htc.org:/home/User.Name/singularity/linux_r4ss/linux-r4ss-v4.sif apptainer/
2 Set up data inputs and directories
Given that our example is to run a 4-year retrospective analysis for each of the SS3 test models, the next step is downloading the SS3 test models from the nmfs-stock-synthesis/test-models GitHub repo. Once you've downloaded the test models, copy the models/ directory into a new example directory ss3/inputs/ within the NSASS-HTC-HPC-Computing/examples/hera/ directory on your machine. If you cloned the NSASS-HTC-HPC-Computing repo, the SS3 test models will already be in the correct location.
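If you are fetching the test models manually, one way to do it from a terminal (assuming git is available and the repository is at github.com/nmfs-stock-synthesis/test-models) is:
# clone the test models and copy them into the example's inputs/ directory
git clone https://github.com/nmfs-stock-synthesis/test-models.git
cp -r test-models/models NSASS-HTC-HPC-Computing/examples/hera/ss3/inputs/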
For the sake of example, the job array will be set up to run each retrospective peel (e.g., -0 years, -1 year, … , -4 years of data) as an individual job in the job array. This is more efficient in a true HTC environment such as OSG; on Hera, however, it could make more sense to bundle the initial model run and subsequent retrospective peels into a single job. We will store the results of each retrospective peel in its own directory. The directories on Hera will be listed in a text file, and we will use this text file to launch jobs on Hera (as a part of the job array) in each of the named directories.
Let us define that text file using R.
- Define a relative path; we are starting from the root directory of this project.
proj_dir = this.path::this.proj()
hera_project = "NMFS/project_name/User.Name/"
- Write a text file containing the full path names for where the directories will be on Hera.
test_models = list.dirs(paste0(proj_dir,"/examples/hera/ss3/inputs/models/"),recursive=FALSE,full.names=FALSE)
retro_peels = 0:4

# replace '-' with '_' in model names since we will use '-' as a delimiter
if(length(grep("-",test_models,fixed=TRUE))>0){
  test_models_new = gsub("-","_",test_models)
  rename_models_idx = grep("-",test_models,fixed=TRUE)
  for(i in seq_along(rename_models_idx)){
    # create new dir
    dir.create(paste0(proj_dir,"/examples/hera/ss3/inputs/models/",test_models_new[rename_models_idx[i]]),recursive=TRUE)
    # copy files
    file.copy(paste0(proj_dir,"/examples/hera/ss3/inputs/models/",test_models[rename_models_idx[i]],"/",list.files(paste0(proj_dir,"/examples/hera/ss3/inputs/models/",test_models[rename_models_idx[i]]),full.names=FALSE,recursive=FALSE)),paste0(proj_dir,"/examples/hera/ss3/inputs/models/",test_models_new[rename_models_idx[i]]))
    # delete old dir
    # file.remove(paste0(proj_dir,"/examples/hera/ss3/inputs/models/",test_models[rename_models_idx[i]],"/"))
    shell(paste0("powershell rm -r ",proj_dir,"/examples/hera/ss3/inputs/models/",test_models[rename_models_idx[i]],"/"))
  }
  test_models = test_models_new
}

# define scenarios
scenario_df = expand.grid(model=test_models,peel=retro_peels)
scenario_df$run_id = 1:nrow(scenario_df)
scenario_df = scenario_df[,c(3,1,2)]
scenario_df$run_id = ifelse(scenario_df$run_id<10,paste0(0,scenario_df$run_id),as.character(scenario_df$run_id))

# write text file
hera_dir_lines = paste0("/scratch1/", hera_project, "examples/ss3/output/", apply(scenario_df,1,paste0,collapse="-"), "/")
writeLines(hera_dir_lines, con=paste0(proj_dir, "/examples/hera/ss3/inputs/hera_job_directories.txt"))
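Each line of hera_job_directories.txt is the full path to one job directory, with the directory name built as run_id-model-peel. For example, if one of the test models is named Simple (actual names depend on the contents of the test-models repo), a line might read:
/scratch1/NMFS/project_name/User.Name/examples/ss3/output/01-Simple-0/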
3 Prepare job scripts
During benchmark testing, issues were identified when trying to apply an HTC workflow to Hera, and a separate workflow was developed which may be more Hera/Slurm appropriate. This takes advantage of the GNU parallel utility to run batches of jobs on distinct compute nodes. In this particular example, 90 models will be run across 3 nodes, each using 30 CPUs, and we will set the maximum run time to 1 hour. This reserves the entire node for computations, thus reducing competition for resources for any one job3. In order to execute this workflow, instructions are coordinated using four nested scripts (a minimal sketch of the GNU parallel invocation is shown after the list):
- parallel-submit.sh: This script prepares files for Slurm job execution, makes the directory structure specified by hera_job_directories.txt, specifies the job requirements, and submits the parallel jobs.
- parallel-job-exec.sh: This script defines variables to be passed to the software container and to a second bash script, wrapper-r.sh.
- wrapper-r.sh: This wrapper script controls file input/output to and from the R script ss3-example-calcs.r, executes the R script, conducts job timing, and tidies up the job working directory.
- ss3-example-calcs.r: This is the actual computation script, which modifies the SS3 input files as needed, executes the appropriate SS3 model run, and conducts any needed post-processing of the output within R.
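As a rough illustration of the approach (a sketch only, not the contents of parallel-submit.sh), within each node's Slurm allocation GNU parallel can read the directory list and launch parallel-job-exec.sh once per directory, keeping up to 30 tasks running at a time:
# illustrative sketch: run one task per directory listed in the text file,
# with at most 30 running concurrently on the node
parallel --jobs 30 --arg-file inputs/hera_job_directories.txt ./slurm_scripts/parallel-job-exec.sh {}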
In parallel-submit.sh you will need to change the following before you upload and run the script:
- Lines 20-22: change the account project_name to the name of your project. If you are using the NOAA htc4sa project, it would be htc4sa.
- From within R, compress the ss3/inputs/ and ss3/slurm_scripts/ directories into a tar.gz file, upload.example-ss3.tar.gz. This reduces the number of steps needed for file transfers.
shell(paste0("powershell cd ", file.path(proj_dir, "examples", "hera", "ss3"), ";tar -czf upload.example-ss3.tar.gz inputs/ slurm_scripts/"))
4 Hera workflow
- Connect to Hera
Open a PowerShell terminal and connect to Hera. This terminal will be your remote workstation; call it Terminal A. You will be prompted for your RSA passcode, which is your password followed by the 8-digit code from the authenticator app.
ssh -m hmac-sha2-256-etm@openssh.com User.Name@hera-rsa.boulder.rdhpcs.noaa.gov -p22
- Create directories
In Terminal A, navigate to the project directory on scratch1 and create some directories. If using a shared directory such as htc4sa/, make a directory to save your work within this directory (e.g., User.Name/). Change your working directory to this directory, and make a directory for the current project, examples/ss3/4.
# navigate to project directory
cd /scratch1/NMFS/project_name/
# create new directory
mkdir User.Name/
# navigate into new directory
cd User.Name/
# create directory for SLURM scripts and logs
mkdir -p examples/ss3/
- Transfer files
Open a second PowerShell terminal in the NSASS-HTC-HPC-Computing directory on your machine. This will be your local workstation; call it Terminal B. Use this terminal window to upload the needed files (examples/hera/ss3/upload.example-ss3.tar.gz and apptainer/linux-r4ss-v4.sif) to Hera via scp. The upload.example-ss3.tar.gz file will be uploaded to your directory within the project directory on scratch1. Make sure your VPN is active when attempting to upload using the DTN. You will be prompted for your RSA passcode after each scp command. Note that you will need to specify the MAC protocol for the scp file transfer, similar to what was done for the initial ssh connection, using scp -o MACs=hmac-sha2-256-etm@openssh.com.
scp -o MACs=hmac-sha2-256-etm@openssh.com examples/hera/ss3/upload.example-ss3.tar.gz User.Name@dtn-hera.fairmont.rdhpcs.noaa.gov:/scratch1/NMFS/project_name/User.Name/examples/ss3/
scp -o MACs=hmac-sha2-256-etm@openssh.com apptainer/linux-r4ss-v4.sif User.Name@dtn-hera.fairmont.rdhpcs.noaa.gov:/scratch1/NMFS/project_name/User.Name/examples/ss3/
- Prepare files and submit job on Hera
In Terminal A, un-tar upload.example-ss3.tar.gz, change the permissions/line endings for slurm_scripts/parallel-submit.sh, and execute the script.
tar -xzf upload.example-ss3.tar.gz
chmod 777 slurm_scripts/parallel-submit.sh
dos2unix slurm_scripts/parallel-submit.sh
./slurm_scripts/parallel-submit.sh
After job submission, you can check on job status using squeue -u $USER, or you can use the following for more detailed information.
# count the number of output files (End.tar.gz) that have been produced
find . -type f -name End.tar.gz -exec echo . \; | wc -l
# list the size and location of all of the End.tar.gz files
find . -type f -name End.tar.gz -exec du -ch {} +
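Once jobs have finished, Slurm's accounting command can also be used to summarize them (assuming job accounting is enabled on the system):
# summarize your recent jobs, including elapsed time and final state
sacct -u $USER --format=JobID,JobName,Elapsed,State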
- Download jobs and clean up workspace
Once all jobs are completed (or the job has hit its time limit), use your Terminal B to download your jobs.
scp -o MACs=hmac-sha2-256-etm@openssh.com -r User.Name@dtn-hera.fairmont.rdhpcs.noaa.gov:/scratch1/NMFS/project_name/User.Name/examples/ss3/output/ examples/hera/ss3/
Lastly, in Terminal A, clean up the /scratch1/NMFS/project_name/User.Name/ directory since it is a shared space.
Make sure you have verified that your jobs completed successfully and that all results have been downloaded before cleaning up the directory.
# move back up a level in the directory structure
cd ..
# delete the ss3/ directory
rm -r ss3/
5 Process results
After results are downloaded, they can be processed in R to extract the model run times, the time series of estimated biomass for each model run, and Mohn's rho across retrospective peels for a given model 'family'.
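As implemented in the processing code below, Mohn's rho is calculated on estimated depletion (spawning biomass relative to unfished spawning biomass): for each peel, the estimate in that peel's terminal year is compared with the full model's estimate for the same year, and the relative differences are averaged,
\[
\rho = \frac{1}{4}\sum_{j=1}^{4}\frac{\hat{d}_{j}(T-j)-\hat{d}_{0}(T-j)}{\hat{d}_{0}(T-j)},
\]
where \(\hat{d}_{j}(y)\) is the depletion estimated for year \(y\) by the run with \(j\) years of data removed and \(T\) is the terminal year of the full run.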
# iterate over output files and extract quantities
library(data.table)
library(magrittr)
library(r4ss)
output_dirs = list.dirs(paste0(proj_dir,"/examples/hera/ss3/output/"),recursive=FALSE,full.names=FALSE)
ssb_dt.list = comptime_dt.list = as.list(rep(NA,length(output_dirs)))
ss_output_list = as.list(rep(NA,length(output_dirs)))
names(ss_output_list) = output_dirs

for(i in seq_along(output_dirs)){
  tmp_model = strsplit(output_dirs[i],"-")[[1]][2]
  tmp_peel = as.numeric(strsplit(output_dirs[i],"-")[[1]][3])
  tmp_index = as.numeric(strsplit(output_dirs[i],"-")[[1]][1])

  # check if the End.tar.gz file got created
  if(file.exists(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/End.tar.gz"))){
    # get snapshot of original files in the directory
    tmp_orig_files = list.files(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/"))

    # un-tar if the End.tar.gz file gets made
    shell(paste0("powershell cd ", paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/"), ";tar -xzf End.tar.gz"))

    # check if runtime.txt was produced and extract output
    if(file.exists(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/runtime.txt"))){
      tmp_time = readLines(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/runtime.txt")) %>%
                 gsub(".*?([0-9]+).*", "\\1", .) %>%
                 as.numeric(.) %>%
                 as.data.table(.) %>%
                 setnames(.,".","time")
      comptime_dt.list[[i]] = data.table(id = output_dirs[i])
      comptime_dt.list[[i]]$index = tmp_index
      comptime_dt.list[[i]]$model = tmp_model
      comptime_dt.list[[i]]$peel = tmp_peel
      comptime_dt.list[[i]]$hera_start = as.POSIXct(tmp_time$time[1],origin="1970-01-01")
      comptime_dt.list[[i]]$hera_end = as.POSIXct(tmp_time$time[2],origin="1970-01-01")
      comptime_dt.list[[i]]$hera_runtime = tmp_time$time[3]/60

      # clean-up
      rm(list=c("tmp_time"))
    }

    # if "ss_report.RData" is produced put it into the storage list
    if(file.exists(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/ss_report.RData"))){
      load(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/ss_report.RData"))
      ss_output_list[[i]] = ss_report

      ssb_dt.list[[i]] = ss_report$derived_quants %>%
                         as.data.table(.) %>%
                         .[Label %in% paste0("SSB_", ss_report$startyr:ss_report$endyr)] %>%
                         .[,id := output_dirs[i]] %>%
                         .[,sbo := Value/subset(ss_report$derived_quants,Label=="SSB_Virgin")$Value] %>%
                         .[,yr := sapply(Label,function(x)as.numeric(strsplit(x,"_")[[1]][2]))] %>%
                         .[,.(id,yr,sbo)]

      # clean-up
      rm(list=c("ss_report"))
    }

    # clean-up
    file.remove(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/",setdiff(list.files(paste0(proj_dir,"/examples/hera/ss3/output/",output_dirs[i],"/")),tmp_orig_files)))
    rm(list=c("tmp_orig_files"))
  } else {
    comptime_dt.list[[i]] = data.table(id=output_dirs[i],index=tmp_index,model=tmp_model,peel=tmp_peel,hera_start=NA,hera_end=NA,hera_runtime=NA)
    ssb_dt.list[[i]] = data.table(id=output_dirs[i],yr=2023,sbo=NA)
  }

  # clean-up
  rm(list=c("tmp_model","tmp_peel","tmp_index"))
}

comptime_dt = rbindlist(na.omit(comptime_dt.list))
ssb_dt = rbindlist(ssb_dt.list) %>% merge(comptime_dt[,.(id,index,model,peel)],.,by="id")
ss_output_list = na.omit(ss_output_list)

# adjust times to account for the fact that model 79 did not finish within the 1 hour allocation
comptime_dt$hera_start[79] = comptime_dt$hera_start[80]
comptime_dt$hera_end[79] = comptime_dt$hera_start[79] + 60^2
comptime_dt$hera_runtime[79] = 60

# save
fwrite(comptime_dt,file=paste0(proj_dir,"/examples/hera/ss3/output/comptime_dt.csv"))
fwrite(ssb_dt,file=paste0(proj_dir,"/examples/hera/ss3/output/ssb_dt.csv"))
# calculate Mohn's rho
unique_models = unique(comptime_dt$model)
retro_dt.list = as.list(rep(NA,length(unique_models)))

for(i in seq_along(unique_models)){
  tmp_model = unique_models[i]

  retro_dt.list[[i]] = data.table(model=tmp_model)
  retro_dt.list[[i]]$type = c("SBO")
  retro_dt.list[[i]]$rho = NA

  if(uniqueN(na.omit(ssb_dt[model==tmp_model])$peel)==5){
    tmp_dt = ssb_dt[model==tmp_model]
    base_dt = tmp_dt[peel==0]
    year_vec = max(base_dt$yr) - 1:4
    bias_vec = rep(NA,length(year_vec))
    # calc Mohn's rho for runs where all models completed
    for(j in 1:4){
      bias_vec[j] = (ssb_dt[model==tmp_model&peel==j&yr==year_vec[j]]$sbo - base_dt[yr==year_vec[j]]$sbo)/base_dt[yr==year_vec[j]]$sbo
    }
    retro_dt.list[[i]]$rho = mean(bias_vec)
    rm(list=c("tmp_dt","base_dt","year_vec","bias_vec"))
  }
  rm(list=c("tmp_model"))
}

retro_dt = rbindlist(retro_dt.list)
fwrite(retro_dt,file=paste0(proj_dir,"/examples/hera/ss3/output/retro_dt.csv"))
5.1 Job runtime
The 90 jobs run on Hera completed 4.15 hours of calculations (2.77 minutes per job) in an elapsed time of 1 hour.
Excluding the job that timed out at the 1-hour limit, the 89 jobs run on Hera completed 3.15 hours of calculations (2.12 minutes per job) in an elapsed time of 14.48 minutes, or \(\sim\)13 times faster (Figure 1).
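(As an arithmetic check on these figures: \(89 \times 2.12 \approx 188.7\) minutes \(\approx 3.15\) hours of computation, and \(188.7/14.48 \approx 13\).)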
library(ggplot2)

# comptime_dt_minus: assumed here to be the run times excluding the job that
# hit the 1-hour limit (row 79, adjusted above)
comptime_dt_minus = comptime_dt[-79]

p = comptime_dt_minus %>%
    .[,.(id,hera_start,hera_end)] %>%
    melt(.,id.vars="id") %>%
    .[,variable:=ifelse(variable%in%c("hera_start"),"start","end")] %>%
    dcast(.,id~variable) %>%
    .[order(start)] %>%
    ggplot() +
    xlab("Time (GMT)") +
    ylab("Job") +
    geom_segment(aes(x=start,xend=end,y=id,yend=id),color="#003087",alpha=0.5,linewidth=2) +
    theme(panel.background = element_rect(fill = "transparent", color = "black", linetype = "solid"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          strip.background = element_rect(fill="transparent"),
          legend.key = element_rect(fill = "transparent"),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank())
p
# save plot
ggsave(
"hera-ss3-elapsed.png",
plot = p,
device = "png",
path = paste0(proj_dir,"/assets/static/"),
width = 8,
height = 4.5,
units = c("in"),
dpi = 300,
bg = "transparent")
6 Example results
6.1 Retrospectives
Retrospective plots of static biomass depletion for the SS3 test models are shown in Figure 2.
text_dt.list = as.list(rep(NA,uniqueN(ssb_dt$model)))
for(i in seq_along(text_dt.list)){
  tmp_dt = ssb_dt[model==unique(ssb_dt$model)[i]]
  tmp_min_yr = min(tmp_dt$yr)
  text_dt.list[[i]] = data.table(model=unique(ssb_dt$model)[i],yr=tmp_min_yr,sbo=0.2,rho=round(retro_dt[model==unique(ssb_dt$model)[i]]$rho,digits=2))
}
text_dt = rbindlist(text_dt.list)

p = ssb_dt %>%
    ggplot() +
    facet_wrap(~model,scales="free_x") +
    xlab("Year") +
    ylab(expression(SB/SB[0])) +
    ylim(0,NA) +
    geom_hline(yintercept=0) +
    geom_path(aes(x=yr,y=sbo,color=as.character(peel),group=id)) +
    geom_text(data=text_dt,aes(x=yr,y=sbo,label=rho),size=3,hjust = 0) +
    viridis::scale_color_viridis("Peel",begin = 0.1,end = 0.8,direction = 1,option = "H",discrete=TRUE) +
    viridis::scale_fill_viridis("Peel",begin = 0.1,end = 0.8,direction = 1,option = "H",discrete=TRUE) +
    theme(panel.background = element_rect(fill = "transparent", color = "black", linetype = "solid"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          strip.background = element_rect(fill="transparent"),
          legend.key = element_rect(fill = "transparent"))
p
# save plot
ggsave(
"hera-ss3-retro.png",
plot = p,
device = "png",
path = paste0(proj_dir,"/assets/static/"),
width = 8,
height = 4.5,
units = c("in"),
dpi = 300,
bg = "transparent")
Footnotes
1. This directory can be named anything that you like; in this case singularity is a legacy name from an earlier version of the code, written before Singularity changed its name to Apptainer.
2. This may take ~10-15 minutes depending on how long it takes to install R packages.
3. While this workflow leads to better scheduling and makes HTC applications possible on Hera, it may not be computationally efficient if users do not make use of all CPUs on a given node. For example, given that an entire compute node is requested, the user's allocation on Hera will be billed for use of all CPUs on that node even if not all are in use.
4. Note that this path should match the path defined in hera_job_directories.txt.