Submitting an array job with OSG

Authors

Nicholas Ducharme-Barth

Megumi Oshima

Last updated

2024-11-06

In this example we step through submitting an array job on OSG where we want to run the same job in a number of directories. In this case the job is running a simple R script that reads in the data.csv file stored in the directory, fits a linear model, and writes the parameter estimates to a par.csv. We specify which directories we want to run jobs in as a part of the job-array using a text file to specify the directory path names on OSG.

The osg/array_lm example can be set up either by cloning the repository (git clone https://github.com/MOshima-PIFSC/NSASS-HTC-HPC-Computing.git) or by stepping through the following code:

Coding alert!

Note: throughout this tutorial we are using User.Name as a stand-in for your actual username and osg.project_name as a stand-in for your project. In all cases, replace User.Name with your actual user name and osg.project_name with your specific project name.

1 Setup data inputs and directories

Define a relative path. We start from the root directory of this project.

proj_dir = this.path::this.proj()

Define directory names for each run.

dir_name = paste0("rep_0", 0:9)

Iterate across directories, create them, and then write a simple .csv file into them containing data to fit a linear model.

for(i in seq_along(dir_name)){
    
    if(!file.exists(file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", dir_name[i], "data.csv")))
    {
        set.seed(i)
        dir.create(file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", dir_name[i]), recursive=TRUE)
        tmp = data.frame(x=1:1000)
        tmp$y = (i + (0.5*i)*tmp$x) + rnorm(1000,0,i)
        write.csv(tmp, file = file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", dir_name[i], "data.csv"))
    }
}

Write an R script to read in the data, run a linear model, and report back the estimated parameters.

if(!file.exists(file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", "run_lm_osg_array.r"))){
    script_lines = c("tmp=read.csv('data.csv')", 
    "fit=lm(y~x,data=tmp)", 
    "out = data.frame(par=unname(fit$coefficients))", 
    "write.csv(out,file='par.csv')"
    )
    writeLines(script_lines, con = file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", "run_lm_osg_array.r"))
}

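Write a text file containing the path names for where the directories will be on OSG. The paths are written relative to a directory two levels below array_lm/, rather than as absolute paths.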

if(!file.exists(file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", "osg_job_directories.txt"))){
    dir_lines = paste0("./../../inputs/", dir_name, "/")
    writeLines(dir_lines, con = file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", "osg_job_directories.txt"))
}
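Because the listed paths are relative, they only resolve from a directory two levels below array_lm/. The sketch below checks this on a throwaway copy of the layout. It assumes the jobs are submitted from scripts/condor_submit (the directory used later in this tutorial), and the three-replicate tree, the demo_root variable, and the dummy data.csv contents are made up for illustration.

```shell
# Build a throwaway copy of the layout (hypothetical, three replicates only).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/array_lm/scripts/condor_submit"
for rep in rep_00 rep_01 rep_02; do
    mkdir -p "$demo_root/array_lm/inputs/$rep"
    printf 'x,y\n1,2\n' > "$demo_root/array_lm/inputs/$rep/data.csv"
    printf './../../inputs/%s/\n' "$rep" >> "$demo_root/array_lm/inputs/osg_job_directories.txt"
done

# Resolve every listed path from the assumed submission directory.
cd "$demo_root/array_lm/scripts/condor_submit" || exit 1
missing=0
while IFS= read -r job_dir; do
    if [ -f "${job_dir}data.csv" ]; then
        echo "ok: $job_dir"
    else
        echo "missing data.csv: $job_dir"
        missing=$((missing + 1))
    fi
done < ../../inputs/osg_job_directories.txt
echo "directories without data.csv: $missing"
```

Running the same loop on the real array_lm/ tree, from the directory you submit from, is a cheap way to catch a misspelled directory name before any job is queued.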

In addition to the input files, you will need two additional scripts: a wrapper script (wrapper.sh) and a submission script (submission.sub). Examples of both can be found in examples/OSG/array_lm/scripts. To easily upload all the necessary files at once, compress the entire array_lm/ directory as a tar.gz file named upload.array_lm.tar.gz.

system(paste0("powershell cd ", file.path(proj_dir, "examples", "OSG", "array_lm"), ";tar -czf upload.array_lm.tar.gz * "))
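If you are working from a bash or zsh terminal rather than through R and PowerShell, a rough equivalent of the packaging step is sketched below, with a listing of the archive contents as a check before upload. The temporary directory, the pack_root variable, and the placeholder files are hypothetical; on a real checkout you would run only the two tar commands from inside array_lm/.

```shell
# Hypothetical miniature of array_lm/ so the commands can be tried safely.
pack_root=$(mktemp -d)
mkdir -p "$pack_root/array_lm/inputs/rep_00" "$pack_root/array_lm/scripts"
printf 'x,y\n1,2\n' > "$pack_root/array_lm/inputs/rep_00/data.csv"
printf '#!/bin/bash\n' > "$pack_root/array_lm/scripts/wrapper.sh"

cd "$pack_root/array_lm" || exit 1
# Name the directories explicitly so the archive cannot try to include itself.
tar -czf upload.array_lm.tar.gz inputs scripts
# List the archive contents without extracting anything.
tar -tzf upload.array_lm.tar.gz
```

Naming inputs and scripts explicitly, instead of using *, is a design choice made here to keep the growing archive out of its own file list.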

2 OSG workflow

Connect to OSG
As mentioned, access to OSG and file transfer is done using a pair of Terminal/PowerShell windows, which we will call Terminal A and Terminal B. In Terminal A, log onto your access point and create a directory for this example.

ssh User.Name@ap21.uc.osg-htc.org
mkdir array_lm

Transfer files

We will upload the compressed file upload.array_lm.tar.gz into the OSG directory that you just created. The following files should be included:

all replicate data files to run the linear models on;
the R script run_lm_osg_array.r;
the text file osg_job_directories.txt with the directory names;
the wrapper script wrapper.sh which unpacks files, sets up job timing, executes the R script, and packages results;
the submission script submission.sub;
and a bash script prep.sh that prepares the files to be run on HTCondor, including changing file permissions, making directory structures, and converting line endings with dos2unix.

For this example, we are using the container that was built in the Hera SS3 example. If you are unsure of how to build a container or access it in OSG please refer back to that example.

Coding alert!

In the submission script submission.sub you will need to change the following before you upload and run the script:

Line 7: Change User.Name to your user name.
Line 15: Change User.Name to your user name and linux-r4ss-v4.sif to the name of your container file. For more information on building containers, see Running an array of SS3 jobs on Hera.
Line 26: Change the project name from osg.project_name to the name of your OSG project.
Lines 30 and 37: Change User.Name in the file path to your user name.
NOTE There cannot be an empty last line in submission.sub. Make sure the script is 34 lines. If in doubt, backspace up to the last letter on the last line.
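The placeholder substitutions and the trailing-line check above can also be scripted. The following is a minimal sketch, not the real submission.sub: the two-line stand-in file, the names my.name and my.project, and the sub_dir variable are all hypothetical. It also assumes that "empty last line" means a trailing blank line; whether the final newline character itself matters is not stated here.

```shell
sub_dir=$(mktemp -d)
# Hypothetical stand-in for submission.sub: NOT the real file contents.
printf 'placeholder line for /home/User.Name/array_lm\nplaceholder line for osg.project_name\n\n' \
    > "$sub_dir/submission.sub"

# Fill in the placeholders (my.name and my.project are made-up values).
sed -e 's/User\.Name/my.name/g' -e 's/osg\.project_name/my.project/g' \
    "$sub_dir/submission.sub" > "$sub_dir/submission.filled"

# Drop any blank lines at the end of the file.
awk '{ lines[NR] = $0 }
     END {
         last = NR
         while (last > 0 && lines[last] ~ /^[ \t\r]*$/) last--
         for (i = 1; i <= last; i++) print lines[i]
     }' "$sub_dir/submission.filled" > "$sub_dir/submission.clean"

# Report what is left to fix, if anything.
leftover=$(grep -E -c 'User\.Name|osg\.project_name' "$sub_dir/submission.clean")
last_line=$(tail -n 1 "$sub_dir/submission.clean")
n_lines=$(wc -l < "$sub_dir/submission.clean" | tr -d ' ')
echo "unreplaced placeholders: $leftover"
echo "number of lines: $n_lines"
echo "last line: $last_line"
```

Comparing the reported number of lines against the expected length of your own submission.sub gives a quick confirmation that no stray lines were added during editing.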

In Terminal B, navigate to the directory where the compressed file is on your local computer and run:

scp upload.array_lm.tar.gz User.Name@ap21.uc.osg-htc.org:/home/User.Name/array_lm

You will be prompted for your passphrase and RSA code before the file transfers. Once the file transfer is complete, go back to Terminal A and you can untar the files by navigating to the array_lm directory and running:

tar -xvf upload.array_lm.tar.gz

Prepare scripts

Still in Terminal A, change the permissions and line endings for prep.sh. Navigate to the scripts/bash directory, make prep.sh executable, convert its line endings, and then execute it to prepare the other scripts as necessary.

# navigate to directory
cd scripts/bash
# change permissions to make the file executable 
chmod 777 prep.sh 
# change line endings 
dos2unix prep.sh 
# run script
./prep.sh
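The line-ending step matters because scripts edited on Windows usually end each line with a carriage return plus a newline, and bash on Linux treats the carriage return as part of the command. The sketch below shows how to detect those characters and what removing them looks like. It uses tr as a stand-in for dos2unix, and the demo.sh file and crlf_dir variable are hypothetical.

```shell
crlf_dir=$(mktemp -d)
# A two-line script written with Windows (CRLF) line endings.
printf '#!/bin/bash\r\necho "hello"\r\n' > "$crlf_dir/demo.sh"

# Count the lines that end in a carriage return.
cr=$(printf '\r')
before=$(grep -c "${cr}\$" "$crlf_dir/demo.sh")

# Strip the carriage returns; tr stands in for dos2unix here.
tr -d '\r' < "$crlf_dir/demo.sh" > "$crlf_dir/demo.unix.sh"
after=$(grep -c "${cr}\$" "$crlf_dir/demo.unix.sh")

echo "lines with CR before: $before, after: $after"
```

The same grep count can be run on prep.sh or wrapper.sh on the access point to confirm that the conversion took effect.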

Submit job

Once you are ready, you can submit the job by running the command condor_submit.

# navigate out of bash directory and into condor_submit directory
cd ../condor_submit
# submit job
condor_submit submission.sub

While your job is running, you can check on it using the following commands:

condor_q shows status of all of your jobs: running, idle, or held
condor_q -run shows running jobs only
condor_q -hold shows jobs that are held

Using these commands you can get the job id number and peek directly at what is happening on the compute node using condor_ssh_to_job <job_id>. This is most useful for looking at longer jobs or for checking that intermediate files are being produced correctly.

For more information on tracking and restarting jobs, see here.

Download results

Once the jobs have completed, you can retrieve the results for further analysis on your local computer. The easiest way to do this is to compress all of the directories in array_lm/inputs. In Terminal A, logged into OSG and from within the array_lm/inputs directory, run:

tar -czf download.array_lm.tar.gz ./rep_*

We use the wildcard character * to indicate that we want to include everything with a name starting with rep_. This will give us all of the replicate directories. Then, on your local computer, create a new directory called outputs to hold the downloaded results. In Terminal B, navigate to the outputs folder and run:

# navigate to outputs folder, assuming you are in array_lm/
cd outputs
# download the compressed results from array_lm/inputs on OSG into current directory
scp -r User.Name@ap21.uc.osg-htc.org:/home/User.Name/array_lm/inputs/download.array_lm.tar.gz ./

Again you will be prompted for your passphrase and RSA code before any files can transfer.

You can then unzip the files by running:

# untar results
tar -xzf download.array_lm.tar.gz

in Terminal B.
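After untarring, it can be worth confirming that every replicate directory came back with its results before moving on to processing. The sketch below assumes, based on the processing code later in this tutorial, that each job leaves an End.tar.gz archive containing par.csv in its directory. The temporary tree, with two complete replicates and one deliberately incomplete one, and the out_dir variable are hypothetical.

```shell
out_dir=$(mktemp -d)
# Hypothetical stand-in results: rep_00 and rep_01 complete, rep_02 not.
for rep in rep_00 rep_01 rep_02; do
    mkdir -p "$out_dir/$rep"
done
for rep in rep_00 rep_01; do
    printf 'par\n1\n2\n' > "$out_dir/$rep/par.csv"
    tar -czf "$out_dir/$rep/End.tar.gz" -C "$out_dir/$rep" par.csv
    rm "$out_dir/$rep/par.csv"
done

# Check each replicate directory for an archive that lists par.csv.
incomplete=""
for rep_path in "$out_dir"/rep_*/; do
    rep_name=$(basename "$rep_path")
    if [ -f "${rep_path}End.tar.gz" ] && tar -tzf "${rep_path}End.tar.gz" | grep -q 'par.csv'; then
        echo "$rep_name: results present"
    else
        echo "$rep_name: End.tar.gz missing or without par.csv"
        incomplete="$incomplete $rep_name"
    fi
done
echo "incomplete replicates:$incomplete"
```

On real output, pointing the loop at your outputs folder instead of the temporary directory lists any replicate that may need to be resubmitted.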

3 Process the output

In R, iterate through the sub-directories of the input and output data to extract the results of the linear model fits, and the model run time information.

library(data.table)
library(magrittr)

input_data.list = as.list(rep(NA,10))
output_data.list = as.list(rep(NA,10))
runtime_data.list = as.list(rep(NA,10))

for(i in seq_along(dir_name)){
    # get input data
        input_data.list[[i]] = fread(file.path(proj_dir, "examples", "OSG", "array_lm", "inputs", dir_name[i],"data.csv")) %>%
            .[,.(x,y)] %>%
            .[,model := factor(as.character(i),levels=as.character(1:10))] %>%
            .[,.(model,x,y)]
    
    # untar results
        system(paste0("powershell cd ", file.path(proj_dir, "examples", "OSG", "array_lm", "outputs", dir_name[i],"/"), ";tar -xzf End.tar.gz"))

    # get output
        output_data.list[[i]] = fread(file.path(proj_dir, "examples", "OSG", "array_lm", "outputs", dir_name[i],"par.csv")) %>%
            .[,.(par)] %>%
            .[,model := factor(as.character(i),levels=as.character(1:10))] %>%
            .[,.(model,par)] %>%
            melt(.,id.vars="model") %>%
            .[,variable:=c("intercept","slope")] %>%
            dcast(.,model ~ variable) %>%
            merge(.,input_data.list[[i]][,.(model,x)],by="model") %>%
            .[,pred_y := intercept+slope*x] %>%
            .[,.(model,x,pred_y)]
    # get time
        runtime_data.list[[i]] = readLines(file.path(proj_dir, "examples", "OSG", "array_lm", "outputs", dir_name[i],"runtime.txt")) %>%
            gsub(".*?([0-9]+).*", "\\1", .) %>%
            as.numeric(.) %>%
            as.data.table(.) %>%
            setnames(.,".","time") %>%
            .[,model := factor(as.character(i),levels=as.character(1:10))] %>%
            melt(.,id.vars="model") %>%
            .[,variable:=c("start","end","runtime")] %>%
            dcast(.,model ~ variable) %>%
            .[,.(model,start,end,runtime)]
}

input_data = rbindlist(input_data.list)
output_data = rbindlist(output_data.list)
runtime_data = rbindlist(runtime_data.list)

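The jobs started execution at 2024-11-02 00:50:03 and all finished by 2024-11-02 00:50:31 for an elapsed runtime of 28 seconds and a total computation time of 6 seconds. Use of OSG resulted in the job completing 0.21× faster (6 seconds of computation divided by 28 seconds elapsed); because this ratio is below 1, the array took longer than running these very short models one after another would have. Figure 1 shows the simulated data and estimated linear fits for each model run in the job-array.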
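The 0.21 figure appears to be the total computation time divided by the elapsed time. That reading is an assumption, but it reproduces the reported number from the times quoted above, as this short worked calculation shows.

```shell
# Elapsed time in seconds between 00:50:03 and 00:50:31 on the same day.
elapsed=$(( (50 * 60 + 31) - (50 * 60 + 3) ))
# Total computation time summed across jobs, as reported in the text.
computation=6
# Assumed definition of the speed-up: computation time / elapsed time.
ratio=$(awk -v c="$computation" -v e="$elapsed" 'BEGIN { printf "%.2f", c / e }')
echo "elapsed: $elapsed s, computation / elapsed = $ratio"
```

A ratio above 1 would indicate that the array finished sooner than running the same jobs back to back.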

library(ggplot2)
input_data %>%
ggplot() +
geom_point(aes(x=x,y=y,fill=model),alpha=0.05,size=5,shape=21) +
geom_line(data=output_data,aes(x=x,y=pred_y,color=model),linewidth=2)