STAT DSCP: CHTC

UW Center for High Throughput Computing (CHTC)

We will use the CHTC for distributed high-throughput computing, running thousands of parallel jobs with the HTCondor software.

To work on the CHTC, log in via "ssh NetID@learn.chtc.wisc.edu" (using your NetID). Here are some of the most important commands:

condor_submit <script.sub> submits the job(s) in script.sub.
condor_q lists my jobs.
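
As a rough illustration of what condor_submit expects, here is a minimal sketch of a job script and a submit file. The file names (hello.sh, hello.sub), the job count, and the resource requests are hypothetical placeholders, not taken from the lecture examples. The condor commands run only where HTCondor is installed, such as on learn.

```shell
# Write a hypothetical job script; each job receives its index as $1.
cat > hello.sh <<'EOF'
#!/bin/bash
echo "Hello from job $1 on $(hostname)"
EOF
chmod +x hello.sh

# Write a hypothetical submit file that queues 3 copies of hello.sh.
# $(Process) is 0, 1, 2 for the three jobs, so each gets its own output file.
cat > hello.sub <<'EOF'
universe = vanilla
executable = hello.sh
arguments = $(Process)
log = hello.log
output = hello_$(Process).out
error = hello_$(Process).err
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue 3
EOF

# Submit and list the jobs only where HTCondor is available.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit hello.sub
    condor_q
else
    echo "condor_submit not found; wrote hello.sh and hello.sub only"
fi
```

Changing "queue 3" to a larger number is the usual way to scale the same script to many jobs, with $(Process) distinguishing them.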

condor_q <NetID> -hold lists the reasons for my held (broken) jobs.
condor_q -better-analyze <JobID> indicates why a job isn't starting.
condor_release <NetID> releases my held jobs back to idle, which can help with transient HTCondor problems.
See condor_q -help for more.

condor_rm <NetID> cancels all jobs belonging to <NetID>.
condor_submit -i <script.sub> runs an interactive job to get a command line on a computing node.
condor_submit_dag <script.dag> runs a computation described by a directed acyclic graph (DAG) as in the sd example, below.
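
To give a feel for what a DAG file looks like, here is a minimal sketch of a split, work, merge pipeline. It is not the DAG from the sd example: the node names, the submit file names (split.sub, work.sub, merge.sub), and the "part" variable are hypothetical. The sketch only writes and prints the file; the submit files it names would have to exist before running condor_submit_dag.

```shell
# Write a hypothetical DAG: one split job, two parallel work jobs, one merge job.
cat > pipeline.dag <<'EOF'
JOB split split.sub
JOB work1 work.sub
JOB work2 work.sub
JOB merge merge.sub
VARS work1 part="1"
VARS work2 part="2"
PARENT split CHILD work1 work2
PARENT work1 work2 CHILD merge
EOF

# Show the DAG description.
cat pipeline.dag

# On learn, once split.sub, work.sub, and merge.sub exist, the DAG would be run with:
#   condor_submit_dag pipeline.dag
```

Each PARENT ... CHILD line says that the child nodes start only after all listed parents finish, and work.sub could refer to $(part) to pick up the value set by VARS.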

Note that after a job runs, any new files the job created on the remote machine are copied to the directory on learn from which you ran condor_submit. (New directories are not copied back.)
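
Since new directories are not copied back, one common workaround is to have the job pack its output directory into a single file before it exits. The lines below are a sketch of how the end of a job's script might do that. The directory name "results" and the file summary.csv are hypothetical stand-ins for real output.

```shell
# Hypothetical end of a job script, running on the computing node.
# Stand-in for the job's real output, written into a new directory.
mkdir -p results
echo "mean,sd" > results/summary.csv
echo "0.1,1.2" >> results/summary.csv

# Pack the directory into one new top-level file, which is the kind of
# output that gets copied back when the job finishes.
tar -czf results.tar.gz results

# List the archive contents as a quick check.
tar -tzf results.tar.gz
```

Back on learn, "tar -xzf results.tar.gz" would unpack the directory again.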

Here are the examples from lecture. Each command below downloads a ".tar" archive, which you can unpack with "tar -xf <file>.tar":

To get the example code from the tiny CHTC examples, run this command:
wget http://www.stat.wisc.edu/~jgillett/DSCP/CHTC/tinyExamples.tar
To get the example code from the CHTC sd (standard deviation) example, run this command:
wget http://www.stat.wisc.edu/~jgillett/DSCP/CHTC/sd.tar
To get the example code from the CHTC calling_R_or_python example, run this command:
wget http://www.stat.wisc.edu/~jgillett/DSCP/CHTC/calling_R_or_python.tar

If you need more disk space, you may ask CHTC for a quota increase, or they or I can help you use their /staging folder for large files.

Here are links to more information:

Running R Jobs on CHTC with Apptainer Containers
DAGMan (for help with Directed Acyclic Graph code)
HTCondor manual (find details on a command here)
CHTC Computing Guides
CHTC help

Your CHTC accounts created for DSCP are temporary.
Ten days after the last class day of the semester, your DSCP accounts may be removed, so copy any files you want to keep to another location before then. I suggest making a ".tar" file of your code and other human-written files (omitting most data files and most output files) and then copying that single file to your own computer.
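
Here is one way the archiving step might look. The project layout (myproject with data and output subdirectories) is a hypothetical stand-in, built here only so the sketch runs anywhere. Adjust the names and the excluded paths to match your own files.

```shell
# Build a small stand-in project (hypothetical names).
mkdir -p myproject/data myproject/output
echo 'echo hi' > myproject/run.sh
echo 'queue' > myproject/run.sub
echo '1,2,3' > myproject/data/big.csv
echo 'log line' > myproject/output/job_0.out

# Archive code and other human-written files, skipping data and job output.
tar --exclude='myproject/data' --exclude='myproject/output' \
    -cf myproject.tar myproject

# List what went into the archive before copying it anywhere.
tar -tf myproject.tar

# Then, from your own computer (not from learn), copy the archive.
# This assumes myproject.tar sits in your home directory on learn:
#   scp NetID@learn.chtc.wisc.edu:myproject.tar .
```

Listing the archive with "tar -tf" first is a cheap way to confirm that nothing important was excluded.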