Nodes and Access
The Caltech Tier2 and the iBanks GPU cluster share a CEPH filesystem, and each user has their own home directory on CEPH (the default home directory for all users).
SSH key authentication is the only supported login method (administrators additionally use 2FA). Please let the admins know (hep-wheel AT caltech.edu) in case of issues.
Login nodes:
- login-1.ultralight.org, login-2.ultralight.org and login-3.ultralight.org – can be used to access the Caltech Tier2 and GPU clusters. Be aware that login nodes do not have any GPUs attached. (There is an option to use GPU HTCondor scheduling, but it is still work in progress.)
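For reference, a minimal sketch of connecting once your key has been registered (the username and key path below are placeholders; use the account and key you registered with the admins):

$ ssh -i ~/.ssh/id_ed25519 <username>@login-1.ultralight.org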
Data Storage on GPU Nodes
The home directory should be used for software; although there is some room, please avoid storing large amounts of data in your home directory.
- /wntmp/ – mounted on all worker nodes (not all of them are SSDs). This is the preferred temporary location for HTCondor jobs.
- /data/ – mounted on some GPU nodes (not all of them are SSDs). This is the preferred temporary location for data that needs intensive I/O.
- /imdata/ – on the GPU nodes this is a 40 GB ramdisk with very high throughput, but it uses the RAM of the machine. Use it if you need very high I/O, and clean up promptly, as it consumes node memory. There is a 2-day-since-last-access retention policy on it.
- /storage/cms/ – read-only access to the full Caltech Tier2 Ceph storage (total space: ~5 PB).
- /storage/af/ – user home directories (total space: ~150 TB). /storage/af/user/${USER} is your home directory.
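To check which of these volumes exist on a given node and how much free space they have, a plain df works (paths as listed above; /data/ and /imdata/ are only present on some GPU nodes, so drop them from the list elsewhere):

$ df -h /wntmp /data /imdata /storage/cms /storage/af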
Grid proxy
Before working on the Tier2, make sure you have a valid grid proxy, also known as a VOMS or X509 proxy. A valid X509 proxy is needed for Condor job submissions and for any file transfers. CMS users should follow this documentation: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideRunningGridPrerequisites#Grid_certificate
For non-CMS users – please use CILogon: https://cilogon.org
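For CMS users with a registered grid certificate, creating and inspecting the proxy uses the standard VOMS commands (the 192-hour lifetime below is just an example value):

$ voms-proxy-init -voms cms -valid 192:00
$ voms-proxy-info -all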
Accessing bulk CMS storage
Once you have a valid proxy, use the gfal-ls, gfal-rm, and gfal-copy tools to access the large-scale Ceph storage. This storage should be used for multi-terabyte datasets kept for the longer term and for interfacing with the CMS grid.
$ gfal-ls -l gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/
drwxr-xr-x 1 0 11606 4096 Jun 20  2010 user1
drwxrwxr-x 1 0 12014 4096 Jun 13  2011 user2
drwxr-xr-x 1 0 11774 4096 May  4  2012 user3
Accessing T2 fast shared storage
Smaller files that need to be accessed often can be stored in your home directory; it is a shared filesystem with read/write access from all interactive and worker nodes. Generally, it can be treated as a fast local disk.
Setup CMSSW
Works the same as on lxplus:
$ source /cvmfs/cms.cern.ch/cmsset_default.sh
$ cmsrel CMSSW_7_6_6
$ cd CMSSW_7_6_6/src/
$ cmsenv
Singularity
If your CMSSW release needs SL6, you should run it inside a Singularity container:

singularity shell -B /cvmfs /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel6

Inside the container, set up the environment:

export SCRAM_ARCH=slc6_amd64_gcc700
source /cvmfs/cms.cern.ch/cmsset_default.sh

then cmsrel as usual to get the release.
Use the official CMS Singularity images:
- RHEL6 – /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel6
- RHEL7 – /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7
- RHEL8 – /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8
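As a quick sanity check, you can run a one-off command inside one of these images with singularity exec (a sketch only; the command shown is a placeholder for your own workflow):

singularity exec -B /cvmfs /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7 cat /etc/redhat-release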
Submitting HTCondor jobs
1. HTCondor script jobs (batch):
In order to submit a condor job, you need a JDL (job description language) file and an executable shell script. JDL Example:
Universe = vanilla
Executable = example_job.sh
Arguments = 10
#The logs directory must exist before submitting the job.
Log = logs/example_job.$(Cluster).log
Output = logs/example_job.out.$(Cluster).$(Process)
Error = logs/example_job.err.$(Cluster).$(Process)
Requirements=(TARGET.OpSysAndVer=="CentOS7" && regexp("blade-.*", TARGET.Machine))
#This is necessary to choose either rhel8 (slc8) / rhel7 (slc7) / rhel6 (slc6) as needed
+RunAsOwner = True
+InteractiveUser = true
+SingularityImage = "REPLACE_ME"
# Replace REPLACE_ME with a path to your own Singularity image or to an official CMS image.
# The latest CMS Singularity images are available here:
# /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel6 - latest RHEL 6 image
# /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7 - latest RHEL 7 image
# /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8 - latest RHEL 8 image
+SingularityBindCVMFS = True
run_as_owner = True
#Provide information on the proxy in order to access storage
x509userproxy = $ENV(X509_USER_PROXY)
#Don't request more than needed, otherwise your job will wait longer in the queue
RequestDisk = 4
RequestMemory = 2000
RequestCpus = 1
#Transfer this file back to the login node.
#Use this for small files, like plots or txt files, to an existing output directory on login-1.
#Big outputs should be transferred within the job to /storage/cms/ using `gfal-copy`.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = output_small.txt
transfer_input_files = my_calibrations.txt
transfer_output_remaps = "output_small.txt=outputs/output_small.txt.$(Cluster).$(Process)"
#This number can be used to queue more than one job
Queue 1
and example_job.sh script file:
#!/bin/sh
#Print out all bash commands
set -x
#Abort bash script on any error
set -e
#Print some basic debugging info
echo "whoami="`whoami`
echo "pwd="`pwd`
echo "hostname="`hostname`
echo "date="`date`
env
#Print out the proxy
voms-proxy-info -all
#Inside singularity, the scratch directory is here
#This is also where the job starts out
echo "TMP:" `df -h $TMP`
echo "looking inside scratch directory BEFORE job"
ls -al $TMP
#Run cmsenv in an existing CMSSW directory on login-1
cd /storage/af/user/user1/CMSSW_10_2_0/src
source /cvmfs/cms.cern.ch/cmsset_default.sh
eval `scramv1 runtime -sh`
#Go back to the scratch directory on the worker node
cd $TMP
#Your transfer_input_files are located in the working directory
cat my_calibrations.txt
#Run some ROOT code, produce output
echo "my job output datacard or plot" > $TMP/output_small.txt
echo "this is a placeholder for a big output file" > $TMP/output_big.dat
#Return to a non-CMSSW environment, which is required for gfal-copy
eval `scram unsetenv -sh`
#Output can be copied using `gfal-copy` to /storage/cms/ (large files only) or handled by condor using `transfer_output_files = output.txt` (small files only)
#Do NOT copy more than one file per job to Ceph - instead create compressed archives in case you need to produce multiple outputs per job
gfal-copy -f --checksum-mode=both file://$TMP/output_big.dat gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/user1/output_big.dat
echo "looking inside scratch directory AFTER job"
ls -al $TMP
2. HTCondor interactive jobs:
For interactive jobs (either a Singularity or bare-metal job), use condor_submit -i <yourjdlfile>. Make sure you remove the 'Executable' parameter from the JDL.
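A minimal sketch of such an interactive JDL, assuming the official RHEL7 image and the same resource requests as the batch example above (adjust to your needs; the file name interactive.jdl is just an example):

Universe = vanilla
+RunAsOwner = True
+InteractiveUser = true
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7"
+SingularityBindCVMFS = True
run_as_owner = True
x509userproxy = $ENV(X509_USER_PROXY)
RequestDisk = 4
RequestMemory = 2000
RequestCpus = 1
Queue 1

$ condor_submit -i interactive.jdl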
3. HTCondor GPU jobs:
To request a GPU for your job, you need to set 2 parameters inside your JDL:
+InteractiveGPUUser = true
Request_gpus = 1
This ensures that your jobs are matched to machines that have GPUs. Once your job starts to run, environment variables identifying the specific GPU assigned to the job are exported, for example:
CUDA_VISIBLE_DEVICES=0
_CONDOR_AssignedGPUs=CUDA0
Please ensure that these variables are used in your scripts.
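For example, a job script can log which GPU it was given before starting the actual work (a sketch; nvidia-smi is only available on the GPU nodes):

#!/bin/sh
#Show the GPU HTCondor assigned to this job
echo "Assigned GPU(s): $_CONDOR_AssignedGPUs (CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES)"
#List the GPUs physically present on the node (not filtered by CUDA_VISIBLE_DEVICES)
nvidia-smi -L
#CUDA-based frameworks pick up CUDA_VISIBLE_DEVICES automatically, so no extra configuration is usually needed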
Condor remarks:
Submit the job as follows:
$ condor_submit example_job.jdl
Submitting job(s).
1 job(s) submitted to cluster 4.

$ condor_q
-- Schedd: login-1.tier2 : <10.3.10.4:9618?... @ 03/27/18 09:13:17
OWNER  BATCH_NAME                         SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
user   CMD: /storage/user/user/sleep.sh   3/27 09:12     _    _     1      1  4.0

$ condor_history
-- Schedd: login-1.tier2 : <10.3.10.4:9618?... @ 03/27/18 09:16:13
 ID    OWNER  SUBMITTED   RUN_TIME    ST  COMPLETED   CMD
 4.0   user   3/27 09:14  0+00:00:04  C   3/27 09:14  /storage/user/user/sleep.sh 1
To check the status of the job:
$ condor_q -better-analyze <JOB-ID>
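To remove a job you no longer need (standard HTCondor, nothing cluster-specific):

$ condor_rm <JOB-ID>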
Things to remember when submitting local condor jobs:
- The software environment is provided by Singularity; you need to specify it in the job description using +SingularityImage = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel7" (to mimic lxplus7).
- You can ask to run on specific machines using the following job description expression: Requirements=(TARGET.OpSysAndVer=="CentOS7" && regexp("blade-.*", TARGET.Machine))
- In order to submit jobs successfully, x509userproxy = $ENV(X509_USER_PROXY) needs to be defined in the job description, and the corresponding environment variable must be set.
- It's good practice to keep your jobs between 10 minutes and 1-2 hours in length.
More generic information on condor batch jobs can be found on the lxbatch @ CERN documentation: http://batchdocs.web.cern.ch/batchdocs/local/quick.html
More condor examples are here: http://research.cs.wisc.edu/htcondor/manual/v8.7/2_5Submitting_Job.html
Debugging jobs while running
While a job is running, you can ssh into the job itself and debug it at runtime:
$ condor_ssh_to_job 594650.0
Welcome to slot2@blade-1.tier2!
Your condor job is running with pid(s) 24522.
$ pwd
/wntmp/condor/execute/dir_24509
Make sure to log out of the job once you are done debugging.
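Inside the job session you land in the job's scratch directory on the worker node; a few typical things to look at (a sketch; _condor_stdout and _condor_stderr are the captured job output files, when present):

$ ls -al
$ tail -n 50 _condor_stdout
$ top -b -n 1 | head -n 20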
Accessing the Tier2 storage
Log in to lxplus.cern.ch or login-1 and create your own proxy with voms-proxy-init -voms cms. For any command below, you always need a valid proxy. Additionally, keep in mind that the gfal tools are not compatible with the CMSSW/cmsenv environment, so it must be unset before using them:
eval `scram unsetenv -sh`
To list a specific directory (e.g. the command below lists the /store/user/ directory and all users):
gfal-ls -l gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/
To remove a specific file from T2_US_Caltech:
gfal-rm gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/${USER}/my_file/from_my_analysis/file_name.root
You can also use gfal-rm -r to remove everything recursively (be careful and double-check the full path of what you delete):
gfal-rm -r gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/${USER}/my_file/
To copy a file from T2_US_Caltech to your working directory:
gfal-copy gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/${USER}/test-file.root test-file.root
To copy a file from your working directory to T2_US_Caltech storage:
gfal-copy test-file.root gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/${USER}/test-file.root
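If the destination directory does not exist yet, you can create it first and then copy a whole directory recursively (a sketch; my_new_dir and my_local_dir are placeholders):

gfal-mkdir -p gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/${USER}/my_new_dir
gfal-copy -r my_local_dir gsiftp://transfer-lb.ultralight.org//storage/cms/store/user/${USER}/my_new_dir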