{"id":141,"date":"2020-06-25T23:43:03","date_gmt":"2020-06-25T23:43:03","guid":{"rendered":"https:\/\/tier2.hep.caltech.edu\/?page_id=141"},"modified":"2023-12-06T14:39:36","modified_gmt":"2023-12-06T14:39:36","slug":"user-guidebook","status":"publish","type":"page","link":"https:\/\/tier2.hep.caltech.edu\/?page_id=141","title":{"rendered":"User Guidebook"},"content":{"rendered":"\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Nodes and Access<\/h2>\r\n\r\n\r\n\r\n<p>The Caltech Tier2 and iBanks GPU clusters share a CEPH filesystem, and each user has a home directory on CEPH (the default home directory for all users).<\/p>\r\n\r\n\r\n\r\n<p>An SSH key is the only authentication method (administrators use 2FA). Please let the admins know (hep-wheel AT caltech.edu) in case of issues.<\/p>\r\n\r\n\r\n\r\n<p>Login nodes:<\/p>\r\n\r\n\r\n\r\n<ul>\r\n<li><strong>login-1.ultralight.org<\/strong>, <strong>login-2.ultralight.org and login-3.ultralight.org<\/strong>\u00a0&#8211; can be used to access the Caltech Tier2 and GPU clusters. Be aware that login nodes do not have any GPUs attached. (There is an option to use GPU scheduling through HTCondor, but it is a work in progress.)<\/li>\r\n<\/ul>\r\n\r\n\r\n\r\n<h3 class=\"wp-block-heading\">Data Storage on GPU Nodes<\/h3>\r\n\r\n\r\n\r\n<p>The home directory should be used for software. Although there is room, please avoid putting too much data in your home directory.<\/p>\r\n\r\n\r\n\r\n<p><code>\/wntmp\/<\/code>\u00a0&#8211; this volume is mounted on all nodes, but not all of them use SSDs. This is the preferred temporary location for HTCondor jobs.<\/p>\r\n\r\n\r\n\r\n<p><code>\/data\/<\/code>\u00a0&#8211; this volume is mounted on some GPU nodes, but not all of them use SSDs. This is the preferred temporary location for data needed for intensive I\/O.<\/p>\r\n\r\n\r\n\r\n<p><code>\/imdata\/<\/code>\u00a0&#8211; this volume on GPU nodes is a 40G ramdisk with very high throughput, but it uses the RAM of the machine. 
Use it when you need very high I\/O, but clean up promptly, as it consumes the node memory. There is a retention policy of 2 days since last access on it.<\/p>\r\n\r\n\r\n\r\n<p><code>\/storage\/cms\/<\/code>\u00a0&#8211; read-only access path to the full Caltech Tier2 Ceph storage. (Total space: ~5PB)<\/p>\r\n<p><code>\/storage\/af\/<\/code>\u00a0&#8211; path for user home directories. (Total space: ~150TB)\u00a0<code>\/storage\/af\/user\/${USER}<\/code>\u00a0&#8211; is your home directory.<\/p>\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Grid proxy<\/h2>\r\n\r\n\r\n\r\n<p>Before working on Tier2, make sure you have a valid grid proxy, also known as the VOMS proxy or X509 proxy. A valid X509 proxy is needed for condor job submissions and for any file transfers. CMS users, please follow this documentation: <a href=\"https:\/\/twiki.cern.ch\/twiki\/bin\/view\/CMSPublic\/SWGuideRunningGridPrerequisites#Grid_certificate\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/twiki.cern.ch\/twiki\/bin\/view\/CMSPublic\/SWGuideRunningGridPrerequisites#Grid_certificate<\/a><\/p>\r\n<p>Non-CMS users, please use CILogon: <a href=\"https:\/\/cilogon.org\">https:\/\/cilogon.org<\/a><\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"accessing-bulk-cms-storage\">Accessing bulk CMS storage<\/h2>\r\n\r\n\r\n\r\n<p>Once you have a valid proxy, use the\u00a0<code>gfal-ls<\/code>, <code>gfal-rm<\/code>, and <code>gfal-copy<\/code>\u00a0tools to access the large-scale Ceph storage. 
Use this storage for long-term multi-terabyte datasets and for interfacing with the CMS grid.<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\"><strong>$\u00a0gfal-ls -l gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/<\/strong>\r\ndrwxr-xr-x\u00a0 \u00a01 0\u00a0 \u00a0 \u00a011606\u00a0 \u00a0 \u00a0 4096 Jun 20\u00a0 2010 user1\r\ndrwxrwxr-x\u00a0 \u00a01 0\u00a0 \u00a0 \u00a012014\u00a0 \u00a0 \u00a0 4096 Jun 13\u00a0 2011 user2\r\ndrwxr-xr-x\u00a0 \u00a01 0\u00a0 \u00a0 \u00a011774\u00a0 \u00a0 \u00a0 4096 May\u00a0 4\u00a0 2012 user3<\/pre>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"accessing-t2-fast-shared-storage\">Accessing T2 fast shared storage<\/h2>\r\n\r\n\r\n\r\n<p>Smaller files that need to be accessed often can be stored in your home directory. This is a shared filesystem with read\/write access from all interactive and worker nodes. Generally, it can be treated as a fast local disk.<\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"setup-cmssw\">Setup CMSSW<\/h2>\r\n\r\n\r\n\r\n<p>Setup works the same as on lxplus:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\"><strong>$\u00a0source \/cvmfs\/cms.cern.ch\/cmsset_default.sh\u00a0<\/strong>\r\n<strong>$\u00a0cmsrel CMSSW_7_6_6<\/strong>\r\n<strong>$\u00a0cd CMSSW_7_6_6\/src\/<\/strong>\r\n<strong>$\u00a0cmsenv<\/strong><\/pre>\r\n\r\n\r\n\r\n<h3 class=\"wp-block-heading\" id=\"singularity\">Singularity<\/h3>\r\n\r\n\r\n\r\n<p>If your CMSSW release needs SL6, you should run it inside a Singularity container:<br \/><code>singularity shell -B \/cvmfs \/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel6<\/code><br \/><code>export SCRAM_ARCH=slc6_amd64_gcc700<\/code><br \/><code>source \/cvmfs\/cms.cern.ch\/cmsset_default.sh<\/code><\/p>\r\n\r\n\r\n\r\n<p>Then run\u00a0<code>cmsrel<\/code>\u00a0as usual to get the release.<\/p>\r\n<p>Use the official CMS Singularity images:<br \/><strong>Rhel6 &#8211; 
\/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel6<\/strong><br \/><strong>Rhel7 &#8211; \/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel7<\/strong><br \/><strong>Rhel8 &#8211; \/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel8<\/strong><\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"submitting-htcondor-jobs-cpu-based-jobs\">Submitting HTCondor jobs<\/h2>\r\n\r\n\r\n\r\n<p><strong>1. HTCondor script jobs (batch):<\/strong><\/p>\r\n\r\n\r\n\r\n<p>In order to submit a condor job, you need a JDL (job description language) file and an executable shell script. JDL Example:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">Universe = vanilla\r\nExecutable = example_job.sh\r\nArguments = 10\r\n\r\n#The logs directory must exist before submitting job.\r\nLog = logs\/example_job.$(Cluster).log\r\nOutput = logs\/example_job.out.$(Cluster).$(Process)\r\nError = logs\/example_job.err.$(Cluster).$(Process)\r\n\r\nRequirements=(TARGET.OpSysAndVer==\"CentOS7\" &amp;&amp; regexp(\"blade-.*\", TARGET.Machine))\r\n\r\n#This is necessary to choose either rhel8(slc8)\/rhel7 (slc7)\/rhel6 (slc6) as needed\r\n+RunAsOwner = True\r\n+InteractiveUser = true\r\n+SingularityImage = \"REPLACE_ME\"<br \/># REPLACE_ME with a path to your own singularity image or to cms official image.<br \/># Latest CMS Singularity images are available here:<br \/># <strong>\/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel6<\/strong>- latest RHEL 6 image<br \/># <strong>\/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel7<\/strong> - latest RHEL 7 image<br \/># <strong>\/cvmfs\/singularity.opensciencegrid.org\/cmssw\/cms:rhel8<\/strong> - latest RHEL 8 image.<\/pre>\r\n<pre class=\"wp-block-preformatted\">+SingularityBindCVMFS = True\r\nrun_as_owner = True\r\n\r\n#Provide information on proxy in order to access storage\r\nx509userproxy = $ENV(X509_USER_PROXY)\r\n\r\n#Don't request more than needed, otherwise, your job will wait longer in 
queue\r\nRequestDisk = 4\r\nRequestMemory = 2000\r\nRequestCpus = 1\r\n\r\n#transfer this file back to the login node\r\n#use this for small files, like plots or txt files to an existing output directory on login-1\r\n#Big outputs should be transferred within the job to \/storage\/cms\/ using `gfal-copy`\r\nshould_transfer_files = YES\r\nwhen_to_transfer_output = ON_EXIT\r\ntransfer_output_files = output_small.txt\r\ntransfer_input_files = my_calibrations.txt\r\ntransfer_output_remaps = \"output_small.txt=outputs\/output_small.txt.$(Cluster).$(Process)\"\r\n\r\n#This number can be used to queue more than one job\r\nQueue 1<\/pre>\r\n\r\n\r\n\r\n<p>And the example_job.sh script file:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">#!\/bin\/bash\r\n\r\n#Print out all bash commands\r\nset -x\r\n\r\n#Abort bash script on any error\r\nset -e\r\n\r\n#Print some basic debugging info\r\necho \"whoami=\"`whoami`\r\necho \"pwd=\"`pwd`\r\necho \"hostname=\"`hostname`\r\necho \"date=\"`date`\r\nenv\r\n\r\n#print out proxy\r\nvoms-proxy-info -all\r\n\r\n#Inside singularity, the scratch directory is here\r\n#This is also where the job starts out\r\necho \"TMP:\" `df -h $TMP`\r\necho \"looking inside scratch directory BEFORE job\"\r\nls -al $TMP\r\n\r\n#Run cmsenv in an existing CMSSW directory on login-1\r\ncd \/storage\/af\/user\/user1\/CMSSW_10_2_0\/src\r\nsource \/cvmfs\/cms.cern.ch\/cmsset_default.sh\r\neval `scramv1 runtime -sh`\r\n\r\n#go back to scratch directory on worker node\r\ncd $TMP\r\n\r\n#your transfer_input_files are located in the working directory\r\ncat my_calibrations.txt\r\n\r\n#Run some ROOT code, produce output\r\necho \"my job output datacard or plot\" &gt; $TMP\/output_small.txt\r\necho \"this is a placeholder for a big output file\" &gt; $TMP\/output_big.dat\r\n\r\n#Return to non-CMSSW environment, which is required for gfal-copy\r\neval `scram unsetenv -sh`\r\n\r\n#Output can be copied using `gfal-copy` to \/storage\/cms\/ (large files only) or handled 
by condor using `transfer_output_files = output_small.txt` (for small files only)\r\n#do NOT copy more than one file per job to Ceph - instead create compressed archives in case you need to produce multiple outputs per job\r\ngfal-copy -f --checksum-mode=both file:\/\/$TMP\/output_big.dat gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/user1\/output_big.dat\r\n\r\necho \"looking inside scratch directory AFTER job\"\r\nls -al $TMP<\/pre>\r\n\r\n\r\n\r\n<p><strong>2. HTCondor interactive jobs:<\/strong><\/p>\r\n\r\n\r\n\r\n<p>For interactive jobs (either Singularity or bare-metal), use\u00a0<code>condor_submit -i &lt;yourjdlfile&gt;<\/code>. Make sure you remove the &#8216;Executable&#8217; parameter from the JDL first.<\/p>\r\n\r\n\r\n\r\n<p><strong>3. HTCondor GPU jobs:<\/strong><\/p>\r\n\r\n\r\n\r\n<p>To request a GPU for your job, set two parameters in your JDL:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">+InteractiveGPUUser = true\r\nRequest_gpus = 1<\/pre>\r\n\r\n\r\n\r\n<p>This ensures that your jobs are matched to machines that have GPUs. Once your job starts to run, environment variables identifying the specific GPU assigned to the job are exported, for example:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">CUDA_VISIBLE_DEVICES=0\r\n_CONDOR_AssignedGPUs=CUDA0<\/pre>\r\n\r\n\r\n\r\n<p>Please make sure that your scripts use these variables.<\/p>\r\n\r\n\r\n\r\n<p><strong>Condor remarks:<\/strong><\/p>\r\n\r\n\r\n\r\n<p>Submit the job as follows:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">$\u00a0<strong>condor_submit example_job.jdl<\/strong>\r\nSubmitting job(s).\r\n1 job(s) submitted to cluster 4.\r\n\r\n$\u00a0<strong>condor_q<\/strong>\r\n-- Schedd: login-1.tier2 : &lt;10.3.10.4:9618?... 
@ 03\/27\/18 09:13:17\r\nOWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS\r\nuser CMD: \/storage\/user\/user\/sleep.sh 3\/27 09:12 _ _ 1 1 4.0\r\n\r\n$\u00a0<strong>condor_history<\/strong>\r\n-- Schedd: login-1.tier2 : &lt;10.3.10.4:9618?... @ 03\/27\/18 09:16:13\r\nID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD\r\n4.0 user 3\/27 09:14 0+00:00:04 C 3\/27 09:14 \/storage\/user\/user\/sleep.sh 1<\/pre>\r\n\r\n\r\n\r\n<p>To check the status of the job:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">$\u00a0<strong>condor_q -bet &lt;JOB-ID&gt;<\/strong><\/pre>\r\n\r\n\r\n\r\n<p>Things to remember when submitting local condor jobs:<\/p>\r\n\r\n\r\n\r\n<ul>\r\n<li>The software environment is provided by\u00a0<a href=\"http:\/\/cms-sw.github.io\/singularity.html\" target=\"_blank\" rel=\"noreferrer noopener\">Singularity<\/a>. You need to specify the image in the job description using\u00a0<code>+SingularityImage = \"\/cvmfs\/singularity.opensciencegrid.org\/bbockelm\/cms:rhel7\"<\/code>\u00a0(to mimic lxplus7).<\/li>\r\n<li>You can ask to run on specific machines using the following job description expression:\u00a0<code>Requirements=(TARGET.OpSysAndVer==\"CentOS7\" &amp;&amp; regexp(\"blade-.*\", TARGET.Machine))<\/code><\/li>\r\n<li>In order to submit jobs successfully, the following needs to be defined in the job description:\u00a0<code>x509userproxy = $ENV(X509_USER_PROXY)<\/code>, and the corresponding environment variable must be defined.<\/li>\r\n<li>It&#8217;s good practice to keep your jobs between 10 minutes and 1-2 hours in length.<\/li>\r\n<\/ul>\r\n\r\n\r\n\r\n<p>More general information on condor batch jobs can be found in the lxbatch documentation at CERN:\u00a0<a href=\"http:\/\/batchdocs.web.cern.ch\/batchdocs\/local\/quick.html\" target=\"_blank\" rel=\"noreferrer noopener\">http:\/\/batchdocs.web.cern.ch\/batchdocs\/local\/quick.html<\/a><\/p>\r\n\r\n\r\n\r\n<p>More condor examples are here:\u00a0<a 
href=\"http:\/\/research.cs.wisc.edu\/htcondor\/manual\/v8.7\/2_5Submitting_Job.html\" target=\"_blank\" rel=\"noreferrer noopener\">http:\/\/research.cs.wisc.edu\/htcondor\/manual\/v8.7\/2_5Submitting_Job.html<\/a>\u00a0<\/p>\r\n\r\n\r\n\r\n<h3 class=\"wp-block-heading\" id=\"debugging-jobs-while-running\">Debugging jobs while running<\/h3>\r\n\r\n\r\n\r\n<p>You can SSH into any of your running jobs to debug it at runtime:<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">$\u00a0<strong>condor_ssh_to_job 594650.0<\/strong><br \/>Welcome to slot2@blade-1.tier2!<br \/>Your condor job is running with pid(s) 24522.<br \/>$\u00a0<strong>pwd<\/strong><br \/>\/wntmp\/condor\/execute\/dir_24509<\/pre>\r\n\r\n\r\n\r\n<p>Make sure to log out of the job once you are done debugging.<\/p>\r\n\r\n\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"access-on-the-tier2\">Access on the Tier2<\/h2>\r\n\r\n\r\n\r\n<p>Log in to lxplus.cern.ch or login-1 and create your own proxy with\u00a0<code>voms-proxy-init -voms cms<\/code>. Every command below requires a valid proxy. Additionally, keep in mind that the gfal tools are not compatible with the CMSSW \/\u00a0<code>cmsenv<\/code>\u00a0environment, so unset it with\u00a0<code>eval `scram unsetenv -sh`<\/code>\u00a0before using them.<\/p>\r\n\r\n\r\n\r\n<h4 class=\"wp-block-heading\" id=\"to-list-a-specific-directory-eg-bottom-list-all-storeuser-directory-and-all-users\">To\u00a0list\u00a0a specific directory (e.g. 
the command below lists the \/store\/user\/ directory with all users):<\/h4>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">gfal-ls -l gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/<\/pre>\r\n\r\n\r\n\r\n<h4 class=\"wp-block-heading\" id=\"to-remove-a-specific-file-from-t2uscaltech\">To\u00a0remove\u00a0a specific file from T2_US_Caltech:<\/h4>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">gfal-rm gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/${USER}\/my_file\/from_my_analysis\/file_name.root<\/pre>\r\n\r\n\r\n\r\n<h4 class=\"wp-block-heading\" id=\"you-can-also-use-rm-r-to-remove-everything-recursively-be-careful-and-double-check-full-path-what-you-delete\">You can also use gfal-rm -r to remove everything recursively (be careful and double-check the full path of what you delete):<\/h4>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">gfal-rm -r gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/${USER}\/my_file\/<\/pre>\r\n\r\n\r\n\r\n<h4 class=\"wp-block-heading\" id=\"to-copy-file-from-t2uscaltech-to-your-working-directory\">To\u00a0copy\u00a0a file from T2_US_Caltech to your working directory:<\/h4>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">gfal-copy gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/${USER}\/test-file.root test-file.root<\/pre>\r\n\r\n\r\n\r\n<h4 class=\"wp-block-heading\" id=\"to-copy-file-from-your-working-directory-to-t2uscaltech-storage\">To\u00a0copy a file from your working directory to T2_US_Caltech storage:<\/h4>\r\n\r\n\r\n\r\n<pre class=\"wp-block-preformatted\">gfal-copy test-file.root gsiftp:\/\/transfer-lb.ultralight.org\/\/storage\/cms\/store\/user\/${USER}\/test-file.root<\/pre>\r\n","protected":false},"excerpt":{"rendered":"<p>Nodes and Access The Caltech Tier2 and iBanks GPU clusters share a CEPH filesystem, and each user has a home directory on CEPH (the default home directory for all users). 
An SSH key is the only authentication (Administrators use 2FA). Please let the admins know (hep-wheel AT caltech.edu) in case of issues. Login nodes: login-1.ultralight.org, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":11,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"_links":{"self":[{"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/pages\/141"}],"collection":[{"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=141"}],"version-history":[{"count":9,"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/pages\/141\/revisions"}],"predecessor-version":[{"id":217,"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/pages\/141\/revisions\/217"}],"up":[{"embeddable":true,"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=\/wp\/v2\/pages\/11"}],"wp:attachment":[{"href":"https:\/\/tier2.hep.caltech.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}