Monitoring

Local monitoring

On most of the T2 nodes, a service called “telegraf” runs and collects various metrics into a time series database. Telegraf records the CPU and memory usage of the machines and can also analyze log files for events. More detailed graphs are available on sensei3 (you need to be listed among the users of the cmscaltech GitHub repo to access the monitoring URLs).

General T2 dashboards: https://sensei3.hep.caltech.edu:3000/d/iQICM3dmk/jooseps-dashboard
HDFS and Transfer Services: https://sensei3.hep.caltech.edu:3000/d/000000035/hdfs-mon?refresh=5m&orgId=1
Condor monitoring: https://sensei3.hep.caltech.edu:3000/d/jL88TdcWz/condor?orgId=1
GPU Monitoring: https://sensei3.hep.caltech.edu:3000/d/XNi-A4Gik/gpus-mon?orgId=1
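
To illustrate the kind of memory metric that telegraf gathers on each node, here is a minimal Python sketch that derives a used-memory percentage from /proc/meminfo-style text. It is not telegraf's own implementation; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        # The first field after the colon is the numeric value.
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(text):
    """Return used memory as a percentage, based on MemTotal and MemAvailable."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample input, not taken from a real T2 node.
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""
```

On a Linux node, the same function can be applied to the contents of /proc/meminfo read from disk instead of SAMPLE.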

CMS/CERN-side monitoring

Every site has a number of monitors running from CMS/CERN. Most of these require a valid grid proxy in the browser. The central entry point for these dashboards is https://monit-grafana.cern.ch/

Site Readiness: https://test-cmssst.web.cern.ch/sitereadiness/report.html#T2_US_Caltech
Site Availability: http://wlcg-sam-cms.cern.ch/templates/ember/#/historicalsmry/heatMap?profile=CMS_CRITICAL_FULL&site=T2_US_Caltech&time=24h
HammerCloud (HC), which automatically submits simple end-to-end CMS jobs to the site: https://monit-grafana.cern.ch/d/cmsTMGlobal/cms-tasks-monitoring-globalview?from=now-24h&orgId=11&to=now&var-user=sciaba&var-site=All
Global pool (glidein-WMS): https://cms-gwmsmon.cern.ch/totalview/T2_US_Caltech
PhEDEx: https://cmsweb.cern.ch/phedex/prod/Activity::QualityPlots
XRootD: https://monit-grafana.cern.ch/d/000000422/xrootd-transfers-30-days
FTS dashboard (in case of issues with transfers): https://monit-grafana.cern.ch/d/000000420/fts-transfers-30-days
CERN FTS3: https://fts3.cern.ch:8449/fts3/ftsmon/#/
FNAL FTS3: https://cmsfts3.fnal.gov:8449/fts3/ftsmon/#/
UK-RAL FTS3: https://lcgfts3.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/
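
Several of the links above embed the site name T2_US_Caltech. The following Python sketch rebuilds those three URLs from a site name, which can be handy for bookmarks or scripts. It assumes that other sites follow the same URL patterns, which the list above does not confirm; the function name site_urls is hypothetical.

```python
from urllib.parse import urlencode


def site_urls(site):
    """Build the site-specific CMS monitoring URLs for a given site name."""
    # Query parameters copied from the Site Availability link above.
    sam_query = urlencode(
        {"profile": "CMS_CRITICAL_FULL", "site": site, "time": "24h"}
    )
    return {
        "site_readiness": (
            "https://test-cmssst.web.cern.ch/sitereadiness/report.html#" + site
        ),
        "site_availability": (
            "http://wlcg-sam-cms.cern.ch/templates/ember/"
            "#/historicalsmry/heatMap?" + sam_query
        ),
        "global_pool": "https://cms-gwmsmon.cern.ch/totalview/" + site,
    }


urls = site_urls("T2_US_Caltech")
```

Calling site_urls("T2_US_Caltech") reproduces the Site Readiness, Site Availability, and Global pool links listed above.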