Resources – CMS Caltech Tier2

Our main cluster operates 24×7, supporting both remote data operations for grid analysis and production computation for the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider. The cluster is orchestrated with the latest version of HTCondor and hosts more than 8 PB of raw disk space in HDFS.

The deployed CPUs provide more than 10,000 job slots with an aggregate computing power of more than 100,000 HS06 units. A collection of 10 load-balanced GridFTP transfer nodes with 10 Gbit/s network interfaces serves physics datasets to users and exchanges them with other grid sites. All of the data transfer nodes also run the XrootD framework, giving grid users on-demand access to physics datasets hosted at our site.

Available computing resources

The entire Tier2 cluster runs the latest CentOS 7. On grid resources, users and jobs may use Docker and Singularity images. Images can be distributed either on local disk or through the CernVM File System, which is available on all compute nodes. The available computing resources are listed below:

Manufacturer	Processor / Socket Count	Batch slots per node	Available Memory per node (MB)	Number of nodes	Total Batch slots
SuperMicro	2x Xeon E5630 2.53GHz	16	40960	19	304
SuperMicro	2x Xeon L5520 2.27GHz	16	40960	16	256
SuperMicro	2x Xeon L5630 2.53GHz	16	40960	10	160
SuperMicro	2x Xeon L5640 2.26GHz	24	61440	32	768
SuperMicro	2x AMD Opteron 6378	32	81920	43	1376
SuperMicro	2x Xeon E5 2670 2.6GHz	32	81920	22	704
SuperMicro	2x Xeon E5 2670 V2 2.60GHz	40	102400	22	880
SuperMicro	2x Xeon E5 2670 V3 2.30GHz	48	122880	8	384
SuperMicro	2x Xeon E5 2660 2.2GHz	32	81920	11	352
SuperMicro	2x Xeon E5 2650 2GHz	32	81920	3	96
SuperMicro	2x Xeon E5 2650 V3 2.30GHz	40	102400	31	1240
SuperMicro	2x Xeon E5 2650 V4 2.2GHz	48	122880	32	1536
SuperMicro	2x Xeon E5 2687W V4 3.0 GHz	48	122880	9	432
SuperMicro	2x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz	80	204800	8	640
SuperMicro	2x Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz	64	163840	1	64
SuperMicro	2x Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz	88	225280	1	88
SuperMicro	2x Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz	72	184320	2	144
SuperMicro	4x Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz	160	409600	4	640
Sum	-	888	2273280	274	10064
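
The totals in the last row can be cross-checked from the per-node figures. The Python sketch below recomputes them; the tuples are copied from the table, the variable names are illustrative only, and the memory column is assumed to be in MB (the table does not state a unit).

```python
# (batch slots per node, memory per node, number of nodes), copied from the table.
# Assumption: memory values are in MB; the table itself gives no unit.
NODE_TYPES = [
    (16, 40960, 19),
    (16, 40960, 16),
    (16, 40960, 10),
    (24, 61440, 32),
    (32, 81920, 43),
    (32, 81920, 22),
    (40, 102400, 22),
    (48, 122880, 8),
    (32, 81920, 11),
    (32, 81920, 3),
    (40, 102400, 31),
    (48, 122880, 32),
    (48, 122880, 9),
    (80, 204800, 8),
    (64, 163840, 1),
    (88, 225280, 1),
    (72, 184320, 2),
    (160, 409600, 4),
]

# Total number of nodes across all hardware generations.
total_nodes = sum(nodes for _, _, nodes in NODE_TYPES)

# Total batch slots: slots per node multiplied by node count, summed over rows.
total_slots = sum(slots * nodes for slots, _, nodes in NODE_TYPES)

# Memory available per batch slot, collected as a set to see whether it is uniform.
mem_per_slot = {mem // slots for slots, mem, _ in NODE_TYPES}

print(total_nodes, total_slots, mem_per_slot)
```

Under these assumptions every node type provides the same amount of memory per batch slot, which is why the per-node memory scales with the slot count.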

Networking

The Caltech HEP and Network team also has its own AS (32361) and manages four Class C IPv4 networks and one /48 IPv6 network to support its leading-edge network developments, carried out with many science and computer science research teams and with R&E network partners in the US, Europe, Latin America, and Asia. All managed networks are listed below:

Network	Main ASN Owner
192.84.86.0/24	AS32361 Ultralight
198.32.43.0/24	AS32361 Ultralight
198.32.44.0/24	AS32361 Ultralight
198.32.45.0/24	AS32361 Ultralight
131.215.207.0/24	AS31 California Institute of Technology
2607:f380:a4f::/48	AS2152 California State University, Office of the Chancellor
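
To give a sense of the address space these prefixes cover, the sketch below parses the listed networks with Python's standard `ipaddress` module. The list is copied from the table; the variable names are illustrative only.

```python
import ipaddress

# Prefixes copied from the table above.
NETWORKS = [
    "192.84.86.0/24",
    "198.32.43.0/24",
    "198.32.44.0/24",
    "198.32.45.0/24",
    "131.215.207.0/24",
    "2607:f380:a4f::/48",
]

parsed = [ipaddress.ip_network(n) for n in NETWORKS]
ipv4 = [n for n in parsed if n.version == 4]
ipv6 = [n for n in parsed if n.version == 6]

# Total IPv4 addresses covered by the listed /24 prefixes.
ipv4_addresses = sum(n.num_addresses for n in ipv4)

# Number of /64 subnets that fit inside the listed IPv6 prefix(es).
ipv6_slash64 = sum(2 ** (64 - n.prefixlen) for n in ipv6)

print(ipv4_addresses, ipv6_slash64)
```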

Available Storage

The Caltech team maintains an HDFS distributed file system, which handles large data sets on commodity hardware and can scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. Hadoop storage is spread across compute nodes and JBODs; there are currently more than 240 nodes, each with between 3 and 60 disks. Going forward, we continue to separate storage from compute nodes for higher availability. Overall, the Tier2 maintains more than 8 PB of raw disk space with a replication factor of 2. Future upgrades are planned to use erasure coding to increase the usable disk space.
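
The effect of replication versus erasure coding on usable space can be estimated with simple arithmetic. The sketch below is a rough illustration only: it uses the 8 PB lower bound quoted above, and it assumes a Reed-Solomon layout with 6 data and 3 parity units (one of the built-in HDFS policies). The text does not say which erasure coding policy is planned, so that choice and the function names are hypothetical.

```python
RAW_PB = 8.0  # lower bound on raw capacity quoted in the text


def usable_replicated(raw_pb: float, replicas: int) -> float:
    """Usable capacity when every block is stored `replicas` times."""
    return raw_pb / replicas


def usable_erasure_coded(raw_pb: float, data_units: int, parity_units: int) -> float:
    """Usable capacity when each stripe has data_units data and parity_units parity cells."""
    return raw_pb * data_units / (data_units + parity_units)


# Current setup: replication factor 2.
usable_now = usable_replicated(RAW_PB, 2)

# Assumption: RS(6,3) is used here purely as an example policy.
usable_ec = usable_erasure_coded(RAW_PB, 6, 3)

print(f"replication x2: {usable_now:.2f} PB, RS(6,3): {usable_ec:.2f} PB")
```

With these numbers, replication leaves half of the raw space usable, while the assumed RS(6,3) layout leaves two thirds usable and tolerates the loss of up to three cells per stripe.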

The Caltech group has implemented Ceph as a storage system for user workloads and home directories, giving users a stable, high-performance parallel filesystem at the Caltech Tier2. We host the necessary centrally produced NANOAOD datasets locally, so that typical physics analyses can proceed without generating significant network traffic or relying on external resources. The current system, based on enterprise-grade hard disks, has peaked at a throughput of 2 GB/s under a highly parallel analysis workload; future use of solid-state drives is expected to further boost the science capability of the system. The Ceph cluster has a total capacity of 288 TB. It consists of 6 Ceph monitors, 1 MDS, and 3 storage nodes, each with:
2 x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz with 504 GB of memory
12 x 8 TB SAS disks (used for CephFS)
2 x 6.4 TB Intel SSD DC P4608 (used for block devices)
2 x Mellanox 40G NICs (in bond mode)
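
The quoted 288 TB figure follows directly from the per-node hardware list. The short sketch below reproduces it and also derives the total SSD capacity; the variable names are illustrative, and all capacities are raw (before any Ceph replication or erasure coding, which the text does not specify).

```python
STORAGE_NODES = 3

# Per-node hardware, copied from the list above.
HDD_PER_NODE = 12
HDD_SIZE_TB = 8
SSD_PER_NODE = 2
SSD_SIZE_TB = 6.4

# Raw hard-disk capacity backing CephFS.
hdd_total_tb = STORAGE_NODES * HDD_PER_NODE * HDD_SIZE_TB

# Raw SSD capacity backing block devices.
ssd_total_tb = STORAGE_NODES * SSD_PER_NODE * SSD_SIZE_TB

print(f"HDD: {hdd_total_tb} TB raw, SSD: {ssd_total_tb:.1f} TB raw")
```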

The Caltech group, together with the UCSD team, has deployed a combined cache across Caltech and UCSD (SoCal) and has since extended it to 1.5 PB. Sharing a distributed cache across site boundaries reduces the disk space needed to serve data analysis at the two CMS Tier2 sites, Caltech and UCSD. Currently, the cache pre-caches all Run II MINIAOD and AOD datasets to speed up user analysis running at these two sites. We also see a very good cache hit/miss ratio (10/1 on average, with peaks of 15/1).
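
A hit/miss ratio is easier to interpret as the fraction of requests served from the cache. The sketch below does that conversion for the two ratios quoted above; the function name is illustrative only.

```python
def hit_fraction(hits: float, misses: float) -> float:
    """Fraction of requests served from cache, given a hits:misses ratio."""
    return hits / (hits + misses)


average_hit_fraction = hit_fraction(10, 1)  # average ratio quoted in the text
peak_hit_fraction = hit_fraction(15, 1)     # peak ratio quoted in the text

print(f"average: {average_hit_fraction:.1%}, peak: {peak_hit_fraction:.1%}")
```

In other words, a 10/1 ratio means roughly nine out of ten requests are served without going to remote storage.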

GPU iBanks Cluster

The iBanks cluster at Caltech consists of 4 nodes. Two nodes, donated by Supermicro, each host 8 GTX 1080 GPUs donated by NVIDIA. A third Supermicro server hosts 6 NVIDIA Titan Xp GPUs, and a fourth Supermicro server currently hosts 2 NVIDIA Titan X GPUs. The nodes are connected with a dual 10G network and can be used in conjunction with MPI or other distributed computing mechanisms. The primary purpose of the cluster is to accelerate the training of deep neural networks, for which computations are highly parallelizable. Several projects within Caltech and collaborating groups (such as Fermilab, LBL, CERN, and Bari) have greatly benefited from these resources since 2015. The cluster has also hosted machine learning tutorials with up to 30 concurrent students. The nodes were used for prototyping applications that were later deployed on the ALCF Cooley, ORNL Titan, and CSCS Piz Daint supercomputers.

SDN Development testbed

The Tier2 and SDN testbed together host a network of multiple switches and routers from vendors including Dell, Arista, Mellanox, and Inventec. The equipment includes several 32 x 100GE SDN-capable switches and server platforms with sets of NVMe SSDs, disk arrays, and 40GE and 100GE network interfaces used in the SANDIE, SENSE, and SDN NGenIA R&D projects.
The Caltech HEP and Network team also has its own AS and Class C networks (see Networking above) to support its leading-edge network developments, carried out with many science and computer science research teams and with R&E network partners in the US, Europe, Latin America, and Asia.

The SDN testbed is augmented with network hardware loaned or donated by various industry partners. For example, for the Supercomputing 2018 conference, Ciena installed a Waveserver data center interconnect to support a 200G alien wave on the dark fiber connecting the campus to the Equinix Point of Presence in Los Angeles, where Caltech peers with the CENIC, Internet2, and ESnet networks. In recent years, Ciena loaned the Caltech HEP and Network team a DWDM metro setup connecting the clusters on the Pasadena campus to the Los Angeles PoP, and Level3 loaned a dark fiber pair between the campus and Los Angeles.