Resources

Our main cluster operates 24×7, supporting both remote data operations for grid analysis and production computation for the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider. The cluster is orchestrated with the latest version of HTCondor and hosts more than 8 PB of raw disk space in HDFS.

The deployed CPUs provide more than 10,000 job slots with an aggregate computing power of more than 100,000 HS06 units. A collection of 10 load-balanced GridFTP transfer nodes with 10 Gbit/s network interfaces serves physics datasets to users and exchanges them with other grid sites. All of the data transfer nodes also run the XRootD framework, providing grid users with on-demand access to the physics datasets hosted at our site.
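For illustration, the short sketch below shows how such on-demand access typically looks from a user's side with the XRootD Python bindings; the redirector hostname and file path are placeholders, not the site's actual endpoints.

# Minimal sketch: read a few bytes of a remotely hosted dataset via XRootD.
# Requires the XRootD Python bindings (pip install xrootd).
# "xrootd.example-site.org" and the /store path are placeholders.
from XRootD import client

f = client.File()
status, _ = f.open("root://xrootd.example-site.org//store/user/example/file.root")
if not status.ok:
    raise RuntimeError(status.message)

status, data = f.read(offset=0, size=1024)   # read the first 1 kB
print(f"read {len(data)} bytes")
f.close()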

Available computing resources

The entire Tier2 cluster runs the latest CentOS 7 release. On the grid resources, users and jobs can run Docker and Singularity images; images can be distributed either on local disk or via the CernVM File System (CVMFS), which is available on all compute nodes. The available computing resources are listed in the table below, followed by a short job-submission sketch.

Manufacturer | Processor / Socket Count | Batch slots per node | Memory per node (MB) | Number of nodes | Total batch slots
SuperMicro | 2x Xeon E5630 2.53GHz | 16 | 40960 | 19 | 304
SuperMicro | 2x Xeon L5520 2.27GHz | 16 | 40960 | 16 | 256
SuperMicro | 2x Xeon L5630 2.53GHz | 16 | 40960 | 10 | 160
SuperMicro | 2x Xeon L5640 2.26GHz | 24 | 61440 | 32 | 768
SuperMicro | 2x AMD Opteron 6378 | 32 | 81920 | 43 | 1376
SuperMicro | 2x Xeon E5 2670 2.6GHz | 32 | 81920 | 22 | 704
SuperMicro | 2x Xeon E5 2670 V2 2.60GHz | 40 | 102400 | 22 | 880
SuperMicro | 2x Xeon E5 2670 V3 2.30GHz | 48 | 122880 | 8 | 384
SuperMicro | 2x Xeon E5 2660 2.2GHz | 32 | 81920 | 11 | 352
SuperMicro | 2x Xeon E5 2650 2GHz | 32 | 81920 | 3 | 96
SuperMicro | 2x Xeon E5 2650 V3 2.30GHz | 40 | 102400 | 31 | 1240
SuperMicro | 2x Xeon E5 2650 V4 2.2GHz | 48 | 122880 | 32 | 1536
SuperMicro | 2x Xeon E5 2687W V4 3.0GHz | 48 | 122880 | 9 | 432
SuperMicro | 2x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz | 80 | 204800 | 8 | 640
SuperMicro | 2x Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz | 64 | 163840 | 1 | 64
SuperMicro | 2x Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz | 88 | 225280 | 1 | 88
SuperMicro | 2x Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz | 72 | 184320 | 2 | 144
SuperMicro | 4x Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz | 160 | 409600 | 4 | 640
Sum | — | 888 | 2273280 | 274 | 10064
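As mentioned above, the following is a minimal sketch of submitting a containerized job through the HTCondor Python bindings; the Singularity image path, executable, custom job attribute, and resource requests are illustrative assumptions rather than the site's required configuration.

# Minimal sketch: submit one job that runs inside a Singularity image taken from
# CVMFS, using the HTCondor Python bindings (pip install htcondor).
# The +SingularityImage attribute and the image path below are placeholders.
import htcondor

job = htcondor.Submit({
    "executable": "/bin/hostname",
    "+SingularityImage": '"/cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/python:3.9"',
    "request_cpus": "1",
    "request_memory": "2048",   # MB, matching the per-slot memory scale in the table
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()              # local schedd on a submit node
result = schedd.submit(job, count=1)    # queue a single job
print("submitted cluster", result.cluster())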

Networking

The Caltech HEP and Network team also has its own AS (AS32361) and manages four Class C IPv4 networks and one /48 IPv6 network to support its leading-edge network developments, carried out with many science and computer science research teams and R&E network partners in the US, Europe, Latin America, and Asia. All managed networks are listed below:

Network | Main ASN | Owner
192.84.86.0/24 | AS32361 | Ultralight
198.32.43.0/24 | AS32361 | Ultralight
198.32.44.0/24 | AS32361 | Ultralight
198.32.45.0/24 | AS32361 | Ultralight
131.215.207.0/24 | AS31 | California Institute of Technology
2607:f380:a4f::/48 | AS2152 | California State University, Office of the Chancellor

Available Storage

The Caltech team maintains an HDFS distributed file system that handles large data sets on commodity hardware and can scale a single Apache Hadoop cluster to hundreds (or even thousands) of nodes. The Hadoop storage is spread across compute nodes and JBODs; currently there are more than 240 data nodes, with between 3 and 60 disks per node. Going forward, we continue to separate storage from compute nodes for higher availability. Overall, the Tier2 maintains more than 8 PB of raw disk space with the replication factor set to 2, i.e. roughly 4 PB of usable space. Future upgrades are planned to use erasure coding and increase the usable disk space.
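As a quick illustration of the raw-versus-usable relationship, and of how capacity can be checked from a node with the Hadoop client installed, a small sketch follows; the only assumption is that the standard hdfs command-line tool is on the PATH.

# Minimal sketch: report filesystem-wide HDFS capacity and relate raw space to
# usable space under a replication factor of 2 (more than 8 PB raw -> roughly 4 PB usable).
import subprocess

report = subprocess.run(
    ["hdfs", "dfs", "-df", "-h", "/"],   # capacity summary for the whole filesystem
    capture_output=True, text=True, check=True,
)
print(report.stdout)

replication_factor = 2
raw_capacity_pb = 8.0                    # figure quoted above for the Tier2
print(f"approximate usable capacity: {raw_capacity_pb / replication_factor:.1f} PB")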

The Caltech group has deployed Ceph as a storage system for user workloads and uses it for home directories, benefiting from a stable, high-performance parallel filesystem at the Caltech Tier2. We host the necessary centrally produced NANOAOD datasets locally, so that typical physics analyses can proceed without generating significant network traffic or relying on external resources. While the current system, based on enterprise-grade hard disks, has peaked at a throughput of up to 2 GB/s under a highly parallel analysis workload, future use of solid-state drives is expected to further boost the science capability of the system. The Ceph cluster provides 288 TB overall and consists of 6 Ceph monitors, 1 MDS, and 3 storage nodes, each with:
2 x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz with 504 GB memory
12 x 8 TB SAS disks (used for CephFS)
2 x 6.4 TB Intel SSD DC P4608 (used for block devices)
2 x Mellanox 40G NICs (in bond mode)
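The 288 TB figure follows directly from the storage-node layout above; the sketch below works that out and shows how a mounted CephFS home area could be inspected with standard POSIX calls, where the mount point is a placeholder.

# Minimal sketch: raw CephFS capacity from the node layout, plus a POSIX check of
# a mounted CephFS path. "/ceph/home" is a hypothetical mount point.
import os

storage_nodes = 3
disks_per_node = 12
disk_size_tb = 8
print("raw capacity:", storage_nodes * disks_per_node * disk_size_tb, "TB")  # 288 TB

mount = "/ceph/home"
if os.path.ismount(mount):
    st = os.statvfs(mount)
    total_tb = st.f_frsize * st.f_blocks / 1e12
    free_tb = st.f_frsize * st.f_bavail / 1e12
    print(f"{mount}: {total_tb:.1f} TB total, {free_tb:.1f} TB free")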

The Caltech group, together with the UCSD team, has deployed a combined cache across Caltech and UCSD (SoCal), currently extended to 1.5 PB. By distributing the cache across site boundaries, we economize on the disk space needed to serve data analysis at the two CMS Tier2 sites, Caltech and UCSD. The cache currently pre-places all Run II MINIAOD and AOD datasets to speed up user analysis running at these two sites. We also see a very good cache hit/miss ratio (10/1 on average, with peaks up to 15/1).

GPU iBanks Cluster

The iBanks cluster at Caltech consists of 4 nodes. Two nodes, donated by Supermicro, each host 8 NVIDIA GTX 1080 GPUs donated by NVIDIA. A third Supermicro server hosts 6 NVIDIA Titan Xp GPUs, and a fourth Supermicro server currently hosts 2 NVIDIA Titan X GPUs. The nodes are connected with a dual 10G network and can be used in conjunction with MPI or other distributed computing mechanisms. The primary purpose of the cluster is to accelerate the training of deep neural networks, for which computations are highly parallelizable. Since 2015, several projects within Caltech and collaborating groups (such as Fermilab, LBL, CERN, and Bari) have greatly benefited from these resources. The cluster has also hosted machine learning tutorials with up to 30 students working concurrently, and the nodes were used for prototyping applications that were later deployed on the ALCF Cooley, ORNL Titan, and CSCS Piz Daint supercomputers.
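For context on how such multi-GPU nodes are typically exercised, below is a minimal sketch of data-parallel training with PyTorch DistributedDataParallel; the toy model, random data, and launch command are illustrative placeholders, not a specific iBanks workload.

# Minimal sketch: one training process per GPU, launched for example with
#   torchrun --nproc_per_node=8 train_sketch.py
# Gradients are synchronized across the GPUs by DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # torchrun provides rank/address env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(100):                                # toy loop over random data
        x = torch.randn(64, 128, device=f"cuda:{local_rank}")
        y = torch.randint(0, 2, (64,), device=f"cuda:{local_rank}")
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()                 # gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()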

SDN Development testbed

The Tier2 and the SDN testbed together host a network of multiple switches and routers from vendors including Dell, Arista, Mellanox, and Inventec, among them several 32 x 100GE SDN-capable switches, as well as server platforms with sets of NVMe SSDs, disk arrays, and 40GE and 100GE network interfaces used in the SANDIE, SENSE, and SDN NGenIA R&D projects.

The SDN testbed is augmented with network hardware loaned or donated by various industry partners. For example, for the Supercomputing 2018 conference Ciena installed a Waveserver data center interconnect to support a 200G alien wave on the dark fiber connecting the campus to the Equinix Point of Presence in Los Angeles, where Caltech peers with the CENIC, Internet2, and ESnet networks. In recent years Ciena has loaned the Caltech HEP and Network team a DWDM metro setup connecting the clusters on the Pasadena campus to the Los Angeles PoP, and Level3 has loaned a dark fiber pair between the campus and Los Angeles.