Our main cluster is operated 24×7, supporting both remote data operations for grid analysis and production computation for the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider. The cluster is orchestrated with the latest version of HTCondor and hosts more than 8 PB of raw disk space in HDFS.
The deployed CPUs provide more than 10,000 job slots with an aggregate computing power of more than 100,000 HS06 units. Ten load-balanced GridFTP transfer nodes with 10 Gbit/s network interfaces serve physics datasets to users or exchange them with other grid sites. All of the data transfer nodes also run the XRootD framework, providing grid users with on-demand access to the physics datasets hosted at our site.
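As an illustration of how users consume these batch slots, a minimal HTCondor submit description might look like the sketch below; the executable name, arguments, and resource requests are placeholders, not a prescribed site configuration.

```
# Minimal HTCondor submit description (illustrative placeholders only)
universe       = vanilla
executable     = analyze.sh              # user's analysis wrapper script
arguments      = input_list.txt
request_cpus   = 1
request_memory = 2560                    # MB, roughly the per-slot memory provisioned on our nodes
output         = job.$(Cluster).$(Process).out
error          = job.$(Cluster).$(Process).err
log            = job.$(Cluster).log
queue 1
```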
Available computing resources
The entire Tier2 cluster runs the latest CentOS 7, and on the grid resources users and jobs are allowed to run Docker and Singularity images. Images can be distributed either on local disk or via the CernVM File System (CVMFS), which is available on all compute nodes. The available computing resources are described in the table below; a brief container usage sketch follows the table.
| Manufacturer | Processors per node | Batch slots per node | Available memory per node (MB) | Number of nodes | Total batch slots |
| --- | --- | --- | --- | --- | --- |
| SuperMicro | 2x Xeon E5630 2.53GHz | 16 | 40960 | 19 | 304 |
| SuperMicro | 2x Xeon L5520 2.27GHz | 16 | 40960 | 16 | 256 |
| SuperMicro | 2x Xeon L5630 2.53GHz | 16 | 40960 | 10 | 160 |
| SuperMicro | 2x Xeon L5640 2.26GHz | 24 | 61440 | 32 | 768 |
| SuperMicro | 2x AMD Opteron 6378 | 32 | 81920 | 43 | 1376 |
| SuperMicro | 2x Xeon E5 2670 2.6GHz | 32 | 81920 | 22 | 704 |
| SuperMicro | 2x Xeon E5 2670 V2 2.60GHz | 40 | 102400 | 22 | 880 |
| SuperMicro | 2x Xeon E5 2670 V3 2.30GHz | 48 | 122880 | 8 | 384 |
| SuperMicro | 2x Xeon E5 2660 2.2GHz | 32 | 81920 | 11 | 352 |
| SuperMicro | 2x Xeon E5 2650 2GHz | 32 | 81920 | 3 | 96 |
| SuperMicro | 2x Xeon E5 2650 V3 2.30GHz | 40 | 102400 | 31 | 1240 |
| SuperMicro | 2x Xeon E5 2650 V4 2.2GHz | 48 | 122880 | 32 | 1536 |
| SuperMicro | 2x Xeon E5 2687W V4 3.0GHz | 48 | 122880 | 9 | 432 |
| SuperMicro | 2x Xeon Gold 6138 2.00GHz | 80 | 204800 | 8 | 640 |
| SuperMicro | 2x Xeon Gold 6142 2.60GHz | 64 | 163840 | 1 | 64 |
| SuperMicro | 2x Xeon Gold 6152 2.10GHz | 88 | 225280 | 1 | 88 |
| SuperMicro | 2x Xeon Gold 6154 3.00GHz | 72 | 184320 | 2 | 144 |
| SuperMicro | 4x Xeon Gold 6230 2.10GHz | 160 | 409600 | 4 | 640 |
| Total | — | 888 | 2273280 | 274 | 10064 |
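As mentioned above, container images can be distributed via CVMFS and executed with Singularity on any compute node. A minimal usage sketch is shown below; the image path is illustrative and not a specific repository hosted at our site.

```
# Run a payload inside a Singularity image distributed via CVMFS
# (illustrative image path; the actual repositories in use vary)
singularity exec /cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/centos:centos7 \
    cat /etc/redhat-release
```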
Networking
The Caltech HEP and Network team also has its own Autonomous System (AS32361) and manages four Class C IPv4 networks and one IPv6 /48 network to support its leading-edge network developments, carried out with many science and computer-science research teams and R&E network partners in the US, Europe, Latin America, and Asia. All of the managed networks are listed below:
| Network | Main ASN Owner |
| --- | --- |
| 192.84.86.0/24 | AS32361 Ultralight |
| 198.32.43.0/24 | AS32361 Ultralight |
| 198.32.44.0/24 | AS32361 Ultralight |
| 198.32.45.0/24 | AS32361 Ultralight |
| 131.215.207.0/24 | AS31 California Institute of Technology |
| 2607:f380:a4f::/48 | AS2152 California State University, Office of the Chancellor |
Available Storage
The Caltech team maintains an HDFS distributed file system that handles large data sets on commodity hardware and is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. The Hadoop storage is spread across compute nodes and JBODs; there are currently more than 240 nodes, each with between 3 and 60 disks. Going forward, we continue to separate storage from compute nodes for higher availability. Overall, the Tier2 maintains more than 8 PB of raw disk space, with the replication factor set to 2. Future upgrades are planned to use erasure coding and increase the usable disk space.
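Routine capacity and replication checks on this HDFS instance can be performed with standard Hadoop tooling, for example as in the sketch below (the dataset paths are placeholders):

```
# Report overall raw and used capacity of the HDFS instance
hdfs dfsadmin -report | head -n 20

# Inspect block placement and replication for a dataset directory (placeholder path)
hdfs fsck /store/user/example -files -blocks | tail -n 20

# Set the replication factor of a directory to the site default of 2 (placeholder path)
hdfs dfs -setrep -w 2 /store/user/example
```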
The Caltech group has deployed Ceph as a storage system for user workloads and uses it for home directories, benefiting from a stable, high-performance parallel filesystem at the Caltech Tier2. We host the necessary centrally produced NANOAOD datasets locally, so that typical physics analyses can proceed without generating significant network traffic or relying on external resources. While the current system, based on enterprise-grade hard disks, has peaked at a throughput of up to 2 GB/s under a highly parallel analysis workload, future use of solid-state drives is expected to further boost the science capability of the system. The Ceph cluster provides 288 TB in total and consists of 6 Ceph monitors, 1 MDS, and 3 storage nodes, each with the hardware listed below (a brief CephFS usage sketch follows the list):
2 x Intel Xeon E5-2650 v3 CPUs @ 2.30 GHz with 504 GB of memory
12 x 8 TB SAS disks (used for CephFS)
2 x 6.4 TB Intel SSD DC P4608 (used for block devices)
2 x Mellanox 40G NICs (in bond mode)
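For completeness, the sketch below shows how a CephFS area of this kind is typically inspected and mounted on a client node; the monitor hostnames, client name, and mount point are placeholders rather than our actual configuration.

```
# Overall cluster health and capacity (run where an appropriate keyring is available)
ceph -s
ceph df

# Mount the CephFS filesystem on a client (placeholder monitors and credentials)
mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs \
    -o name=homeuser,secretfile=/etc/ceph/homeuser.secret
```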
The Caltech group, together with the UCSD team, has deployed a combined (SoCal) cache spanning Caltech and UCSD, which has currently been extended to 1.5 PB. By distributing the cache across site boundaries, we economize on the amount of disk space used and needed to serve data analysis at the two CMS Tier2 sites, Caltech and UCSD. The cache currently pre-places all Run II MINIAOD and AOD datasets to speed up user analyses running at these two sites, and we observe a very good cache hit/miss ratio (10/1 on average, with peaks of up to 15/1).
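Assuming the cache is reachable through an XRootD endpoint, jobs read data through it transparently; a manual read might look like the sketch below, where the redirector hostname and dataset path are placeholders.

```
# Read a CMS file through the regional SoCal cache rather than the remote origin
# (placeholder redirector hostname and dataset path)
xrdcp root://socal-cache.example.edu//store/data/Run2017B/MINIAOD/example.root /tmp/example.root
```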
GPU iBanks Cluster
The iBanks cluster at Caltech consists of four nodes. Two nodes, donated by Supermicro, each host 8 NVIDIA GTX 1080 GPUs donated by NVIDIA. A third Supermicro server hosts 6 NVIDIA Titan Xp GPUs, and a fourth Supermicro server currently hosts 2 NVIDIA Titan X GPUs. The nodes are connected with a dual 10G network and can be used in conjunction with MPI or other distributed computing mechanisms. The primary purpose of the cluster is to accelerate the training of deep neural networks, whose computations are highly parallelizable. Since 2015, several projects within Caltech and collaborating groups (such as Fermilab, LBL, CERN, and Bari) have greatly benefited from these resources. The cluster has also hosted machine learning tutorials with up to 30 students working concurrently, and the nodes have been used for prototyping applications that were later deployed on the ALCF Cooley, ORNL Titan, and CSCS Piz Daint supercomputers.
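As an illustration of the kind of workload the cluster serves, a minimal data-parallel training loop is sketched below; it uses PyTorch and synthetic data purely as an example, neither of which is mandated by the cluster.

```python
# Minimal multi-GPU training sketch (synthetic data, placeholder model)
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder classifier standing in for a real physics DNN
model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 2))
if torch.cuda.device_count() > 1:
    # Spread each batch across all GPUs visible on the node (e.g. 8x GTX 1080)
    model = nn.DataParallel(model)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(512, 100, device=device)         # synthetic features
    y = torch.randint(0, 2, (512,), device=device)   # synthetic labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```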
SDN Development testbed
The Tier2 and SDN testbed together host a network of multiple switches and routers from vendors including Dell, Arista, Mellanox, and Inventec, among them several 32 x 100GE SDN-capable switches, as well as server platforms with sets of NVMe SSDs, disk arrays, and 40GE and 100GE network interfaces used in the SANDIE, SENSE, and SDN NGenIA R&D projects.
These developments build on the team's own Autonomous System and the address space described in the Networking section above.
The SDN testbed is augmented with network hardware loaned or donated by various industry partners. For example, for the Supercomputing 2018 conference, Ciena installed a Waveserver data center interconnect to support a 200G alien wave on the dark fiber connecting the campus to the Equinix Point of Presence in Los Angeles, where Caltech peers with the CENIC, Internet2, and ESnet networks. In recent years, Ciena has loaned the Caltech HEP and Network team a DWDM metro setup connecting the clusters on the Pasadena campus to the Los Angeles PoP, and Level3 has loaned a dark fiber pair between the campus and Los Angeles.