Building a supercomputer is always difficult, but creating the industry’s first exascale-class system means running into wholly unexpected problems and requires a lot of work on both hardware and software. Unfortunately, this appears to be happening with Oak Ridge National Laboratory’s Frontier supercomputer, which can barely last a day without numerous hardware failures.
ORNL’s Frontier is the industry’s first system designed to deliver up to 1.685 FP64 ExaFLOPS of peak performance, using AMD’s 64-core EPYC Trento processors, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnects at 21 MW of power. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultra-fast supercomputers.
While the Frontier supercomputer looks exceptionally good on paper, and the machine’s hardware has been delivered, hardware problems seem to keep it from coming online and becoming available to researchers who need performance of around 1 FP64 ExaFLOPS.
“We’re working through issues in hardware and making sure that we understand (what they are),” said Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), in an interview with InsideHPC. “You’re going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”
Rumors about potential hardware failures of Frontier have been floating around for quite a while now. Some said that the system experienced problems with the Slingshot interconnect, according to another InsideHPC story. In addition, others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year. Keep in mind that the X version, with a higher number of stream processors and high clocks, is only available to select customers.
Mr. Whitt did not confirm that the system is experiencing any particular issues with Instinct or Slingshot, but he stressed that the machine suffers from numerous hardware issues.
“A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we’re seeing,” the head of OLCF said. “It’s a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
Oak Ridge National Laboratory’s Frontier supercomputer is far from the only system to use HPE’s Cray EX architecture with Slingshot interconnects, AMD’s EPYC CPUs and AMD’s Instinct compute GPUs. For example, Finland’s Lumi supercomputer (Cray EX, EPYC Milan, Instinct MI250X compute GPUs) delivers 550 PetaFLOPS of peak performance and is officially ranked as the world’s third most powerful supercomputer. Perhaps the problem lies in the scale of the machine, which uses 60 million parts in total.
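To see why a part count of 60 million can lead to a system-level mean time between failures measured in hours, here is a minimal back-of-the-envelope sketch in Python. It rests on a textbook simplification that is an assumption here, not anything ORNL or HPE has stated: every part is identical, parts fail independently at a constant rate, and any single failure counts as a system failure. The function names are hypothetical and exist only for illustration.

```python
# Back-of-the-envelope reliability model (illustrative assumption, not ORNL data):
# N identical, independent parts with constant failure rates, where any single
# part failure counts as a system failure. Under that model the failure rates
# add up, so system MTBF = part MTBF / N.

HOURS_PER_YEAR = 24 * 365
N_PARTS = 60_000_000  # part count cited in the article


def system_mtbf_hours(part_mtbf_hours: float, n_parts: int) -> float:
    """System MTBF for n_parts identical parts in series."""
    if part_mtbf_hours <= 0 or n_parts <= 0:
        raise ValueError("MTBF and part count must be positive")
    return part_mtbf_hours / n_parts


def required_part_mtbf_hours(target_system_mtbf_hours: float, n_parts: int) -> float:
    """Per-part MTBF needed to reach a target system MTBF."""
    if target_system_mtbf_hours <= 0 or n_parts <= 0:
        raise ValueError("MTBF and part count must be positive")
    return target_system_mtbf_hours * n_parts


if __name__ == "__main__":
    # How reliable would each part have to be for the whole machine
    # to run 1, 6, or 24 hours between failures on average?
    for target_hours in (1, 6, 24):
        part_hours = required_part_mtbf_hours(target_hours, N_PARTS)
        part_years = part_hours / HOURS_PER_YEAR
        print(f"system MTBF {target_hours:>2} h -> part MTBF ~{part_years:,.0f} years")
```

Under these simplified assumptions, even a one-hour system MTBF would require an average per-part MTBF of roughly 6,800 years. Real systems mix very different part types and tolerate many failures without going down, so the numbers are only indicative of how quickly failure rates add up at this scale.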
Only time will tell whether the Frontier supercomputer, which was initially promised to come online in 2022, will be available to researchers starting in 2023, given that it has still not been officially deployed.