Intel has detailed the company’s Ponte Vecchio Xe-HPC GPU at Hot Chips 34. In the provided benchmarks, the chipmaker claims that Ponte Vecchio delivers up to 2.5x more performance than the Nvidia A100. However, as usual, take vendor-provided benchmarks with a pinch of salt.
Ponte Vecchio outperformed the A100 by significant margins in several Intel-selected benchmarks. Intel’s powerhouse also flaunted a 2x lead in miniBUDE and a 1.5x lead in ExaSMR. It’s an interesting comparison considering that Ponte Vecchio isn’t even out yet, and the A100 (Ampere) has been on the market since 2020. And let’s not forget that AMD’s Instinct MI250X (Aldebaran) is reportedly three times faster than the A100. So Intel should worry about AMD’s and Nvidia’s next-generation HPC products.
If Intel’s numbers are accurate, Ponte Vecchio could be a potential competitor to Nvidia’s next-generation H100 (Hopper). Based on the specifications we have so far, the H100 should be at least twice as fast as the A100. What’s even more menacing is AMD’s Instinct MI300, which fuses both Zen 4 CPU and CDNA 3 GPU chiplets into a single product. AMD dubs it the world’s first data center APU and claims that the Instinct MI300 represents an 8x uplift in AI training performance compared to the Instinct MI250X.
Ponte Vecchio will come in three flavors: OAM, an x4 subsystem with Xe Links, and an x4 subsystem with Xe Links on a dual-socket Sapphire Rapids platform. Unfortunately, Sapphire Rapids has suffered so many delays that it isn’t funny anymore. Barring further setbacks, some Sapphire Rapids products could finally debut in October. However, the high-volume chips may not arrive until February 2023.
In its OAM form factor, Ponte Vecchio supports both four-GPU and eight-GPU platforms. A two-stack Ponte Vecchio configuration pumps out 52 TFLOPS of both FP32 and FP64 performance. For comparison, a single H100 SXM5 module peaks at 60 TFLOPS of FP32 and 30 TFLOPS of FP64 performance.
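To put those peak figures side by side, a quick back-of-the-envelope sketch can compute the ratios. It uses only the peak numbers quoted above. The dictionary and function names are illustrative, and peak TFLOPS say nothing about delivered application performance.

```python
# Peak throughput figures quoted in the article, in TFLOPS.
# The names below are illustrative and do not come from Intel or Nvidia.
PEAK_TFLOPS = {
    "Ponte Vecchio (two-stack)": {"FP32": 52.0, "FP64": 52.0},
    "H100 SXM5": {"FP32": 60.0, "FP64": 30.0},
}


def peak_ratio(a, b, precision):
    """Return the peak throughput of `a` divided by that of `b`."""
    return PEAK_TFLOPS[a][precision] / PEAK_TFLOPS[b][precision]


if __name__ == "__main__":
    for precision in ("FP32", "FP64"):
        ratio = peak_ratio("Ponte Vecchio (two-stack)", "H100 SXM5", precision)
        print(f"{precision}: Ponte Vecchio / H100 = {ratio:.2f}x")
```

On paper, that puts Ponte Vecchio at roughly 0.87x the H100’s peak FP32 rate but about 1.73x its peak FP64 rate.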
Ponte Vecchio features a 64MB register file that delivers up to 419 TBps of bandwidth. The L1 and L2 caches are 64MB and 408MB, respectively. The large L2 cache on Ponte Vecchio benefits specific workloads, such as the 2D-FFT and DNN cases. In the presentation, Intel’s results show a substantial performance improvement when going from 80MB to 408MB in both scenarios.