When Tachyum unveiled the concept of its Prodigy Universal Processor at Hot Chips 2018, it made quite a splash with a chip designed to run any code using a dynamic binary translator while delivering high performance on both native and translated code. It took the company a while to design the actual hardware, but now that it is taking pre-orders on evaluation kits, it has also disclosed the exact specifications of its Prodigy. They certainly look impressive, but they are also scary, with a 950W thermal design power per chip.
Formidable Performance at Formidable Power
Each Tachyum Prodigy processor has up to 128 proprietary cores mated with 16 DDR5 memory channels (for a 1,024-bit interface) supporting data transfer rates of up to 7200 MT/s (and therefore providing up to 921.6 GBps of bandwidth), as well as 64 PCIe 5.0 lanes. In addition, the chip supports up to 8TB of DDR5 memory in total, which is in line with what we will see with upcoming server CPUs from other makers. As for clock rates, Tachyum’s Prodigy is designed to run at up to 5.7 GHz and is made using TSMC’s performance-optimized N5P process technology.
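The bandwidth figure follows from the interface width and the transfer rate. Here is a minimal Python sketch of that arithmetic, assuming each of the 16 channels is 64 bits wide (inferred from the 1,024-bit total); the function name is illustrative only.

```python
def peak_memory_bandwidth_gbps(channels: int, bits_per_channel: int,
                               mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth in decimal GB/s."""
    # Full interface width in bytes moved per transfer
    bytes_per_transfer = channels * bits_per_channel // 8
    # MT/s -> transfers per second, times bytes per transfer
    bytes_per_second = bytes_per_transfer * mega_transfers_per_s * 1_000_000
    return bytes_per_second / 1e9

# 16 channels x 64 bits (assumed split of the 1,024-bit interface) at 7200 MT/s
prodigy_bandwidth = peak_memory_bandwidth_gbps(16, 64, 7200)
print(f"{prodigy_bandwidth:.1f} GB/s")  # prints "921.6 GB/s"
```

This is a theoretical ceiling; sustained bandwidth on real hardware is normally lower.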
When it comes to performance, Tachyum expects its flagship Prodigy T16128-AIX processor to offer up to 90 FP64 TFLOPS for HPC as well as up to 12 ‘AI PetaFLOPS’ for inference and training, presumably when running native code and consuming up to 950W (and using liquid cooling), according to specifications published by the company and by Golem.de. Meanwhile, Tachyum’s Prodigy processors can work in 2-way and 4-way configurations. To put the numbers into context, AMD’s Instinct MI250X has a peak FP64 matrix throughput of about 96 TFLOPS for HPC at about 560W. In contrast, Nvidia’s H100 SXM5 can provide up to about 2 INT8/FP8 PetaOPS/PetaFLOPS for AI (up to about 4 PetaOPS/PetaFLOPS with sparsity) at 700W. Yet neither compute GPU can run general-purpose workloads. And this is exactly where it gets interesting.
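To make the FP64 comparison concrete, the quoted peak figures can be normalized by power. The sketch below is only a rough illustration: it uses the vendor peak numbers cited above (90 TFLOPS at 950W and 96 TFLOPS at about 560W), not measured results, and the helper name is hypothetical.

```python
def gflops_per_watt(peak_tflops: float, power_watts: float) -> float:
    """Peak FP64 GFLOPS delivered per watt of rated power."""
    return peak_tflops * 1000 / power_watts

# (peak FP64 TFLOPS, rated power in watts), as quoted in the text
chips = {
    "Prodigy T16128-AIX": (90, 950),
    "Instinct MI250X": (96, 560),
}

efficiency = {name: gflops_per_watt(tflops, watts)
              for name, (tflops, watts) in chips.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.1f} GFLOPS/W")
```

On these paper numbers the MI250X comes out at roughly 171 GFLOPS/W against roughly 95 GFLOPS/W for Prodigy, although the GPU figure applies only to accelerator-style workloads.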
A New CPU Is Born
Tachyum’s Prodigy is a universal homogeneous processor packing up to 128 proprietary 64-bit VLIW cores, each featuring two 1,024-bit vector units and one 4,096-bit matrix unit. In addition, each core has a 64KB instruction cache, a 64KB data cache, and a 1MB L2 cache, and can use the unused L2 caches of other cores as a victim L3 cache.
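These per-core figures allow a back-of-the-envelope estimate of chip-wide resources. The sketch below is not Tachyum's own methodology. It assumes every 64-bit vector lane retires one fused multiply-add (2 FLOPs) per cycle and that all cores sustain the 5.7 GHz maximum clock. It leaves out the matrix units, because no per-cycle throughput has been published for them.

```python
CORES = 128
VECTOR_UNITS_PER_CORE = 2
VECTOR_WIDTH_BITS = 1024
L2_PER_CORE_MB = 1
CLOCK_HZ = 5_700_000_000  # 5.7 GHz maximum clock (assumed sustained on all cores)

def vector_fp64_peak_tflops(flops_per_lane_per_cycle: int = 2) -> float:
    """Peak FP64 TFLOPS from the vector units alone.

    flops_per_lane_per_cycle=2 assumes one fused multiply-add per lane
    per cycle; this is an assumption, not a published Prodigy figure.
    """
    lanes = VECTOR_WIDTH_BITS // 64  # FP64 lanes per vector unit
    flops_per_cycle = (CORES * VECTOR_UNITS_PER_CORE * lanes
                       * flops_per_lane_per_cycle)
    return flops_per_cycle * CLOCK_HZ / 1e12

total_l2_mb = CORES * L2_PER_CORE_MB  # aggregate L2 usable as victim L3

print(f"Vector-only FP64 peak: {vector_fp64_peak_tflops():.1f} TFLOPS")
print(f"Aggregate L2 capacity: {total_l2_mb} MB")
```

Under these assumptions the vector units alone come to about 46.7 TFLOPS, so the 90 TFLOPS headline figure presumably also counts the matrix units. The published specifications do not break this down.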
Tachyum’s VLIW cores are in-order cores, but when compilers apply the proper optimizations, they can support 4-way out-of-order issue, according to Radoslav Danilak, chief executive and co-founder of Tachyum, who spoke with Golem.de. He also re-emphasized that the Prodigy instruction set architecture can achieve very high instruction-level parallelism in software using so-called poison bits.
These cores run native code written and explicitly optimized for Prodigy (where the VLIW architecture promises to shine) as well as x86, Arm, and RISC-V binaries using software emulation and without performance degradation, according to the company. Historically, all attempts to make VLIW processors execute x86 code have failed (e.g., Transmeta’s Crusoe, Intel’s Itanium), primarily because of the particular CPU architectures and emulation inefficiencies. The head of Tachyum admits that Qemu binary translation degrades performance by 30% to 40% (without disclosing any baselines) but hopes that real-world performance will still be high enough to be competitive. Meanwhile, some programs are already supported natively.
“We support GCC and Linux natively, and FreeBSD now also runs [on Prodigy],” said Danilak. “Apache, MongoDB or Python already run natively, PyTorch and TensorFlow frameworks are also available.”
Tachyum stresses that Prodigy is not an accelerator but an actual CPU that can compete against AMD, Intel, and others. To ensure that the processor can deliver competitive performance across general-purpose and AI workloads, the company has made numerous alterations to its design since its first introduction in 2018.
“We are a CPU replacement and not an AI accelerator company, we are targeting cloud/hyperscalers and telcos,” said Danilak. “Over time we plan to win some supercomputer customers, so we doubled the width of the vector/MAC units from 512 bits to 1,024 bits [which also brings in necessary data paths for the 4,096-bit matrix operations for artificial intelligence].”
Indeed, one particular advantage that Tachyum’s Prodigy promises is its ability to execute different kinds of code. Assuming that it can provide decent performance at decent power while executing general-purpose workloads (instances), it could give some additional flexibility to AWS, Microsoft Azure, and the like, since they would be able to use the same machines for AI, HPC, and general-purpose instances if needed. It will, of course, require some actual software work from different parties, but this might work, at least in theory.
Still Not Here
It should be noted that Tachyum still does not have any Prodigy silicon. Consequently, all performance projections are a product of simulations, and the only thing the company has now is an FPGA prototype of its processor.
Meanwhile, the company recently began to take pre-orders on its Prodigy Evaluation Platform, which will use Prodigy silicon. Companies must place orders before July 31, 2022, and delivery of actual hardware is expected around ‘six to nine months after receipt of order.’
Tachyum expects to tape out the first Prodigy silicon (which is expected to be smaller than 500 mm^2) in mid-August if everything goes as planned. After that, the company expects to get the first samples of its chip around December, and if the chip works correctly, the company plans to start sampling (i.e., ship out evaluation kits). Typically, silicon bring-up takes about a year after the initial chip returns from the fab. Still, Tachyum hopes its first processor will work as planned and that it will be able to kick off actual mass production in the first half of 2023.
In the future, Danilak envisions a Prodigy 2 processor made using one of TSMC’s N3 nodes that will deliver twice the performance at the same power along with PCIe Gen6 support.