At Hot Chips, Chinese startup Biren emerged from stealth, detailing a big, general-purpose GPU (GPGPU) chip intended for AI training and inference in the data center. The BR100 consists of two identical compute chiplets, built on TSMC 7 nm at 537 mm² each, plus four stacks of HBM2e in a CoWoS package.
“We were determined to build bigger chips, so we had to be creative with packaging to make BR100’s design economically viable,” said Biren CEO Lingjie Xu. “BR100’s value can be measured by better architectural efficiency in terms of performance per watt and performance per square millimeter.”
The BR100 can achieve 2 POPS of INT8 performance, 1 PFLOPS of BF16, or 256 TFLOPS of FP32. This doubles to 512 TFLOPS of 32-bit performance when using Biren’s new TF32+ number format. The GPU also supports other 16- and 32-bit formats but not 64-bit (64-bit is not widely used for AI workloads outside of scientific computing).
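Those headline figures follow the familiar pattern of peak throughput doubling each time the data width (or multiplier precision) halves. A quick sanity check in Python; note that reading “2 POPS” and “1 PFLOPS” as the power-of-two values 2048 and 1024 is our assumption about how the round numbers were derived:

```python
# Peak-throughput figures from Biren's disclosure. Treating "2 POPS" as
# 2048 TOPS and "1 PFLOPS" as 1024 TFLOPS is our assumption, not Biren's.
peak_tops = {
    "INT8": 2048,
    "BF16": 1024,
    "TF32+": 512,
    "FP32": 256,
}

for fmt, tops in peak_tops.items():
    print(f"{fmt:>5}: {tops:4d} T(FL)OPS = {tops // peak_tops['FP32']}x FP32")
```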
Using chiplets for the design meant Biren could break the reticle limit but retain the yield advantages that come with smaller dies, reducing cost. Xu said that compared with a hypothetical reticle-sized design based on the same GPU architecture, the two-chiplet BR100 achieves 30% more performance (it is 25% larger in compute die area) and 20% better yield.
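A simple Poisson defect model shows why two 537-mm² dies can out-yield one reticle-sized die. The sketch below is illustrative only: the defect density `D0` is an arbitrary assumption, not a Biren or TSMC figure, and the ~859-mm² monolithic area is back-calculated from the “25% larger” comparison above.

```python
import math

def poisson_yield(area_mm2: float, d0_per_mm2: float) -> float:
    """Fraction of good dies under a simple Poisson defect model."""
    return math.exp(-area_mm2 * d0_per_mm2)

D0 = 0.0006  # defects per mm^2 -- an arbitrary, illustrative assumption

chiplet = poisson_yield(537, D0)                # one BR100 compute chiplet
monolithic = poisson_yield(537 * 2 / 1.25, D0)  # hypothetical ~859 mm^2 die

print(f"chiplet yield:    {chiplet:.1%}")                   # ~72.5%
print(f"monolithic yield: {monolithic:.1%}")                # ~59.7%
print(f"advantage:        {chiplet / monolithic - 1:.0%}")  # ~21%
```

With this (arbitrary) choice of D0 the per-die advantage lands near the 20% Biren quotes, though real yield comparisons also depend on defect clustering and how partially good dies are salvaged.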
Another advantage of the chiplet design is that the same tapeout can be used to make multiple products. Biren also has the single-chiplet BR104 on its roadmap.
The BR100 will come in OCP accelerator module (OAM) format, while the BR104 will come on PCIe cards. Together, 8 × BR100 OAM modules will form “the most powerful GPGPU server in the world, purpose-built for AI,” said Xu. The company is also working with OEMs and ODMs.
Petaflops-capable
High-speed serial links between the chiplets provide 896-GB/s bidirectional bandwidth, which allows the two compute tiles to operate like a single SoC, said Biren CTO Mike Hong.
Beyond its GPU architecture, Biren has also developed a dedicated 412-GB/s chip-to-chip (BR100 to BR100) interconnect called BLink, with eight BLink ports per chip. This is used to connect to other BR100s in a server node.
Each compute tile has 16 streaming processor clusters (SPCs), connected by a 2D mesh-like network on chip (NoC). The NoC has multi-tasking capability for data-parallel or model-parallel operation.
Each SPC has 16 execution units (EUs), which can be grouped into compute units (CUs) of four, eight, or 16 EUs.
Each EU has 16 streaming processing cores (V-cores) and one tensor core (T-core). The V-cores are general-purpose SIMT processors with a full ISA for general-purpose computing; they handle data preprocessing and operations like batch norm and ReLU, and manage the T-core. The T-core accelerates matrix multiplication and addition, plus convolution; these operations make up the bulk of a typical deep-learning workload.
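Multiplying out that hierarchy gives the per-package totals. The per-level counts are from the talk; the multiplication is ours:

```python
# Compute hierarchy as described: package -> tile -> SPC -> EU -> cores.
TILES_PER_PACKAGE = 2   # compute chiplets per BR100
SPCS_PER_TILE = 16      # streaming processor clusters
EUS_PER_SPC = 16        # execution units
VCORES_PER_EU = 16      # general-purpose SIMT V-cores
TCORES_PER_EU = 1       # tensor core

eus = TILES_PER_PACKAGE * SPCS_PER_TILE * EUS_PER_SPC
print(f"EUs per BR100:     {eus}")                  # 512
print(f"V-cores per BR100: {eus * VCORES_PER_EU}")  # 8192
print(f"T-cores per BR100: {eus * TCORES_PER_EU}")  # 512
```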
Biren has also invented its own number format, E8M15, which it calls TF32+. The format is intended for AI training; it has the same-sized exponent (same dynamic range) as Nvidia’s TF32 format but with five extra bits of mantissa (in other words, it is five bits more precise). This means the BF16 multiplier can be reused for TF32+, simplifying the design of the T-core.
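E8M15 is 24 bits wide (1 sign + 8 exponent + 15 mantissa), sitting between Nvidia’s 19-bit TF32 (E8M10) and full FP32 (E8M23). As with TF32, a TF32+ value can be derived from an FP32 value by shortening the mantissa. A minimal sketch of that conversion; simple truncation is our assumption, as Biren has not published a rounding mode:

```python
import struct

def to_tf32plus(x: float) -> float:
    """Reduce an FP32 value to E8M15 (TF32+) precision.

    FP32 is 1 sign + 8 exponent + 23 mantissa bits; keeping the top 15
    mantissa bits means zeroing the low 8. Truncation (round-toward-zero)
    is a simplifying assumption -- Biren has not detailed a rounding mode.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0xFF  # clear the 8 lowest mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_tf32plus(3.14159265))  # 3.1414794921875 -- 15 mantissa bits kept
```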
Xu said the company has already submitted results to the next round of MLPerf inference scores, which should be available in the next few weeks.