AI Chips: GPU, TPU, and NPU

GPU, TPU, and FPGA

AI models like deep learning are compute-intensive. The right architecture is needed for AI and a high quantity of cores is required to process computations at scale. More specifically, AI hardware must be able to perform thousands of multiplications and additions in a mathematical process called matrix multiplication. Instead of describing what is matrix multiplications, it’s better to show it.

Matrix Multiplication

source: Marica Levitus at AZU

Matrix multiplication plays a big part in neural network computations, especially if there are many layers and nodes. In the illustration below, there are six layers, 25 nodes, and thousands of connections between the nodes, by the looks of it. This would be a considered a compute-heavy model.

source: Phys.org

Matrix multiplication is the operation that cofounds the CPU. CPU’s were designed as general-purpose chips that can run any application on the desktop or server. At the most basic level, the CPU is able to process one task at a time. The GPU makes up for the deficiency in CPU architecture by adding thousands of CUDA cores and hundreds of Tensor cores, depending on the card, that can process thousands of tasks in parallel.

However, for some like Google, the GPU is still too general-purpose to run AI workloads efficiently. Thus, Google has developed its own AI hardware known as the TPU. Google describes the TPU as a “custom ASIC chip-designed from the ground up for machine learning workloads” and it currently runs many of its services like Gmail and Translate.

Google explains the reason they decided to create the TPU. In short, the CPU architecture is based on the von Neumann architecture. Thus, “every single calculation” of the CPU is stored in L1 cache (memory). This creates the “von Neumann bottleneck” when memory must be accessed. The ALU (arithmetic logic units) is a circuit that controls memory access and it executes each transaction “one by one”. The graphic below shows the relationship between the CPU, ALU, and memory.

von Neumann Architecture

source: Wikipedia

The GPU solves this problem by throwing thousands of ALU’s and cores at the problem. However, even though GPUs process thousands of tasks in parallel, the von Neumann bottleneck is still present – one transaction at a time per ALU. Google solved the bottleneck problem inherent in GPU’s by creating a new architecture called systolic array. In this setup, ALU’s are connected to each other in a matrix. They call this arrangement the Matrix Processor. In one particular configuration, there are 32,768 ALU’s.

Not to be outdone, Microsoft has also designed its own AI architecture based on FPGA (field-programmable gate arrays). They call their hardware the Neural Processing Unit (NPU) and it powers Azure and Bing. Although FPGA has been around for decades, Microsoft has figured out a way to scale the architecture for deep learning workloads. Their deep learning platform is called Project Brainwave. The team behind Project Brainwave said the flexible architecture of FPGA allows them to write “algorithms directly on the hardware” thus bypassing “less efficient software as the middle man”. The best part, the architecture allows them to reprogram the hardware at “a moment’s notice” if changes are required due to advances in technology.

AI Chips: GPU, TPU, and NPU

Categories

GPU, TPU, and FPGA