Neural Network and AI Bottlenecks
This is a relatively high-level and simplified view of where the limitations lie in the modern Deep Learning (DL) based AI stack. Compute can be thought of as composed of arithmetic operations and memory accesses. In recent times, arithmetic throughput has increased significantly, but memory access throughput has lagged behind. This divergence has resulted in low utilization of hardware resources at the application level.
DL applications are unique in that they require huge amounts of data and use relatively simple matrix or linear algebra that requires little control flow. These characteristics make GPUs, with their thousands of cores, ideal for DL parallelization.
Training a Deep Learning application requires repeating the same computation, the back-propagation algorithm, over many iterations that each use new data from a dataset, and over many epochs that reuse the same dataset. The process slowly converges to a set of weights (matrices) that can then be used for inference. On a single GPU, that process can take months.
Another form of parallelism, Distributed Computing, also known as High-Performance Computing (HPC), is used to bring down the convergence time. A cluster of thousands of GPUs connected through a high-bandwidth network (e.g. multiple 25 GB/s NICs) is run in parallel. Every GPU runs the same code, each over a different portion of the data. After one or more iterations, the GPUs synchronize. This is called data parallelism and is the most widely used parallelization technique.
In recent times, AI models (GPT, etc.) have also grown in parameter count well beyond the memory capacity of a single GPU (~80 GB), which has resulted in dividing a large DL network into parts spread over multiple GPUs. Techniques such as model parallelism, tensor parallelism, etc. are used to divide a network.
CPUs and GPUs:
GPUs are parallel processing systems. They usually use Single Instruction Multiple Thread/Data (SIMT/SIMD). Their cores are simpler, and thousands are packed into one chip. All of the cores, or groups of them, can run the same code in parallel over different parts of the data.
CPU cores are more complex and handle complex control flow (if-then-else, branch prediction, out-of-order execution, interrupt handling, etc.) better. The Streaming Multiprocessors (SMs) in Nvidia GPUs contain control logic, so an SM and a CPU core can be considered roughly equivalent. However, unlike CPU cores, SMs issue instructions in order, with no branch prediction or speculative execution. State-of-the-art CPUs and GPUs contain similar numbers of cores and SMs, respectively. The CUDA cores are pure compute units, like the ALUs in CPUs.
CPU cores are not capable of running the same code in parallel across thousands of threads (SIMT/SIMD style) the way GPU cores are. Vector execution, where one instruction operates on more than one operand, is a way to address this limitation in CPUs by running several streams of data through one core.
The following figures show architecture block diagrams of a single AMD Zen x86 CPU core and an SM of the Nvidia H100. This particular Zen core has 4 FP units and 6 INT units; the x86 AVX/SSE vector instructions use these SIMD units. The AMD EPYC™ 9754 CPU has 128 cores. For comparison, the Nvidia H100 GPU has 16896 FP32 CUDA cores, 528 TensorCores, and 132 SMs, hence 128 CUDA cores per SM vs 4 FP units in a Zen core.
Memories:
Both CPUs and GPUs have a hierarchy of memories: L1 through n levels of caches sit between the processors and the main memory. Caches are static RAMs (SRAM). The main memory has traditionally been DDR (DRAM). Recently, the use of High Bandwidth Memory (HBM) has been gaining traction in DL/AI systems. HBMs are high-speed DRAMs made possible by the use of thousands of data pins placed close to the CPU or GPU. DDR parts have hundreds of pins and are placed further away. SRAMs are roughly 100x faster than DRAMs but take more area and consume more power, hence their use is primarily limited to on-chip caches. The following figure and table show the memory hierarchy (some numbers are educated guesses). With DDR5, ~33.5 GB/s of throughput can be achieved vs 12.5 GB/s for DDR4 (Rambus). The SRAM numbers are calculated assuming FLOPS is limited by cache memory bandwidth.
The access latency for different types of memories can be hard to find. In the case of DRAM, even a single latency number is difficult to define because of DRAM's complex operating principle. DRAM operation is very different from that of SRAM and involves many more constraints.
Measuring Compute:
Operations per second (OPS) is the number of floating point (FP32, TF32, FP64, BF16, FP16) or integer (INT8, INT4, INT3) operations that can be performed by a CPU or GPU per second. The most widely used measure is the number of single-precision (FP32) floating-point operations per second (FLOPS), although in recent years lower-precision arithmetic (BF16, FP16, INT8) has been gaining traction.
Arithmetic or Operational Intensity (OI) refers to how many arithmetic operations are executed per byte of memory accessed.
The roofline chart is a plot with OI on the horizontal axis and FLOPS on the vertical axis. The following figure shows a conceptual roofline chart. The low-OI region is where memory bandwidth is the limiting factor; the high-OI region is where the processor cores are the limiting factor. The point where the two regions meet is called the ridge point.
To fully utilize the arithmetic engines of a GPU, an application needs to perform at least the ridge-point OI number of operations per byte of data moved!
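As a rough sketch, the roofline bound is attainable throughput = min(peak compute, OI × memory bandwidth), and the ridge point is the peak compute divided by the memory bandwidth. The short Python below uses illustrative, not vendor-exact, peak numbers.

```python
# Roofline sketch: attainable throughput is capped either by the peak
# arithmetic rate or by (OI x memory bandwidth), whichever is smaller.
# The peak numbers below are illustrative placeholders, not exact GPU specs.

def attainable_ops(oi, peak_ops, mem_bw):
    """oi: ops/byte, peak_ops: ops/s, mem_bw: bytes/s."""
    return min(peak_ops, oi * mem_bw)

peak_ops = 100e12       # ~100 TOPS (illustrative)
mem_bw = 2e12           # ~2 TB/s (illustrative)

ridge_oi = peak_ops / mem_bw   # OI where the limit flips from memory to compute
print(f"ridge point: {ridge_oi:.0f} ops/byte")

for oi in [1, 10, 50, 100, 500]:
    tops = attainable_ops(oi, peak_ops, mem_bw) / 1e12
    print(f"OI={oi:>4} ops/B -> attainable ~{tops:.0f} TOPS")
```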
GPU Operational Intensity:
The NVIDIA V100, A100, and H100 are the three generations of GPUs used extensively in DL. Their significant specifications are shown in the following table.
The following figure shows how TOPS and memory bandwidth (TB/s) evolved over time from the V100 to the H100. FP32 TOPS more or less tracked memory bandwidth. With the introduction of TensorCores, INT8 performance jumped by orders of magnitude.
The following figure shows the normalized OI for FP32 and INT8 operations. The normalized FP32 memory bandwidth is obtained by dividing the TB/s by 4, as an FP32 value uses 4 bytes. Similarly, for INT8 the normalization factor is 1, as an INT8 value takes just 1 byte. Although memory bandwidth is increasing, so is the normalized OI. The application OI needed to fully utilize the H100 arithmetic engines is 2000 for INT8 and 60 for FP32; for the V100 the corresponding numbers were 62 and 15.7. The improvement in FP32 TOPS is entirely attributable to a bigger chip (which in turn is due to smaller geometry, 4 nm in H100 vs 12 nm in V100) and higher frequency: 60/15.7 = 3.82 ≈ (16896/5120) × (1.78/1.53). The INT8 acceleration is attributable to TensorCores.
From the above anecdotal example, it is apparent that arithmetic execution units are limited by memory bandwidth if application OI is not high enough.
Both modern CPUs and GPUs are built with a number of accelerators. The most well-known are the TensorCores in Nvidia processors. The A100 has 432 third-generation TensorCores. TensorCores are essentially units that accelerate matrix dot products (e.g. C = A.B) as shown below. They can compute the dot product of small tensor sub-blocks in a single instruction and hence can be several times faster than just using CUDA cores. TensorCores are primarily used through libraries like cuDNN, cuBLAS, etc. They can also be programmed directly in CUDA with the mma instructions. The recent Nvidia H100 also incorporates the Transformer Engine, which accelerates the transformer architecture used in Large Language Models.
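In practice, most applications reach the TensorCores indirectly through these libraries. The PyTorch sketch below is one hedged example: half-precision matmuls are dispatched to cuBLAS kernels that use TensorCores on recent GPUs, and the allow_tf32 flag lets FP32 matmuls take the TF32 TensorCore path. It assumes a CUDA GPU with TensorCores is present.

```python
# Sketch: using TensorCores indirectly through PyTorch/cuBLAS.
# Assumes a CUDA-capable GPU with TensorCores (Volta or newer).
import torch

if torch.cuda.is_available():
    # Allow FP32 matmuls to run on the TF32 TensorCore path (Ampere and newer).
    torch.backends.cuda.matmul.allow_tf32 = True

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

    # Half-precision matmul: cuBLAS selects TensorCore kernels for these shapes.
    c = a @ b
    torch.cuda.synchronize()
    print(c.shape, c.dtype)
```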
Divergence of Memory Bandwidth and Arithmetic Speed:
The reasons for the divergence between arithmetic and memory throughput are manifold. Some of them are as follows.
DRAM operation requires extra overhead, such as refresh cycles, that limits the throughput that can be achieved.
The physical distance between the processors and the memory is also a factor. DRAM is placed external to the CPU/GPU. Hence it is not only separated by a longer distance, but transmitting/receiving data on and off the die (a piece of silicon) also requires more circuitry. The longer the distance between two points, the lower the bit rate that can be achieved between them with the same energy. The more energy is used, the more of it is wasted as heat, which then has to be removed for the chip not to "melt". There is a well-developed branch of engineering and mathematics, Communications Theory, devoted to the issues of transmission and reception of data and information.
The fabrication process for DRAM is also different, so it is not easy to integrate DRAM into a logic chip.
HBM/DRAM cells store bits in capacitors, and hence it has been difficult to reduce their dimensions below ~10 nm. No such limit exists for the arithmetic units! Arithmetic units can continue to become faster as they encounter less capacitive effect at smaller line widths (< 10 nm) and consequently higher frequencies, whereas DRAM is limited by the required minimum capacitor size. However, arithmetic-unit miniaturization is also beginning to run into fundamental physical limitations, such as quantum effects at sub-nanometer gate lengths. The heat-dissipation limit has already caused logic chip frequencies to stagnate at around 2.5 GHz.
Nature of Deep Learning Arithmetic:
Deep Learning applications pass input data (tensors) through a number of layers. A layer is usually a tensor dot product followed by a non-linear function called the activation function. The activation function is applied to each tensor element individually and independently; functions that work on each element this way are called element-wise. The dot-product part of a layer is usually a matmul or a convolution. Examples of element-wise (and otherwise memory-bound) functions are add, mul, ReLU, sigmoid, Batch Norm, Layer Norm, etc.
Matmuls and convolutions at proper sizes have high OI and hence can be arithmetic bound. This can be understood by considering the dot product of two NxN matrices (matrices are 2-dimensional tensors), C = A.B. Very simply, a column of B can be read from memory into the GPU and stored in cache, where it can then be reused to compute all the elements of a column of C. As N increases, there is proportionately more reuse of the column of B, hence OI increases. To fully utilize all the CUDA cores (6912 FP32 cores in A100) and achieve the specified TOPS requires a large N (~10K+).
Element-wise operations such as add, ReLU, sigmoid, softmax, etc., on the other hand, are memory bound. These operations have low OI, as only one operation is performed per output element. Consider the element-wise multiplication of two NxN matrices, C = A * B. Here, unlike the dot-product case, each element of A and B is used only once to produce one element of C.
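A back-of-the-envelope calculation makes the contrast concrete. The sketch below counts FP32 bytes for reading A and B and writing C and ignores caching details, so the figures are only indicative.

```python
# Back-of-the-envelope OI for an NxN FP32 matmul vs an element-wise multiply.
# Bytes counted: read A, read B, write C (caching and re-reads ignored).

def matmul_oi(n, bytes_per_elem=4):
    ops = 2 * n**3                        # n^3 multiply-adds = 2*n^3 operations
    bytes_moved = 3 * n**2 * bytes_per_elem
    return ops / bytes_moved              # grows like n/6 for FP32

def elementwise_oi(n, bytes_per_elem=4):
    ops = n**2                            # one multiply per output element
    bytes_moved = 3 * n**2 * bytes_per_elem
    return ops / bytes_moved              # constant ~0.083 ops/byte for FP32

for n in [128, 1024, 8192]:
    print(f"N={n:>5}: matmul OI ~ {matmul_oi(n):7.1f} ops/B, "
          f"element-wise OI ~ {elementwise_oi(n):.3f} ops/B")
```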
Usually during the forward pass a number of intermediate tensors are generated that are required for the backward pass. These intermediate tensors are stored in the memory, hence memory bottlenecks play a big role in application throughput.
An important property of a layer (and of linear algebra) is that each element of the output tensor can be computed independently of the other elements. Hence these computations are "embarrassingly" parallel, which makes them ideal for parallel computing systems like GPUs.
In a DL model, in addition to matmuls and convolutions, a plethora of element-wise ops are required. This can be seen from the following example of a very small GPT-based Language Model (LM). The following code shows the layers in this very small GPT. Linear layers are matmuls.
The above model results in 51 different GPU kernels: gemm (matmul), layer_norm, element-wise mul, etc. Each of these kernels is also executed multiple times. Only about 25% of the end-to-end time is attributable to gemms (matmuls); the rest is used by element-wise ops, memory copies, and a multitude of software/hardware overheads.
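A kernel-level breakdown of this kind can be measured with the PyTorch profiler. The sketch below uses a small stand-in model rather than the GPT above, so the exact percentages will differ, but the method is the same.

```python
# Sketch: measuring the per-kernel time breakdown with the PyTorch profiler.
# A tiny stand-in model is used here; substitute your own network and input.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.LayerNorm(4096),
                      nn.Linear(4096, 1024)).to(device)
batch = torch.randn(64, 1024, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(batch).sum().backward()       # one forward + backward pass

# Sort kernels by total time; gemm kernels are typically only part of the total.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=20))
```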
Techniques to Improve OI:
One of the easiest techniques is increasing the batch size, which increases one dimension of the tensors. If matmuls and/or convolutions dominate the application, then OI increases.
Kernel fusion is another approach to accelerating memory-bound operations. When multiple operations can be applied to the same input, the input can be loaded once from HBM instead of once per operation, and the intermediate results can be held in cache and registers. Compilers (e.g. XLA) can automatically fuse multiple element-wise operations, as in the sketch below.
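As a hedged example, recent PyTorch versions can fuse such a chain through torch.compile, while XLA plays the same role for TensorFlow/JAX. The three element-wise ops below would otherwise each launch a separate kernel and re-read the tensor from HBM.

```python
# Sketch: fusing a chain of element-wise ops so the tensor is read from HBM once.
import torch

def bias_gelu_scale(x, bias, scale):
    # Three element-wise ops: add, GELU, multiply. Run eagerly, each op
    # reads its input from memory and writes its output back.
    return torch.nn.functional.gelu(x + bias) * scale

# torch.compile (PyTorch 2.x) asks the compiler to fuse the chain
# into a single kernel, so the large tensor is loaded only once.
fused = torch.compile(bias_gelu_scale)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
bias = torch.randn(4096, device=device)
out = fused(x, bias, 0.5)
```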
Flash Attention is a technique that trades extra compute for reduced memory traffic. Standard attention implementations with inputs Q, K, and V materialize the intermediate matrices S and P to HBM, which takes O(N²) memory. The main ideas behind Flash Attention are tiling and recomputation. Tiling splits the inputs Q, K, and V into blocks, loads them from slow HBM into fast SRAM, and then computes the attention output with respect to those blocks. By scaling the output of each block by the right normalization factor before adding them up, the algorithm obtains the correct result in the end. The backward pass normally requires the matrices S, P ∈ R^(N×N) to compute the gradients with respect to Q, K, and V. However, by storing the output O and the softmax normalization statistics (m, ℓ), the algorithm can easily recompute the attention matrices S and P in the backward pass from blocks of Q, K, and V held in SRAM.
S = QK^T ∈ R^(N×N)
P = softmax(S) ∈ R^(N×N)
O = PV ∈ R^(N×d)
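The tiling and online-softmax idea can be sketched in a few lines of NumPy. This is a simplified model of the forward pass, not the actual fused CUDA kernel; the block size and shapes are illustrative.

```python
# Sketch of the FlashAttention tiling idea: process K/V in blocks and keep
# running softmax statistics (m, l), so S and P are never fully materialized.
import numpy as np

def tiled_attention(Q, K, V, block=128):
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for i in range(0, N, block):
        q = Q[i:i+block]
        m = np.full(q.shape[0], -np.inf)      # running row max
        l = np.zeros(q.shape[0])              # running softmax denominator
        acc = np.zeros((q.shape[0], d))       # running un-normalized output
        for j in range(0, N, block):
            s = (q @ K[j:j+block].T) * scale  # one block of S at a time
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])    # one block of P at a time
            corr = np.exp(m - m_new)          # rescale earlier partial results
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ V[j:j+block]
            m = m_new
        O[i:i+block] = acc / l[:, None]
    return O

# Check against the naive O(N^2)-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert np.allclose(tiled_attention(Q, K, V), P @ V)
```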
Techniques to Improve Memory Bandwidth:
The following table summarizes the state of the art from the 3 main vendors of HBM.
AMD envisions stacked HBM, where the HBM is physically stacked on top of the GPU, reducing the physical distance between the arithmetic engines and memory. They claim to have doubled the memory bandwidth this way.
One fundamental problem for DRAM is that it relies on storing charge in capacitors to represent bits, and shrinking capacitor sizes below a certain dimension is not possible. For this reason, DRAM manufacturing has stagnated at 10 nm. Further shrinking may require breakthroughs or fundamental shifts.
Distributed or High-Performance Compute:
The next level of parallelism involves connecting many GPUs together. There are primarily 2 motivations for this. One is to reduce the time to convergence, say from 1 month on a single GPU to 45 minutes on 1,000 GPUs. The other is to handle large networks with hundreds of billions of parameters (tensors). A single GPU currently has only on the order of 100 GB of memory, and just one dense tensor of 100 billion FP32 elements requires 400 GB. A large neural network has many more than one tensor of that size.
There are usually 2 levels of hierarchy in distributed computing. Within a node, a number of GPUs are connected by high-speed links like NVLink. A node usually has 1 CPU complex with 1 to 4 CPUs; the OS views all the CPUs as one, albeit with NUMA (Non-Uniform Memory Access) constraints. The nodes are then connected by high-speed NICs (e.g. 200 Gb/s Ethernet) and switches. The commonly used network architecture is called leaf-and-spine.
The software infrastructure layer most commonly used for distributed computing is the Message Passing Interface (MPI). A number of MPI implementations have been developed over the last few decades in the context of High-Performance Computing (HPC), and they are mature technologies. There are well-developed distributed ops (also called collectives), such as allreduce, allgather, etc., that optimally perform averaging and similar operations across 100K or even millions of processes; a usage sketch follows below.
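As a hedged usage sketch, calling such a collective from Python with mpi4py looks roughly as follows; NCCL and Horovod expose analogous operations for GPUs.

```python
# Sketch: averaging a gradient tensor across ranks with an MPI allreduce.
# Run with e.g.: mpirun -np 4 python allreduce_example.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

grad = np.full(1024, float(rank))       # stand-in for a local gradient tensor
avg = np.empty_like(grad)

comm.Allreduce(grad, avg, op=MPI.SUM)   # every rank ends up with the same sum
avg /= size                             # turn the sum into an average

print(f"rank {rank}: mean gradient = {avg[0]}")
```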
The most commonly used MPI collective in Deep Learning is allreduce. Conceptually, it is the application of a tensor operation, such as averaging or summing, across N processes such that each process ends up with exactly the same result. A naive approach would be to broadcast a copy of each process's tensor to all the other processes (N*(N-1) exchanges). After the exchanges, each process applies the tensor operator (e.g. average) to all N tensors and, as a result, holds the exact same resultant tensor.
The most commonly used allreduce algorithm is called ring. The ranks are treated as a logical ring, independent of the actual physical network connectivity. The full allreduce is then composed of a reduce-scatter followed by an allgather. A 4-rank, 4-link example of these steps is shown in the following table (a rank is a process), and a simulation sketch follows below.
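A minimal NumPy simulation of the two ring phases might look like the sketch below; real implementations such as NCCL pipeline and overlap these transfers, so this only models the data movement pattern.

```python
# Sketch: simulating ring allreduce (reduce-scatter + allgather) in NumPy.
# Each "rank" is just a list entry here; in reality each chunk transfer is a
# network send to the next rank in the logical ring.
import numpy as np

def ring_allreduce(tensors):
    n = len(tensors)
    chunks = [np.array_split(t.copy(), n) for t in tensors]  # n chunks per rank

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) % n to the
    # next rank. After n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            src = (r - step) % n
            dst = (r + 1) % n
            chunks[dst][src] += chunks[r][src]

    # Phase 2: allgather. The fully reduced chunks circulate around the ring.
    for step in range(n - 1):
        for r in range(n):
            src = (r + 1 - step) % n
            dst = (r + 1) % n
            chunks[dst][src] = chunks[r][src].copy()

    return [np.concatenate(c) for c in chunks]

ranks = [np.arange(8.0) + 10 * r for r in range(4)]
out = ring_allreduce(ranks)
assert all(np.array_equal(o, sum(ranks)) for o in out)
```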
There are other allreduce algorithms such as Rabenseifner, Recursive Doubling, etc. There is a simple analytical expression for ring allreduce time (tar) as follows. A lot can be understood from this simple equation.
tar = 2(N-1)(α + βM/N + γM/N) = 2(Nα + βM + γM)(N-1)/N ~ 2(Nα + βM + γM)
In the above, N is the number of ranks, α the latency (end-to-end setup time per step), β the inverse of the network bandwidth, γ the inverse of the memory bandwidth, and M the size of the tensor in bytes. The γ term ignores the arithmetic overhead, since the reductions are simple element-wise ops and hence memory bound.
The network bandwidth (bwn) utilized on each link is as follows. The bandwidth needed becomes smaller and smaller as N increases.
bwn = (M/N)/(2(N-1)(α + βM/N + γM/N))
= 1/(2(N-1)(αN/M + β + γ))
~ 1/(2(N-1)(αN/M)) ~ M/(2αN²)
The total amount of data moved is 2(N-1)*M over 2(N-1) time steps.
For large N, the latency determines the allreduce time and the network bandwidth utilization. The higher the latency, the longer the allreduce time and the lower the network bandwidth utilization.
In the Nvidia DGX H100 system, there are 8 H100 GPUs. Each GPU is connected to the 4 NVSwitches through NVLinks providing 900 GB/s of total bandwidth per GPU. See the following figure for a conceptual block diagram of a DGX H100 system.
If we consider just one DGX H100 system (i.e. scale-up) as an example, then β (for an FP32 ring) can be estimated as follows.
β = 1/900e9 s/B ≈ 1.11e-12 s/B = 1.11 ps/B (i.e. ~4.44 ps per 4-byte FP32 element)
Similarly, γ can be estimated as follows.
γ = 1/3350e9 s/B ≈ 0.299e-12 s/B = 0.299 ps/B
Latency numbers are not easily available, as mentioned previously; however, they can be estimated by running allreduce on very small tensors/buffers. From a couple of publications (see References), the latency for a 2-node DGX A100 setup was observed to be on the order of 50 us. The scale-up-only latency would be less than that, but nowhere near picoseconds. For our purpose, if we assume that latency scales as the inverse of the bandwidth of the slowest link (the NIC for scale-out, NVLink for scale-up), we get roughly 50/(900/200) ~ 12.5 us. Hence, the latency term would dominate for tensor sizes below (12.5e-6/1.4e-12)*1e-6 = 8.9 MB for N=1, 72 MB for N=8, and so on.
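The crossover arithmetic above can be reproduced in a few lines, using the estimated (not measured) constants.

```python
# Sketch: reproducing the latency-vs-bandwidth crossover estimate above,
# using the estimated (not measured) DGX H100 constants.
alpha = 12.5e-6          # estimated scale-up latency [s]
beta = 1 / 900e9         # inverse NVLink bandwidth [s/B], ~1.11 ps/B
gamma = 1 / 3350e9       # inverse HBM bandwidth [s/B],   ~0.30 ps/B

# Latency dominates roughly when N*alpha > (beta + gamma)*M,
# i.e. for tensors smaller than M_crossover = N*alpha / (beta + gamma).
for n in [1, 8, 32]:
    m_cross = n * alpha / (beta + gamma)
    # prints ~9 MB for N=1 and ~71 MB for N=8, matching the estimates above
    # up to rounding
    print(f"N={n:>3}: latency-dominated below ~{m_cross/1e6:.1f} MB")
```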
The other extreme is a mesh (all-to-all or broadcast) connection, which minimizes the time at the expense of network and memory bandwidth. A mesh connection can be realized easily in the DGX H100/A100 because of the NVSwitch architecture: each GPU can broadcast its content to the 7 other GPUs. In this case, the 900 GB/s of bandwidth available to each GPU is shared by 7 connections.
The allreduce time (tar) and network bandwidth (bwn) can be analytically expressed as follows.
tar = α + (β + γ(N-1))M
bwn = M/(α + (β + γ(N-1))M) ~ 1/(γN)
The total amount of data moved is N(N-1)M over 1 time step. The memory bandwidth required increases with N, hence this approach would not scale well.
Plugging in the numbers for the DGX H100, the following table is obtained (the sketch below shows how such a comparison can be generated). From it, mesh appears to perform better for smaller message/tensor sizes at smaller scales.
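A sketch that generates such a comparison from the two analytic cost models, again with the estimated constants, so the exact numbers are only indicative.

```python
# Sketch: comparing the ring and mesh allreduce cost models for one DGX H100,
# using the estimated constants from above (indicative numbers only).
alpha = 12.5e-6          # latency [s]
beta = 1 / 900e9         # inverse NVLink bandwidth [s/B]
gamma = 1 / 3350e9       # inverse HBM bandwidth [s/B]
N = 8                    # GPUs per DGX H100

def t_ring(M):
    # tar = 2(N-1)(alpha + beta*M/N + gamma*M/N)
    return 2 * (N - 1) * (alpha + beta * M / N + gamma * M / N)

def t_mesh(M):
    # tar = alpha + (beta + gamma*(N-1))*M
    return alpha + (beta + gamma * (N - 1)) * M

print(f"{'tensor size':>12} {'ring':>12} {'mesh':>12}")
for M in [1e6, 10e6, 100e6, 1e9]:
    print(f"{M/1e6:>9.0f} MB {t_ring(M)*1e3:>9.3f} ms {t_mesh(M)*1e3:>9.3f} ms")
```

With these constants, mesh wins clearly below the latency-dominated crossover sizes and the two models converge for tensors in the hundreds of megabytes, consistent with the observation above.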
When many DGX systems are connected into a cluster with switches and fiber-optic cables, additional latency is added by each component between ranks: multiple switches, signal propagation time in the cables, etc. For a cluster spanning a 50-foot by 50-foot area, a roughly 100-foot corner-to-corner cable run alone can add about 1.25 ns/ft × 100 ft = 125 ns of propagation delay. The NVIDIA Spectrum SN5000 series switches have a latency of 0.3 microseconds. In a single spine-leaf architecture, a packet might have to traverse 4 switches, hence the end-to-end latency from switches alone is 1.2 microseconds. These additional latencies increase the allreduce time.
Scaling Efficiency:
A common measure for scaling is scaling efficiency, which indicates how much of the available TOPS is actually usable. It is commonly calculated as the ratio of the iteration time of a single process (no collectives) to the iteration time with collectives. In the worst-case scenario, all the time spent in collectives gets added to the single-process iteration time (t1iter). The scaling efficiency (SE) is then as follows.
SE = t1iter/(t1iter+tar)
Allreduce is a synchronization operation that makes all processes run more or less in lockstep. In reality, each process may experience slightly different iteration times, which may also vary from iteration to iteration (tvar). This variation also reduces SE. There is additional overhead from TensorFlow or PyTorch as well (tfw), since the tensors reside in the framework and have to be copied to the NICs, etc. The SE with these 3 additional overheads incorporated looks as follows.
SE = t1iter/(t1iter+tar+tfw+tvar)
From the above equation, it is apparent that for the best SE, t1iter should be as high as possible. However, as GPUs and memories become faster, t1iter decreases and consequently SE decreases, unless the other 3 components also go down proportionately, as the sketch below illustrates.
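A small sketch of the SE formula with made-up overhead values shows how SE falls as the single-GPU iteration time shrinks.

```python
# Sketch: scaling efficiency from the formula above, with made-up timings.
def scaling_efficiency(t1iter, tar, tfw=0.0, tvar=0.0):
    return t1iter / (t1iter + tar + tfw + tvar)

tar, tfw, tvar = 20e-3, 5e-3, 5e-3       # hypothetical overheads [s]
for t1iter in [1.0, 0.3, 0.1, 0.03]:     # iteration time shrinks as GPUs get faster
    se = scaling_efficiency(t1iter, tar, tfw, tvar)
    print(f"t1iter={t1iter*1e3:>6.0f} ms -> SE = {se:.0%}")
```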
A simple technique to improve SE is to increase batch size. Batch size does not influence the weight/parameter size that needs to be allreduced ( or other collectives) but increases the iteration time. Another simple technique is to do collectives only every n-th iteration. Both of these techniques have been found to work well, although convergence could become an issue at higher values of batch size and iteration aggregation.
Another technique is to hide or overlap the overhead of collectives behind compute. In this case, only part or none is exposed or gets added to the iteration time.
End Note:
The relatively simple linear-algebra arithmetic used in Deep Learning is well suited to the large-scale parallel computing made possible by the tens of thousands of simpler cores in a single GPU, compared to the hundreds of more complex cores in CPUs.
The mismatch between arithmetic throughput and memory bandwidth results in low utilization of the formidable arithmetic resources available in modern GPUs. This mismatch is exacerbated by the introduction of higher-performance arithmetic units like TensorCores.
Memory bandwidth is limited by DRAM technology limitations at this point. HBM partially addresses this by adding 1000s of data pins and reducing the physical distance to the arithmetic units.
A second level of parallelism is achieved by networking thousands of GPUs together through complex and expensive hardware and software. The overhead of this complex system means each GPU is utilized at less than 100% of its standalone performance.
Used References:
Contents from the following are used in some form in the above piece.
- DDR5 vs DDR4 DRAM — All the Advantages & Design Challenges, https://www.rambus.com/blogs/get-ready-for-ddr5-dimm-chipsets/#:~:text=DDR5%20Scales%20to%208.4%20GT,increase%20to%204.8%20GT%2Fs.
- Roofline model, https://en.wikipedia.org/wiki/Roofline_model
- COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training, https://arxiv.org/abs/2211.16648
- A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms, https://mlsys.org/media/mlsys-2020/Slides/1435.pdf, https://github.com/Emma926/paradnn
- DictFormer: Tiny Transformer with Shared Dictionary, https://openreview.net/forum?id=GWQWAeE9EpB
- AI chip features hardware support for transformer models, https://www.embedded.com/ai-chip-features-hardware-support-for-transformer-models/
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, https://arxiv.org/pdf/2305.07759.pdf, https://huggingface.co/roneneldan, https://huggingface.co/roneneldan/TinyStories-1M
- FlashAttention, https://github.com/HazyResearch/flash-attention
- Memory-Limited Layers User’s Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-memory-limited/index.html#mem-limited
- GPU Performance Background User’s Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html
- NVIDIA A100 Tensor Core GPU Architecture, https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
- Ampere (microarchitecture), https://en.wikipedia.org/wiki/Ampere_(microarchitecture)
- NVIDIA Hopper GPU Architecture and H100 Accelerator Announced, https://www.anandtech.com/show/17327/nvidia-hopper-gpu-architecture-and-h100-accelerator-announced
- NVIDIA H100 Tensor Core GPU Architecture, https://resources.nvidia.com/en-us-tensor-core
- NVIDIA TESLA V100 GPU ARCHITECTURE, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Inside the NVIDIA Ampere Architecture. https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf
- Nvidia’s H100: Funny L2, and Tons of Bandwidth, https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2023/06/h100_latency_vs_a100.png?ssl=1
- Introduction to the NVIDIA DGX H100 System, https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html
- Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors, https://arxiv.org/pdf/2206.02874.pdf, https://github.com/sunlex0717/DissectingTensorCores
- CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
- Zen — Microarchitectures — AMD, https://en.wikichip.org/wiki/amd/microarchitectures/zen
- Scaling the ‘Scaling Wall’ to Future Compute Systems, https://www.eetimes.eu/scaling-the-scaling-wall-to-future-compute-systems/
- AMD Envisions Stacked DRAM on top of Compute Chiplets in the Near Future, https://www.techpowerup.com/305060/amd-envisions-stacked-dram-on-top-of-compute-chiplets-in-the-near-future
- High Bandwidth Memory, https://en.wikipedia.org/wiki/High_Bandwidth_Memory
- Insights into DDR5 Sub-timings and Latencies, https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies
- DRAM Bandwidth and Latency Stacks: Visualizing DRAM Bottlenecks, https://heirman.net/papers/eyerman2022dram.pdf
- Memory Bandwidth Per Core and Per Socket for Intel Xeon and AMD EPYC, https://www.servethehome.com/memory-bandwidth-per-core-and-per-socket-for-intel-xeon-and-amd-epyc/
- THE FUTURE OF LOW-LATENCY MEMORY ,https://objective-analysis.com/wp-content/uploads/2022/12/2021-04-18-Objective-Analysis-White-Paper-The-Future-of-Low-Latency-Memory.pdf
- Optimization of Collective Communication Operations in MPICH, https://web.cels.anl.gov/~thakur/papers/ijhpca-coll.pdf
- A Survey of Methods for Collective Communication Optimization and Tuning, https://arxiv.org/pdf/1611.06334.pdf
- Horovod, https://www.uber.com/blog/horovod/
- Why Tree has better bandwidth performance than Ring on only 2 DGX-A100 ? https://github.com/NVIDIA/nccl/issues/812
- Performance of MVAPICH2-GDR on DGX A100, http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/mug21_dgx2a100.pdf
- Characterizing Off-path SmartNIC for Accelerating Distributed Systems, https://arxiv.org/pdf/2212.07868.pdf
- Maximizing Network Performance for Storage with NVIDIA Spectrum Ethernet, https://developer.nvidia.com/blog/maximizing-network-performance-for-storage-with-nvidia-spectrum-ethernet/