Until now, no one has been able to massively increase the amount of compute dedicated to a single model beyond the level of OpenAI's GPT-4.
Google's Gemini Ultra, Nvidia's Nemotron 340B, and Meta's Llama 3 405B used similar or slightly more compute than GPT-4, but with an inferior architecture, and as a result those models did not unlock new capabilities.
OpenAI's GPT-4 training run used roughly 21.5 million ExaFLOPs of BF16 compute on ~20,000 A100s over 90 to 100 days. A 100k H100 cluster has 15 to 31 times the compute of that cluster.
A 100-day training run on a 100k H100 cluster can reach roughly 600 million ExaFLOPs of effective compute, after hardware reliability problems and other inefficiencies reduce utilization to about 35% of the theoretical peak.
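As a rough sanity check on these figures, the back-of-the-envelope calculation below reproduces the 15-31x and ~600 million ExaFLOP numbers. The per-GPU peak throughputs (A100 BF16 ~312 TFLOP/s, H100 BF16 ~989 TFLOP/s, H100 FP8 ~1,979 TFLOP/s, all dense) and the choice of FP8 for the effective-compute figure are assumptions on my part, not numbers stated above.

```python
# Back-of-the-envelope check of the cluster compute figures (assumed specs).
SECONDS_PER_DAY = 86_400

a100_bf16 = 312e12    # FLOP/s per GPU, dense BF16 (assumed spec)
h100_bf16 = 989e12    # FLOP/s per GPU, dense BF16 (assumed spec)
h100_fp8  = 1_979e12  # FLOP/s per GPU, dense FP8  (assumed spec)

# Peak cluster throughput
peak_20k_a100      = 20_000 * a100_bf16
peak_100k_h100_b16 = 100_000 * h100_bf16
peak_100k_h100_fp8 = 100_000 * h100_fp8

print(peak_100k_h100_b16 / peak_20k_a100)  # ~15.9x (BF16 vs BF16)
print(peak_100k_h100_fp8 / peak_20k_a100)  # ~31.7x (FP8 vs BF16)

# Effective compute of a 100-day run at ~35% utilization, in ExaFLOPs (1e18 FLOP)
effective = peak_100k_h100_fp8 * 100 * SECONDS_PER_DAY * 0.35
print(effective / 1e18)                    # ~598 million ExaFLOPs
```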
To understand the network design, topology, reliability concerns, and checkpointing strategies, we first need to understand how LLM training handles data and minimizes data movement.
There are three different types of parallelism used in trillion-parameter training: Data Parallelism, Tensor Parallelism, and Pipeline Parallelism.
Data Parallelism is the simplest form of parallelism: each GPU (rank) holds a full copy of the model weights and receives a different subset of the data. This type of parallelism requires the least communication, since only the gradients need to be summed (all-reduce) across GPUs. It only works if each GPU has enough memory to store the entire model weights, activations, and optimizer state. For GPT-4, the model weights and optimizer state can take as much as 10.8 Terabytes of memory during training.
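For illustration, here is a minimal data-parallel training sketch using PyTorch's DistributedDataParallel. The model, dataset, and hyperparameters are placeholders chosen only to show the pattern: every rank holds a full copy of the model, sees a different shard of the data, and gradients are all-reduced across ranks.

```python
# Minimal data-parallel sketch with PyTorch DDP (illustrative placeholders only).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU; rank/world size come from the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Every rank holds a full copy of the model.
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model)  # gradients are all-reduced across ranks during backward()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank sees a different shard of the data.
    data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    loader = DataLoader(data, batch_size=32, sampler=DistributedSampler(data))

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()  # DDP overlaps the gradient all-reduce with backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```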
Tensor Parallelism reduces the memory used per GPU by the number of tensor parallel ranks, since each layer's weight matrices are split across those ranks. For example, it is common today to use 8 tensor parallel ranks across NVLink, which reduces the memory used per GPU by a factor of 8.
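A rough sketch of the idea, in the style of a Megatron column-parallel linear layer: the weight matrix is split along its output dimension, so each GPU stores and multiplies only its own slice. This is a simplified illustration (a real implementation would typically keep the output sharded and defer the communication to a following row-parallel layer), and the process-group setup is assumed to have been done elsewhere.

```python
# Sketch of a column-parallel linear layer: each tensor-parallel rank holds
# 1/tp_size of the weight's output columns. Illustrative only; assumes
# torch.distributed has already been initialized with a tensor-parallel group.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(group=tp_group)
        assert out_features % tp_size == 0
        # Each rank holds only out_features / tp_size columns of the weight.
        self.weight = torch.nn.Parameter(
            torch.empty(out_features // tp_size, in_features)
        )
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local partial output: [batch, out_features // tp_size]
        local_out = x @ self.weight.t()
        # Gather the partial outputs from all tensor-parallel ranks and
        # concatenate along the feature dimension to recover the full output.
        tp_size = dist.get_world_size(group=self.tp_group)
        gathered = [torch.empty_like(local_out) for _ in range(tp_size)]
        dist.all_gather(gathered, local_out, group=self.tp_group)
        return torch.cat(gathered, dim=-1)
```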
With Pipeline Parallelism, each GPU holds only a subset of the layers, performs the computation for those layers only, and passes its output activations to the next GPU in the pipeline.
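A bare-bones sketch of that handoff, using point-to-point sends and receives between adjacent ranks, is shown below. It is purely illustrative and omits micro-batching; real pipeline schedules (e.g. GPipe or 1F1B) split each batch into micro-batches so that all stages stay busy.

```python
# Bare-bones pipeline parallelism sketch: rank i owns a slice of the layers,
# runs its forward pass, and sends the activations to rank i+1.
import torch
import torch.distributed as dist

def pipeline_forward(stage: torch.nn.Module, batch_size: int, hidden: int):
    rank = dist.get_rank()
    world = dist.get_world_size()

    if rank == 0:
        # First stage reads the input batch (random here for illustration).
        x = torch.randn(batch_size, hidden, device="cuda")
    else:
        # Later stages receive activations from the previous stage.
        x = torch.empty(batch_size, hidden, device="cuda")
        dist.recv(x, src=rank - 1)

    out = stage(x)  # compute only this stage's layers

    if rank < world - 1:
        dist.send(out, dst=rank + 1)  # hand activations to the next stage
    else:
        return out  # last stage holds the final output and computes the loss
```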