Parallel training on multiple GPUs is state of the art in deep learning. The open source image generation model Stable Diffusion was trained on a cluster of 256 GPUs. Meta's AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.
By using multiple GPUs, machine learning practitioners reduce the wall-clock time of their training runs. Training Stable Diffusion took 150,000 GPU hours, which is more than 17 years on a single GPU. Spread across its 256-GPU cluster, that came down to about 25 days.
There are two kinds of parallelism in deep learning:
- Data parallelism, where a large dataset is distributed across multiple GPUs, each holding a full copy of the model.
- Model parallelism, where a deep learning model that is too large to fit on a single GPU is split across multiple devices.
We'll focus here on data parallelism, since model parallelism only becomes relevant for very large models beyond roughly 500M parameters.
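To make this concrete, here is a minimal sketch of data-parallel training with PyTorch's `DistributedDataParallel`. The model, dataset, and hyperparameters are placeholders; the point is the structure: each GPU runs one process with its own copy of the model and its own shard of the data, and gradients are averaged across GPUs during the backward pass.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # One process per GPU; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and dataset as stand-ins for a real training setup.
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # DistributedSampler gives each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP averages gradients across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

You would launch one process per GPU with, for example, `torchrun --nproc_per_node=4 train.py`; `torchrun` sets the `LOCAL_RANK` environment variable that the script reads.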
Beyond reducing wall-clock time, there is an economic argument for parallel training: cloud compute providers such as AWS offer single machines with up to 16 GPUs, and you pay for the whole machine whether you use one GPU or all of them. Parallel training puts every available GPU to work, so you get more value for your money.