
Running PyTorch distributed data parallel jobs on an OCI GPU cluster

Blog: Oracle BPM

Oracle Cloud Infrastructure (OCI) superclusters, which combine powerful NVIDIA GPUs with low-latency, high-bandwidth RoCE v2 networks, provide an ideal platform for high-performance computing (HPC) and machine learning (ML) workloads. In this blog, we show how easy and versatile it is to use the preinstalled SLURM from the OCI cluster network solution to run PyTorch distributed data parallel jobs on GPU instances.
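To make the connection between SLURM and PyTorch concrete, here is a minimal sketch of how a training script might derive its distributed settings from the environment variables SLURM sets for each task. The helper name `ddp_env_from_slurm` is hypothetical and not part of the OCI solution or PyTorch, and the sketch assumes one SLURM task is launched per GPU.

```python
import os


def ddp_env_from_slurm(environ=None):
    """Map SLURM per-task variables to the rank settings a DDP job needs.

    Hypothetical helper: assumes one SLURM task per GPU. Falls back to a
    single-process setup when the SLURM variables are absent.
    """
    env = os.environ if environ is None else environ
    return {
        # Global index of this task across all nodes in the job
        "RANK": int(env.get("SLURM_PROCID", 0)),
        # Total number of tasks in the job
        "WORLD_SIZE": int(env.get("SLURM_NTASKS", 1)),
        # Index of this task on its own node, usable as the GPU index
        "LOCAL_RANK": int(env.get("SLURM_LOCALID", 0)),
    }


if __name__ == "__main__":
    # Reads the real environment when launched through srun;
    # outside SLURM it reports the single-process defaults.
    print(ddp_env_from_slurm())
```

A training script could pass `RANK` and `WORLD_SIZE` to `torch.distributed.init_process_group` and use `LOCAL_RANK` to select the GPU. With the default environment-based initialization, `MASTER_ADDR` and `MASTER_PORT` must also be set so that all ranks can find the rendezvous host.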