Decreasing model training time with GPU

How do you make sure you are making the best use of your GPU resources?


Why I looked into this

Fine-tuning by replacing the final layers was taking a long time, and my BERT inference was slow as well.

So I looked into everything that offered any scope for improvement.

Making sure all data and model parameters have been sent to the GPU:

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
input_data = data.to(device)
input_labels = labels.to(device)

If you initialized any tensors yourself in your code, you need to send them to the GPU explicitly:

# torch.autograd.Variable is deprecated; create the tensor directly on the right device
weights = torch.randn(input_dim, requires_grad=True, device=device)

One very important thing to remember, as mentioned in the PyTorch documentation: “Please note that just calling my_tensor.to(device) returns a new copy of my_tensor on GPU instead of rewriting my_tensor.”
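In other words, you have to assign the result back to a variable; a quick illustration, reusing the my_tensor name from the quote:

my_tensor.to(device)              # the GPU copy is returned and immediately discarded
my_tensor = my_tensor.to(device)  # reassign so the name now points to the GPU copy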

You can confirm whether your data or parameters have been sent to the GPU like this:

print(input_data.is_cuda)
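The same check works for the model itself, for example by looking at one of its parameters (assuming model is the nn.Module from above):

print(next(model.parameters()).is_cuda)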

Setting cudnn benchmark = True lets cuDNN choose the best algorithm under the hood based on your data, making more efficient use of the GPU. It helps most when your input sizes do not change between iterations.

torch.backends.cudnn.benchmark = True

GPU memory usage:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

if device == "cuda":
    nvmlInit()
    h = nvmlDeviceGetHandleByIndex(0)  # handle for GPU 0
    info = nvmlDeviceGetMemoryInfo(h)
    print("percent use GPU memory", "{0:.0%}".format(info.used / info.total))

I could observe that increasing the batch size led to greater GPU memory usage; for my task, GPU memory seemed to be the only computational bottleneck.
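If you prefer to stay inside PyTorch, you can also ask it for its own allocation statistics; a minimal sketch using torch.cuda:

import torch

if torch.cuda.is_available():
    print("allocated:      {:.0f} MiB".format(torch.cuda.memory_allocated() / 1024**2))
    print("peak allocated: {:.0f} MiB".format(torch.cuda.max_memory_allocated() / 1024**2))

Note that this only counts memory allocated by PyTorch tensors, so it will typically be lower than what nvidia-smi or pynvml report for the whole device.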

Using GPU compute

Ideally, if you have done all of the previous steps, you should be able to run your program on the GPU. If you have erroneously left some parts of your model on the CPU, you will get an error. This quote from the prolific ptrblck (NVIDIA) gave me some peace: “…..If you run your code and some operations are using tensors on the GPU and CPU, you’ll get an error.”
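You can reproduce that error with a toy example (not part of my model); the hypothetical toy_model below lives on the GPU while its input stays on the CPU:

import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2).to("cuda")  # parameters on the GPU
x = torch.randn(1, 4)                   # input left on the CPU
try:
    toy_model(x)
except RuntimeError as e:
    print(e)  # complains that tensors were found on different devices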

Performance on different GPUs for my task

Keeping everything constant except the GPU (same data size, same batch size):

GPU      | GPU memory used | Epoch 0 loss       | Training epoch time
V100     | 80%             | 62.719778537750244 | 0:02:22
RTX 8000 | 53%             | 62.734622955322266 | 0:01:47

Using data parallel

What exactly is parallelized?

From the PyTorch documentation: “DataParallel splits your data automatically and sends job orders to multiple models on several GPUs. After each model finishes their job, DataParallel collects and merges the results before returning it to you.”
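Wrapping a model in DataParallel is a one-line change; a minimal sketch, assuming more than one GPU is visible and model is the nn.Module from earlier:

import torch
import torch.nn as nn

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model and splits each batch across the GPUs
model.to(device)  # the wrapper (and the underlying model) still needs to be moved to the device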

How is compute shared between multiple nodes? DataParallel itself is single-process and single-node; for training across multiple nodes, PyTorch provides DistributedDataParallel.

Looking at the profiler at the end to observe any bottlenecks
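As a starting point, PyTorch's built-in profiler can break a step down by operator and show how much of the time is spent on the CPU versus the GPU; a minimal sketch, where model(input_data) stands in for your own training step:

import torch
from torch.profiler import profile, ProfilerActivity

# Profile one forward pass and print the operators sorted by time spent on the GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(input_data)  # placeholder for your forward/backward step
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))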