Install CUDA drivers on Azure VM
I took me a working day to figure this out, so you don’t have to. I was given an Azure VM with T4 GPU. The task is to make the GPU usable by installing and correctly configuring the NVIDIA drivers.
First of all, here are the things I tried that all did not work.
Failed attempt 1: Via the Azure web interface, install the NVIDIA GPU Driver Extension.
It will returns the following error:
{"code":"DeploymentFailed","target":"...","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":\[{"code":"ResourceDeploymentFailure","target":"...","message":"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'."}]}
Failed attempt 2: Install the drivers manually
The whole process went through. However, there was a weird sub-step in step 2, where it asks to set a password that you will never has a chance to use because you cannot log into the BIOS screen when the VM boots.
Eventually, when the process finishes, I still got the following error
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
However, this gives a hint that Secure Boot is the cause of the whole thing. It turns out that, with Secure Boot enabled, drivers (or firmwares?) need to be “signed” for security reasons. However, signing is such a complicated process that I never got it to work. Therefore, the solution requires disabling Secure Boot and forget about security.
The solution: Disable Secure Boot
I first disabled Secure Boot in our VM. Then, I uninstalled all nvidia drivers that I have installed during previous attemps, using the following command:
$ sudo apt-get remove --purge '^nvidia-.*'
After that, I just followed Azure’s instructions closely, including the two reboots. Note what they say about changing the link in step 3 if you are not on Ubuntu 22.04!
Finally, it works 🎆.
Concluding remarks
- The discussion in this Github issue is the game changer for me, where their issue was exactly the same as mine and people described their process in great detail.
- During the trouble shooting process, I relied on both external knowledge (google search + chatgpt) and my intuition. However, I would have reached the solution faster if I paid more attention to the search and look for posts where people encountered exactly the same problem as mine (as the Github issue above). However, somehow I did not believe there will be such a post existed, which is unreasonable because Azure VM and its GPU service is pretty popular.
- It is such a hassle to configure these stuffs :) I hope the benefits will outweigh these annoying steps.