[gpu] NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models. I get an error trying to run 'nvidia-smi'.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$  lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

I followed these instructions to install CUDA 7 and cuDNN:

$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot

=======================================================================

Post reboot, update the initramfs by running '$sudo update-initramfs -u'

Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and exit from the file.

Now install the build-essential tools, update the initramfs, and reboot again as below:

$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
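
As a quick sanity check after this reboot, you can confirm that nouveau is no longer loaded before installing the NVIDIA driver (assuming the blacklist above took effect, the command prints nothing):

$lsmod | grep nouveau
# no output means nouveau is not loaded; if it still appears, re-check
# the blacklist.conf entries and the initramfs update above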

========================================================================

Post reboot, run the following commands to install Nvidia.

$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot

========================================================================

Now that the system has come up, verify the installation by running the following.

$sudo modprobe nvidia
$sudo nvidia-smi -q | head

You should see output like that shown in the attached 'nvidia.png'.

Now run the following commands:

$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery

However, 'nvidia-smi' still doesn't show GPU activity while Tensorflow is training models:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
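
To double-check whether TensorFlow is even registering the GPU, the device mapping can be logged when a session is created (TF 1.x-style API, matching the logs above):

python -c "import tensorflow as tf; tf.Session(config=tf.ConfigProto(log_device_placement=True))"
# the session startup log should list the GRID K520; if only the CPU
# appears, TensorFlow is not seeing the GPU at all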

Tags: gpu

Answers:


I just want to thank @Heapify for providing a practical answer, and to update his answer because the attached links are no longer up-to-date.

Step 1: Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have been released. At the time of this writing, the latest stable release of Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

// UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:

Reboot your machine and check if the kernel has been updated by:

uname -a
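
After rebooting into the new kernel, it may also be worth confirming that the NVIDIA module was rebuilt for it (assuming the driver is registered with DKMS, as the packaged installs are):

uname -r       # the new kernel version
dkms status    # the nvidia module should show as installed for that kernel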

My system version: Ubuntu 20.04 LTS.

  • I solved this by generating a new MOK and enrolling it into shim.

  • This was done without disabling Secure Boot, although disabling it also works for me.

  • Simply execute this command and follow what it suggests:

    sudo update-secureboot-policy --enroll-key
    

According to Ubuntu's wiki: How can I do non-automated signing of drivers.
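
If you want to verify the Secure Boot side of this before rebooting, mokutil (normally installed alongside shim) can show the current state and the keys queued for enrollment:

mokutil --sb-state    # reports whether Secure Boot is enabled
mokutil --list-new    # lists the keys queued for enrollment on the next boot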


Try pulling out the NVIDIA graphics card and reinserting it.


This can happen after a Linux kernel update. If you hit this error, you can rebuild your NVIDIA driver with the following commands:

  1. First, make sure you have dkms, which can automatically rebuild kernel modules after a kernel version change.
    sudo apt-get install dkms
  2. Second, rebuild your NVIDIA driver. Here my driver version is 440.82; if you have installed one before, you can check the installed version under /usr/src.
    sudo dkms build -m nvidia -v 440.82
  3. Lastly, reinstall the NVIDIA driver, and then reboot your computer.
    sudo dkms install -m nvidia -v 440.82

Now you can check whether it works by running sudo nvidia-smi.
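
If you are not sure which version string to pass to dkms, the driver source tree installed under /usr/src carries it in its directory name, so a quick check (assuming the packaged driver layout) is:

ls /usr/src | grep -i nvidia    # e.g. nvidia-440.82
dkms status                     # shows which modules are built, and for which kernels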


One important fact about the NVIDIA drivers that is not very well known is that they are built by DKMS. This allows an automatic rebuild in case of a kernel upgrade, which happens on system startup. Because of that, it's quite easy to miss error messages, especially if you're working on a cloud VM or a server without an additional IPMI/management interface. However, it's possible to trigger the DKMS build by executing dkms autoinstall right after package installation. If this fails, you'll get a meaningful error message about a missing dependency or whatever else went wrong. If dkms autoinstall builds the modules correctly, you can simply load them with modprobe; there is no need to reboot the system (which is often used just as a way to trigger the DKMS rebuild). You can check an example here.
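
As a rough sketch of that workflow (assuming the nvidia module is registered with DKMS, as it is for the packaged drivers):

sudo dkms autoinstall    # build the modules for the running kernel; errors surface here
sudo modprobe nvidia     # load the freshly built module without rebooting
nvidia-smi               # should now be able to talk to the driver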


None of the above helped in my case.

I am using Kubernetes on Google Cloud with a Tesla K80 GPU.

Follow along this guide to ensure you installed everything correctly: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

I was missing a few important things:

  1. Installing NVIDIA GPU device drivers on your NODES. To do this, use:

For COS node:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

For UBUNTU node:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

Make sure the update was rolled out to your nodes. Restart them if automatic upgrades are off.

  2. Use a CUDA base image; I use nvidia/cuda:10.1-base-ubuntu16.04 in my Docker image.

  3. You have to set a GPU limit! This is the only way the node driver can communicate with the pod. In your YAML configuration, add this under your container:

    resources:
      limits:
        nvidia.com/gpu: 1
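
To confirm the driver DaemonSet actually rolled out and the node is advertising the GPU resource, a couple of quick checks (the installer from the manifests above lands in the kube-system namespace; names may differ in your setup):

kubectl get pods -n kube-system | grep nvidia         # driver installer pods should be Running
kubectl describe nodes | grep -A 2 "nvidia.com/gpu"   # the node should list an allocatable GPU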
    

I tried the above solutions, but only the following worked for me.

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot

credit --> https://deeptalk.lambdalabs.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver/148


Solved the problem by re-installing CUDA:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
echo "md5sum: $(md5sum cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb)"
echo "correct: 056de5e03444cce506202f50967b0016"
dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
apt-get -qq update
apt-get -qq -y install cuda
rm cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
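
After the install, a quick check that both the bundled driver (455.23.05, per the package name above) and the CUDA 11.1 toolkit are in place (a reboot may be needed first so the new kernel module gets loaded):

nvidia-smi        # should report a 455.xx driver once the module is loaded
nvcc --version    # should report CUDA 11.1; may require adding /usr/local/cuda/bin to PATH first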

In my case, none of the above solutions helped:

Root cause: incompatible version of gcc

Solution:

1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3. sudo apt install nvidia-driver-450
4. sudo reboot

System: AWS EC2 Ubuntu 18.04 instance

Solution source: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4
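
To see whether this root cause applies to you, compare the gcc version the running kernel was built with against your default gcc (the kernel records its build compiler in /proc/version):

cat /proc/version    # shows the gcc version the kernel was compiled with
gcc --version        # shows the gcc that the driver build would use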


Run the following to find the right NVIDIA driver:

sudo ubuntu-drivers devices

Then pick the right one and run:

sudo apt install <version>
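
If you would rather not pick a version by hand, ubuntu-drivers can also install the recommended driver for the detected hardware (assuming the ubuntu-drivers-common package is installed):

sudo ubuntu-drivers autoinstall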

What I found fixed the issue, regardless of kernel version, was to take the packages that other answers fetch with wget and have apt install them instead:

sudo apt-get install --reinstall linux-headers-$(uname -r)

Driver Version: 390.138 on Ubuntu server 18.04.4


I am working with an AWS DeepAMI P2 instance, and suddenly I found that the nvidia driver commands stopped working and the GPU was not found by the torch or tensorflow libraries. I then resolved the problem in the following way.

Run nvcc --version. If it doesn't work, run the following:

apt install nvidia-cuda-toolkit

Hopefully that will solve the problem.


I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with GTX 950m and Ubuntu 18.04 by disabling Secure Boot Control from BIOS.


I had to install the NVIDIA 367.57 driver and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar.

Now the GRID K520 GPU is working while I train tensorflow models:

ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr  1 18:03:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   39C    P8    43W / 125W |   3800MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2254    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          8.0 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

I have been struggling with this issue for two days; I'm sharing my solution here in case anyone needs it.

The VMs I'm using are Standard N-series GPU servers with 2 K80 cards on the Azure platform, with Ubuntu 18.04 installed.

Apparently there was a Linux kernel update several days before I came across this issue, and after the update the driver stopped working.

At first, I did a purge and re-install as the above replies suggested. Nothing worked. Then, out of the blue (I don't remember why I wanted to do it), I updated the default gcc and g++ versions on one of my VMs as follows.

sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90

Then I purged the NVIDIA software and reinstalled it again as instructed in the official documentation (please choose the correct one for your system: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal).

sudo apt-get purge nvidia-*

Then the nvidia-smi command finally worked again.

PS:

If you are using an Azure Linux VM like me, the recommended way to install CUDA is actually by enabling the "NVIDIA GPU Driver Extension" in the Azure portal (of course, after you have configured the correct gcc version).

I have tried this approach on another of my VMs and it works as well.


I was getting the same error on my Ubuntu 16.04 (Linux 4.14 kernel) instance in Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and, boom, the problem was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have been released. At the time of this writing, the latest stable release of Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

You should see that your kernel has been upgraded and hopefully nvidia-smi should work.