[gpu] NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models. I get an error trying to run 'nvidia-smi'.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$  lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

I followed these instructions to install CUDA 7 and cuDNN:

$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot

=======================================================================

Post reboot, update the initramfs by running '$sudo update-initramfs -u'

Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and exit from the file.

Now install the build-essential tools, update the initramfs, and reboot again as below:

$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
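
As a quick sanity check after this reboot, you can confirm that nouveau is no longer loaded before installing the NVIDIA driver (assuming the blacklist above took effect, the command prints nothing):

$lsmod | grep nouveau
# no output means nouveau is not loaded; if it still appears, re-check
# the blacklist.conf entries and the initramfs update above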

========================================================================

Post reboot, run the following commands to install Nvidia.

$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot

========================================================================

Now that the system has come up, verify the installation by running the following.

$sudo modprobe nvidia
$sudo nvidia-smi -q | head

You should see output like that shown in the attached 'nvidia.png'.

Now run the following commands:

$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery

However, 'nvidia-smi' still doesn't show GPU activity while Tensorflow is training models:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
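
To double-check whether TensorFlow is even registering the GPU, the device mapping can be logged when a session is created (TF 1.x-style API, matching the logs above):

python -c "import tensorflow as tf; tf.Session(config=tf.ConfigProto(log_device_placement=True))"
# the session startup log should list the GRID K520; if only the CPU
# appears, TensorFlow is not seeing the GPU at all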

Tags: gpu

Answers:


I just want to thank @Heapify for providing a practical answer, and to update his answer because the attached links are no longer up-to-date.

Step 1: Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have been released. At the time of this writing, the latest stable release of Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

// UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:

Reboot your machine and check if the kernel has been updated by:

uname -a
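
After rebooting into the new kernel, it may also be worth confirming that the NVIDIA module was rebuilt for it (assuming the driver is registered with DKMS, as the packaged installs are):

uname -r       # the new kernel version
dkms status    # the nvidia module should show as installed for that kernel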

My system version: Ubuntu 20.04 LTS.

  • I solved this by generating a new MOK and enrolling it into shim.

  • This was done without disabling Secure Boot, although disabling it also works for me.

  • Simply execute this command and follow what it suggests:

    sudo update-secureboot-policy --enroll-key
    

According to Ubuntu's wiki: How can I do non-automated signing of drivers.
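
If you want to verify the Secure Boot side of this before rebooting, mokutil (normally installed alongside shim) can show the current state and the keys queued for enrollment:

mokutil --sb-state    # reports whether Secure Boot is enabled
mokutil --list-new    # lists the keys queued for enrollment on the next boot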


Try pulling out the NVIDIA graphics card and reinserting it.


This can happen after a Linux kernel update. If you hit this error, you can rebuild your NVIDIA driver with the following commands:

  1. First, make sure you have dkms, which can automatically rebuild kernel modules after a kernel version change.
    sudo apt-get install dkms
  2. Second, rebuild your NVIDIA driver. Here my driver version is 440.82; if you have installed one before, you can check the installed version under /usr/src.
    sudo dkms build -m nvidia -v 440.82
  3. Lastly, reinstall the NVIDIA driver, and then reboot your computer.
    sudo dkms install -m nvidia -v 440.82

Now you can check whether it works by running sudo nvidia-smi.
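
If you are not sure which version string to pass to dkms, the driver source tree installed under /usr/src carries it in its directory name, so a quick check (assuming the packaged driver layout) is:

ls /usr/src | grep -i nvidia    # e.g. nvidia-440.82
dkms status                     # shows which modules are built, and for which kernels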


One important fact about the NVIDIA drivers that is not very well known is that they are built by DKMS. This allows an automatic rebuild in case of a kernel upgrade, which happens on system startup. Because of that, it's quite easy to miss error messages, especially if you're working on a cloud VM or a server without an additional IPMI/management interface. However, it's possible to trigger the DKMS build by executing dkms autoinstall right after package installation. If this fails, you'll get a meaningful error message about a missing dependency or whatever else went wrong. If dkms autoinstall builds the modules correctly, you can simply load them with modprobe; there is no need to reboot the system (which is often used just as a way to trigger the DKMS rebuild). You can check an example here.
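
As a rough sketch of that workflow (assuming the nvidia module is registered with DKMS, as it is for the packaged drivers):

sudo dkms autoinstall    # build the modules for the running kernel; errors surface here
sudo modprobe nvidia     # load the freshly built module without rebooting
nvidia-smi               # should now be able to talk to the driver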


None of the above helped in my case.

I am using Kubernetes on Google Cloud with a Tesla K80 GPU.

Follow along this guide to ensure you installed everything correctly: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

I was missing a few important things:

  1. Installing NVIDIA GPU device drivers on your NODES. To do this, use:

For COS node:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

For UBUNTU node:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

Make sure the update was rolled out to your nodes. Restart them if automatic upgrades are off.

  2. Use a CUDA base image; I use nvidia/cuda:10.1-base-ubuntu16.04 in my Docker image.

  3. You have to set a GPU limit! This is the only way the node driver can communicate with the pod. In your YAML configuration, add this under your container:

    resources:
      limits:
        nvidia.com/gpu: 1
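
To confirm the driver DaemonSet actually rolled out and the node is advertising the GPU resource, a couple of quick checks (the installer from the manifests above lands in the kube-system namespace; names may differ in your setup):

kubectl get pods -n kube-system | grep nvidia         # driver installer pods should be Running
kubectl describe nodes | grep -A 2 "nvidia.com/gpu"   # the node should list an allocatable GPU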
    

I tried the above solutions, but only the following worked for me.

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot

credit --> https://deeptalk.lambdalabs.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver/148


Solved the problem by re-installing CUDA:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
echo "md5sum: $(md5sum cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb)"
echo "correct: 056de5e03444cce506202f50967b0016"
dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
apt-get -qq update
apt-get -qq -y install cuda
rm cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
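
After the install, a quick check that both the bundled driver (455.23.05, per the package name above) and the CUDA 11.1 toolkit are in place (a reboot may be needed first so the new kernel module gets loaded):

nvidia-smi        # should report a 455.xx driver once the module is loaded
nvcc --version    # should report CUDA 11.1; may require adding /usr/local/cuda/bin to PATH first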

In my case, none of the above solutions helped:

Root cause: incompatible version of gcc

Solution:

1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3. sudo apt install nvidia-driver-450
4. sudo reboot

System: AWS EC2 Ubuntu 18.04 instance

Solution source: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4
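
To see whether this root cause applies to you, compare the gcc version the running kernel was built with against your default gcc (the kernel records its build compiler in /proc/version):

cat /proc/version    # shows the gcc version the kernel was compiled with
gcc --version        # shows the gcc that the driver build would use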


Run the following to find the right NVIDIA driver:

sudo ubuntu-drivers devices

Then pick the right one and run:

sudo apt install <version>
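
If you would rather not pick a version by hand, ubuntu-drivers can also install the recommended driver for the detected hardware (assuming the ubuntu-drivers-common package is installed):

sudo ubuntu-drivers autoinstall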

What I found fixed the issue, regardless of kernel version, was to take the packages that other answers fetch with wget and have apt install them instead:

sudo apt-get install --reinstall linux-headers-$(uname -r)

Driver Version: 390.138 on Ubuntu server 18.04.4


I am working with an AWS DeepAMI P2 instance, and suddenly I found that the nvidia driver commands stopped working and the GPU was not found by the torch or tensorflow libraries. I then resolved the problem in the following way.

Run nvcc --version. If it doesn't work, run the following:

apt install nvidia-cuda-toolkit

Hopefully that will solve the problem.


I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with GTX 950m and Ubuntu 18.04 by disabling Secure Boot Control from BIOS.


I had to install the NVIDIA 367.57 driver and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar.

Now the GRID K520 GPU is working while I train tensorflow models:

ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr  1 18:03:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   39C    P8    43W / 125W |   3800MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2254    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          8.0 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

I have been struggling with this issue for two days; I'm sharing my solution here in case anyone needs it.

The VMs I'm using are Standard N-series GPU servers with 2 K80 cards on the Azure platform, with Ubuntu 18.04 installed.

Apparently there was a Linux kernel update several days before I came across this issue, and after the update the driver stopped working.

At first, I did a purge and re-install as the above replies suggested. Nothing worked. Then, out of the blue (I don't remember why I wanted to do it), I updated the default gcc and g++ versions on one of my VMs as follows.

sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90

Then I purged the NVIDIA software and reinstalled it again as instructed in the official documentation (please choose the correct one for your system: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal).

sudo apt-get purge nvidia-*

Then the nvidia-smi command finally worked again.

PS:

If you are using an Azure Linux VM like me, the recommended way to install CUDA is actually by enabling the "NVIDIA GPU Driver Extension" in the Azure portal (of course, after you have configured the correct gcc version).

I have tried this approach on another of my VMs and it works as well.


I was getting the same error on my Ubuntu 16.04 (Linux 4.14 kernel) instance in Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and, boom, the problem was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have been released. At the time of this writing, the latest stable release of Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

You should see that your kernel has been upgraded and hopefully nvidia-smi should work.