I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models. I get an error trying to run 'nvidia-smi'.
ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh nvidia-debugdump nvidia-xconfig
nvidia-cuda-mps-control nvidia-persistenced
nvidia-cuda-mps-server nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii nvidia-346 352.63-0ubuntu0.14.04.1 amd64 Transitional package for nvidia-346
ii nvidia-346-dev 346.46-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-346-uvm 346.96-0ubuntu0.0.1 amd64 Transitional package for nvidia-346
ii nvidia-352 375.26-0ubuntu1 amd64 Transitional package for nvidia-375
ii nvidia-375 375.39-0ubuntu0.14.04.1 amd64 NVIDIA binary driver - version 375.39
ii nvidia-375-dev 375.39-0ubuntu0.14.04.1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-modprobe 375.26-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-icd-346 352.63-0ubuntu0.14.04.1 amd64 Transitional package for nvidia-opencl-icd-352
ii nvidia-opencl-icd-352 375.26-0ubuntu1 amd64 Transitional package for nvidia-opencl-icd-375
ii nvidia-opencl-icd-375 375.39-0ubuntu0.14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2.1 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 375.26-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$
$ inxi -G
Graphics: Card-1: Cirrus Logic GD 5446
Card-2: NVIDIA GK104GL [GRID K520]
X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X
$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
Subsystem: XenSource, Inc. Device 0001
Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
I followed these instructions to install CUDA 7 and cuDNN:
$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot
=======================================================================
Post reboot, update the initramfs by running '$sudo update-initramfs -u'
Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.
blacklist nouveau blacklist lbm-nouveau options nouveau modeset=0 alias nouveau off alias lbm-nouveau off
Save and exit from the file.
Now install the build essential tools and update the initramfs and reboot again as below:
$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
========================================================================
Post reboot, run the following commands to install Nvidia.
$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot
========================================================================
Now that the system has come up, verify the installation by running the following.
$sudo modprobe nvidia
$sudo nvidia-smi -q | head`enter code here`
You should see the output like 'nvidia.png'.
Now run the following commands. $
cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery
However, 'nvidia-smi' still doesn't show GPU activity while Tensorflow is training models:
ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46 Driver Version: 346.46 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 35C P0 38W / 125W | 10MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This question is related to
gpu
I just want to thank @Heapify for providing a practical answer and update his answer because the attached links are not up-to-date.
Step 1: Check the existing kernel of your Ubuntu Linux:
uname -a
Step 2:
Ubuntu maintains a website for all the versions of kernel that have been released. At the time of this writing, the latest stable release of Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.
Step 3:
Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:
// UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
Step 4:
Install all the downloaded deb files:
sudo dpkg -i *.deb
Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -aenter code here
My system version: ubuntu 20.04 LTS.
I solved this by generate a new MOK and enroll it into shim.
Without disable of Secure Boot, although it also really works for me.
Simply execute this command and follow what it suggests:
sudo update-secureboot-policy --enroll-key
According to ubuntu's wiki: How can I do non-automated signing of drivers
Try pulling out the NVIDIA graphics card and reinserting it.
It may happen after your Linux kernel update, if you entered this error, you can rebuild your nvidia driver using the following command to fix:
dkms
, which can automatically regenerate new modules after kernel version changes.sudo apt-get install dkms
/usr/src
.sudo dkms build -m nvidia -v 440.82
sudo dkms install -m nvidia -v 440.82
Now you can check to see if you can use it by sudo nvidia-smi
.
One important fact about NVIDIA drivers that is not very well known is that its built is done by DKMS. This allows automatic rebuild in case of kernel upgrade, this happens on system startup. Because of that, it's quite easy to miss error messages, especially if you're working on cloud VM, or server without an additional IPMI/management interface. However, it's possible to trigger DKMS build just executing dkms autoinstall
right after packages installation. If this fails then you'll have a meaningful error message about missing dependency or what so ever. If dkms autoinstall
builds modules correctly you can simply load it by modprobe
- there is no need to reboot the system (which is often used as a way to trigger DKMS rebuild).
You can check an example here
None of the above helped for me.
I am using Kubernetes on Google Cloud with tesla k-80 gpu.
Follow along this guide to ensure you installed everything correctly: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
I was missing few important things:
For COS node:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
For UBUNTU node:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
Make sure an update was rolled to your nodes. Restart them if upgrades are off.
I use this image nvidia/cuda:10.1-base-ubuntu16.04 in my docker
You have to set gpu limit! This is the only way the node driver can communicate with the pod. In your yaml configuration add this under your container:
resources:
limits:
nvidia.com/gpu: 1
I tried above solutions but only the below worked for me.
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot
Solved the problem by re-installing CUDA:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
echo "md5sum: $(md5sum cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb)"
echo "correct: 056de5e03444cce506202f50967b0016"
dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
apt-get -qq update
apt-get -qq -y install cuda
rm cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
In my case none of the above solutions didn't help:
Root cause: incompatible version of gcc
Solution:
1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3 sudo apt install nvidia-driver-450
4. sudo reboot
System: AWS EC2 18.04 instance
Solution source: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4
Run the following to get the right NVIDIA driver :
sudo ubuntu-drivers devices
Then pick the right and run:
sudo apt install <version>
What I found to fix the issue regardless of kernel version, was taking the WGET options and having apt install them.
sudo apt-get install --reinstall linux-headers-$(uname -r)
Driver Version: 390.138 on Ubuntu server 18.04.4
I am working with a AWS DeepAMI P2 instance and suddenly I found that Nvidia-driver command doesn't working and GPU is not found torch or tensorflow library. Then I have resolved the problem in the following way,
Run nvcc --version
if it doesn't work
Then run the following
apt install nvidia-cuda-toolkit
Hopefully that will solve the problem.
I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with GTX 950m and Ubuntu 18.04 by disabling Secure Boot Control from BIOS.
I had to install the NVIDIA 367.57 driver and CUDA 7.5 with Tensorflow on the g2.2xlarge Ubuntu 14.04LTS instance. e.g. nvidia-graphics-drivers-367_367.57.orig.tar
Now the GRID K520 GPU is working while I train tensorflow models:
ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr 1 18:03:32 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 39C P8 43W / 125W | 3800MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2254 C python 3798MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 8.0 / 7.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4036 MBytes (4232052736 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS
I have been struggling on this issue for two days, sharing my solution here in case anyone may need it.
The VMs that I'm using are Standard N-series GPU server with 2 K80 cards on Azure platform. With Ubuntu 18.04 OS installed.
Apparently there is an update of linux kernel several days before I came across this issue, and after the update the driver stopped working.
At first, I did purge and re-install as above replies suggested. Nothing works. Out of sudden(I don't remember why I wanted to do it), I updated the default gcc and g++ version on one of my VM as following.
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
Then I purged the nvidia softwares and reinstall it as instructed in official document(please choose the correct one for your system: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal) again.
sudo apt-get purge nvidia-*
Then the nvidia-smi command finally worked again.
PS:
If you are using Azure linux VM like me. The recommended way to install CUDA is actually by enabling "NVIDIA GPU Driver Extension" in the Azure portal (of course, after you have configured the correct gcc version).
I have tried this way on my another VM and It works as well.
I was getting the same error on my Ubuntu 16.04 (Linux 4.14 kernel) in Google Compute Engine with K80 GPU. I upgraded the kernel to 4.15 from 4.14 and boom the problem was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:
Step 1:
Check the existing kernel of your Ubuntu Linux:
uname -a
Step 2:
Ubuntu maintains a website for all the versions of kernel that have
been released. At the time of this writing, the latest stable release
of Ubuntu kernel is 4.15. If you go to this
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will
see several links for download.
Step 3:
Download the appropriate files based on the type of OS you have. For 64
bit, I would download the following deb files:
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-
4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-
4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-
4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
Step 4:
Install all the downloaded deb files:
sudo dpkg -i *.deb
Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a
You should see that your kernel has been upgraded and hopefully nvidia-smi should work.
Source: Stackoverflow.com