Merge pull request #1826 from mindprince/gpu-docs

Add docs for using nvidia gpu monitoring.
2017-11-28 17:49:29 -08:00 · 2017-11-28 17:49:29 -08:00 · b26bf6ebb2
commit b26bf6ebb2
parent 49440c7e0a 6ba3fa4e8c
1 changed file with 23 additions and 0 deletions
--- a/docs/running.md
+++ b/docs/running.md
@@ -82,3 +82,26 @@ cAdvisor is now running (in the foreground) on `http://localhost:8080/`.
 ## Runtime Options

 cAdvisor has a series of flags that can be used to configure its runtime behavior. More details can be found in runtime [options](runtime_options.md).
+
+## Hardware Accelerator Monitoring
+
+cAdvisor can export some metrics for hardware accelerators attached to containers.
+Currently, only Nvidia GPUs are supported. There are no machine-level metrics,
+so no accelerator metrics are reported unless at least one running container has an accelerator attached.
+Metrics are reported only for accelerators that are explicitly attached to a container, e.g., by passing the `--device /dev/nvidia0:/dev/nvidia0` flag to docker.
+A privileged container can access accelerators without having them explicitly attached; in that case, metrics will NOT show up for it.
+
+There are two things that cAdvisor needs to show Nvidia GPU metrics:
+- access to the NVML library (`libnvidia-ml.so.1`).
+- access to the GPU devices.
+
+If you are running cAdvisor inside a container, pass the following flags to give the container access to the NVML library:
+```
+-e LD_LIBRARY_PATH=<path-where-nvml-is-present>
+--volume <above-path>:<above-path>
+```
+
+If you are running cAdvisor inside a container, you can do one of the following to give it access to the GPU devices:
+- Run with `--privileged`
+- If you are on docker v17.04.0-ce or above, run with `--device-cgroup-rule 'c 195:* mrw'` (195 is the character device major number of the Nvidia devices)
+- Run with `--device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 <and-so-on-for-all-nvidia-devices>` (the `--device` flag must be repeated for each device)
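
Note on the added section: the flags it lists could be assembled roughly as in the sketch below. It is not part of the commit. The function name `gpu_flags` and the default directory `/usr/lib/nvidia` are hypothetical, chosen only for illustration; the real NVML location depends on the host's driver installation. The sketch combines the NVML flags with the `--device-cgroup-rule` option, which assumes docker v17.04.0-ce or above, and it only prints the flags rather than invoking docker.

```shell
#!/bin/sh
# Sketch only: prints the extra docker flags, one per line; it does not run docker.
# gpu_flags and the default directory below are hypothetical names for illustration.

# gpu_flags DIR: emit the flags that give a containerized cAdvisor access
# to the NVML library in DIR and to the Nvidia GPU devices.
gpu_flags() {
    nvml_dir="$1"
    # Let the dynamic linker inside the container find libnvidia-ml.so.1.
    printf '%s\n' "-e" "LD_LIBRARY_PATH=${nvml_dir}"
    # Bind-mount the host directory at the same path inside the container.
    printf '%s\n' "--volume" "${nvml_dir}:${nvml_dir}"
    # Allow access to the Nvidia character devices (major number 195).
    printf '%s\n' "--device-cgroup-rule" "c 195:* mrw"
}

# Assumption: NVML lives in /usr/lib/nvidia unless NVML_DIR says otherwise.
gpu_flags "${NVML_DIR:-/usr/lib/nvidia}"
```

The printed flags go on the `docker run` command line for cAdvisor alongside its usual options. On older docker versions, the last pair would be replaced by `--privileged` or by one `--device` flag per Nvidia device, as the list above describes.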