DCGM-Based GPU Monitoring on Ubuntu

2021-10-26 k8s gpu, grafana, prometheus

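Background

I have a few GPU machines that have not been added to a k8s cluster yet, so monitoring has to be installed on them separately to track their utilization.

Technology

I currently use NVIDIA/gpu-monitoring-tools for monitoring and connect it to Prometheus.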

Basic information

Machine (kernel version): Linux version 4.15.0-136-generic (buildd@lcy01-amd64-029) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021
GPU driver: NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0
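
The two values above can be collected on each machine with a small helper. This is only a sketch: the function name `show_base_info` is made up for illustration, and it assumes a Linux host where `nvidia-smi` is installed along with the driver.

```shell
# Print the kernel build string and, if available, the nvidia-smi banner.
# show_base_info is a hypothetical helper name, not part of any tool.
show_base_info() {
  # /proc/version holds the "Linux version ..." line; fall back to uname elsewhere.
  if [ -r /proc/version ]; then
    cat /proc/version
  else
    uname -a
  fi
  # The nvidia-smi header shows the driver and CUDA versions.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
  else
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
  fi
}
```

Running `show_base_info` before installing anything is a quick way to confirm the driver is working at all.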

Installation

Install Docker

apt install docker.io

Deploy the exporter

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
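
The container may need a few seconds before it answers on port 9400. A minimal sketch of a readiness check follows. The helper name `wait_for_metrics` and its default attempt count and delay are assumptions for illustration; only the URL comes from the port mapping above.

```shell
# Poll a metrics URL until it answers, or give up after a number of attempts.
# Arguments: URL, number of attempts, delay between attempts in seconds.
# wait_for_metrics is a hypothetical helper name; the defaults are arbitrary.
wait_for_metrics() {
  url="${1:-http://localhost:9400/metrics}"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: fail on HTTP errors, -s: quiet, --max-time: do not hang forever.
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_metrics && echo "exporter is up"` right after the `docker run` command.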

Test

curl localhost:9400/metrics
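
The raw output is long, so it can help to pull out a single metric. The sketch below assumes the exporter exposes `DCGM_FI_DEV_GPU_UTIL` with a `gpu="N"` label, which is the usual naming in dcgm-exporter; check your own output first. The function name `gpu_util_from_metrics` is made up for illustration.

```shell
# Read exporter output on stdin and print "<gpu index> <utilization %>" per GPU.
# Assumes the DCGM_FI_DEV_GPU_UTIL metric and its gpu="N" label are present.
gpu_util_from_metrics() {
  awk '
    /^DCGM_FI_DEV_GPU_UTIL[{]/ {
      # Find the gpu="N" label inside the label set.
      if (match($0, /[{,]gpu="[^"]*"/)) {
        label = substr($0, RSTART, RLENGTH)
        # Strip the leading {gpu=" or ,gpu=" and the trailing quote.
        gsub(/^[{,]gpu="|"$/, "", label)
        # The sample value is the last whitespace-separated field.
        print label, $NF
      }
    }
  '
}
```

Usage: `curl -s localhost:9400/metrics | gpu_util_from_metrics`.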

GPU metrics reference

You can refer to this article: (link)

Troubleshooting

If deploying the exporter fails with the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Fix it as follows (remember to restart Docker afterwards):

1. distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
2. curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
3. curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
4. apt-get update
5. apt-get install nvidia-container-toolkit
6. systemctl restart docker
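
Step 3 only works if the `distribution` string from step 1 matches a directory that the repository actually publishes. As a rough sketch, the helper below prints the URL step 3 would fetch, without downloading anything, so it can be checked by hand first. The function name `nvidia_docker_list_url` is hypothetical.

```shell
# Print the repo list URL that step 3 would fetch for a given os-release file.
# nvidia_docker_list_url is a hypothetical helper name; it does not download anything.
nvidia_docker_list_url() {
  os_release="${1:-/etc/os-release}"
  # Source the file in a subshell so ID and VERSION_ID do not leak into the caller.
  distribution=$( . "$os_release"; echo "$ID$VERSION_ID" )
  echo "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list"
}
```

On the Ubuntu 18.04 machines described above, `distribution` should come out as `ubuntu18.04`.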

Configure Prometheus

Once the exporter is deployed, configure Prometheus by adding a new target:

- job_name: 'GpuMonitor'
  scrape_interval: 300s
  scrape_timeout: 300s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - 10.10.10.1:9400
    - 10.10.10.2:9400
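
If the list of GPU machines changes often, the job above can be generated from host names instead of edited by hand. This is a sketch only: `gen_gpu_job` is a made-up helper, and it simply reprints the same settings as the job above with port 9400 appended to each host.

```shell
# Print the GpuMonitor scrape job for the hosts given as arguments.
# gen_gpu_job is a hypothetical helper; port 9400 matches the exporter mapping.
gen_gpu_job() {
  # '%s\n' is used as the format so lines starting with "-" are not parsed as options.
  printf '%s\n' "- job_name: 'GpuMonitor'"
  printf '%s\n' "  scrape_interval: 300s"
  printf '%s\n' "  scrape_timeout: 300s"
  printf '%s\n' "  metrics_path: /metrics"
  printf '%s\n' "  scheme: http"
  printf '%s\n' "  static_configs:"
  printf '%s\n' "  - targets:"
  for host in "$@"; do
    printf '    - %s:9400\n' "$host"
  done
}
```

For example, `gen_gpu_job 10.10.10.1 10.10.10.2` prints the job shown above, ready to paste under `scrape_configs`.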

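Configure Grafana

Pick a Grafana dashboard template from the official site; many of them work. I used this one: Grafana template

After importing it, take a look at the result:
(screenshot: png1)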

Permalink: https://blog.itmonkey.icu/2021/10/26/ubuntu-gpu-monitor-prometheus/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0 CN. Please credit the source when reposting!

猿的野生香蕉SRE

A veteran driver speeding down the ops road