Posts: 12
Threads: 2
Joined: Jul 2013
Hi guys,
Been experimenting with different cooling setups and thought this script might be handy for anyone who wants to keep an eye on their GPU temperatures. Very quickly put together, but it does the job.
Sample:
gputemp.sh
Code:
#!/bin/sh
export DISPLAY=:0
amdconfig --adapter=$1 --odgt | grep 'Temperature' | cut -d'-' -f2 | cut -c 2-3
echo 0
uptime | awk '{ gsub(/,/, ""); print $3, $4, $5; }'
uname -n
Then mrtg.cfg:
Code:
WorkDir: /var/www/mrtg/
Target[gpu0.temp]: `/path/to/gputemp.sh 0`
MaxBytes[gpu0.temp]: 99
Title[gpu0.temp]: gpu0 Temperature
PageTop[gpu0.temp]: <H1> gpu0 temperature</H1>
ShortLegend[gpu0.temp]: C
YLegend[gpu0.temp]: Celsius
Options[gpu0.temp]: growright,nopercent, nobanner, noinfo, gauge
Unscaled[gpu0.temp]: ymd
... repeat for each GPU you want to monitor, changing the parameter in the Target line (in my case I have 0 through 7).
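If you have a lot of cards, the stanzas could be generated with a loop instead of copied by hand. This is only a sketch: `gen_mrtg_targets` is a made-up helper name, `/path/to/gputemp.sh` is the same placeholder as above, and it assumes the adapters are numbered consecutively from 0.

```shell
#!/bin/sh
# gen_mrtg_targets (hypothetical helper): print one MRTG target stanza
# per GPU, numbered 0 .. count-1. The script path is a placeholder.
gen_mrtg_targets() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Unquoted here-document: $i is expanded, \` yields a literal backtick.
        cat <<EOF
Target[gpu$i.temp]: \`/path/to/gputemp.sh $i\`
MaxBytes[gpu$i.temp]: 99
Title[gpu$i.temp]: gpu$i Temperature
PageTop[gpu$i.temp]: <H1> gpu$i temperature</H1>
ShortLegend[gpu$i.temp]: C
YLegend[gpu$i.temp]: Celsius
Options[gpu$i.temp]: growright,nopercent, nobanner, noinfo, gauge
Unscaled[gpu$i.temp]: ymd

EOF
        i=$((i + 1))
    done
}

# Example: eight GPUs (0 through 7)
gen_mrtg_targets 8
```

Redirect the output into your config after the WorkDir line, and check the result before pointing MRTG at it.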
Posts: 649
Threads: 18
Joined: Nov 2010
That's awesome, well done.
Posts: 36
Threads: 5
Joined: Mar 2013
Have not had time to test it but it looks genius. <3
Posts: 2,936
Threads: 12
Joined: May 2012
nicely done!
quick suggestion, though: instead of piping grep into cut, and then into cut again, i would suggest using something like:
Code:
awk '/Temperature/ {printf("%d\n", $5); }'
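For anyone who wants to try the filter without a live card, here is a rough sketch that feeds it a captured line. It prints a trailing newline so each value lands on its own line, which is what MRTG expects from a script target. The sample text is an assumption about what `amdconfig --odgt` prints, and `$5` only holds if your driver uses the same layout; `parse_temp` is a made-up name.

```shell
#!/bin/sh
# parse_temp (hypothetical helper): read `amdconfig --odgt` style output
# on stdin and print the temperature as an integer, one value per line.
parse_temp() {
    awk '/Temperature/ { printf("%d\n", $5); }'
}

# Assumed sample output; compare against your own driver version.
sample='Default Adapter - AMD Radeon HD 7900 Series
            Sensor 0: Temperature - 66.00 C'

# Only the line containing "Temperature" matches; %d truncates 66.00 to 66.
printf '%s\n' "$sample" | parse_temp
```

In gputemp.sh the live equivalent would be `amdconfig --adapter="$1" --odgt | parse_temp`.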
Posts: 12
Threads: 2
Joined: Jul 2013
07-15-2013, 09:12 AM
(This post was last modified: 07-15-2013, 09:13 AM by gpufreak.)
@epixoip: thanks, that's cleaner :-)
Don't forget to set up a cron job to run MRTG every 5 minutes so your graphs get updated.
Code:
*/5 * * * * env LANG=C /usr/bin/mrtg /path/to/mrtg.cfg > /dev/null 2>&1
I use /var/www/mrtg as WorkDir, so a simple apt-get install apache2 works out of the box once the mrtg folder has the right permissions.
Then run "indexmaker" (from the MRTG package) to generate the index.html:
Code:
indexmaker -output=/var/www/mrtg/index.html /path/to/mrtg.cfg
Posts: 2,936
Threads: 12
Joined: May 2012
so this post got me thinking... this would actually be a great solution for remotely monitoring compute nodes in a VCL cluster, especially since ADL of course does not work with VCL.
so i whipped up this little shell script:
gpu_query
Code:
#!/bin/sh
export DISPLAY=:0
attr="$1"
dev="$2"
case "$attr" in
temp) amdconfig --adapter=$dev --odgt | awk '/Temperature/ { printf("%d\n", $5); }' ;;
core) amdconfig --adapter=$dev --odgc | awk '/Current\ Clocks/ { printf("%d\n", $4); }' ;;
mem ) amdconfig --adapter=$dev --odgc | awk '/Current\ Clocks/ { printf("%d\n", $5); }' ;;
load) amdconfig --adapter=$dev --odgc | awk '/GPU load/ { printf("%d\n", $4); }' ;;
fan ) DISPLAY=:0.$(($2 - 1)) amdconfig --pplib-cmd "get fanspeed 0" | awk '/Result/ { printf("%d\n", $4); }' ;;
* ) echo "$1: unknown option"; exit 2 ;;
esac
then added a few extends to my snmpd.conf:
Code:
extend gpu_temp_1 /usr/local/bin/gpu_query temp 1
extend gpu_temp_2 /usr/local/bin/gpu_query temp 2
extend gpu_temp_3 /usr/local/bin/gpu_query temp 3
extend gpu_fan_1 /usr/local/bin/gpu_query fan 1
extend gpu_fan_2 /usr/local/bin/gpu_query fan 2
extend gpu_fan_3 /usr/local/bin/gpu_query fan 3
extend gpu_load_1 /usr/local/bin/gpu_query load 1
extend gpu_load_2 /usr/local/bin/gpu_query load 2
extend gpu_load_3 /usr/local/bin/gpu_query load 3
extend gpu_core_1 /usr/local/bin/gpu_query core 1
extend gpu_core_2 /usr/local/bin/gpu_query core 2
extend gpu_core_3 /usr/local/bin/gpu_query core 3
extend gpu_mem_1 /usr/local/bin/gpu_query mem 1
extend gpu_mem_2 /usr/local/bin/gpu_query mem 2
extend gpu_mem_3 /usr/local/bin/gpu_query mem 3
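The extend lines follow a fixed pattern, so they could also be generated rather than typed. A minimal sketch, assuming the same install path and the 1-based device numbers used above; `gen_extends` is a made-up helper name.

```shell
#!/bin/sh
# gen_extends (hypothetical helper): print snmpd.conf "extend" lines for
# devices 1..count, one per attribute that gpu_query understands.
gen_extends() {
    count="$1"
    for attr in temp fan load core mem; do
        dev=1
        while [ "$dev" -le "$count" ]; do
            echo "extend gpu_${attr}_${dev} /usr/local/bin/gpu_query $attr $dev"
            dev=$((dev + 1))
        done
    done
}

# Example: three GPUs per node
gen_extends 3
```

Append the output to snmpd.conf and restart snmpd so the new extends are picked up.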
repeat on each compute node. install mrtg/cacti/whatever on the cluster controller, and voila.
Code:
epixoip@token:~$ snmpwalk -v 3 -u cacti -l authpriv -x AES -X cactisnmp -a SHA -A cactisnmp butters NET-SNMP-EXTEND-MIB::nsExtendOutput1Line
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_1" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_2" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_3" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_1" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_2" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_3" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_1" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_2" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_3" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_1" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_2" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_3" = STRING: 67
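If you want to feed these values to something other than cacti, the walk output is easy to reduce to plain "name value" pairs. A rough sketch, assuming the line format shown above; `extend_values` is a made-up name, and the two sample lines are copied from the output in this post.

```shell
#!/bin/sh
# extend_values (hypothetical helper): turn nsExtendOutput1Line lines on
# stdin into "name value" pairs, ignoring anything else.
extend_values() {
    awk -F'"' '/nsExtendOutput1Line/ && /= STRING:/ {
        name = $2                       # text between the double quotes
        value = $3
        sub(/^.*STRING: */, "", value)  # strip the " = STRING: " prefix
        print name, value
    }'
}

# Two lines taken from the snmpwalk output above
printf '%s\n' \
    'NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66' \
    'NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73' |
    extend_values
```

On a live system you would pipe the snmpwalk command itself into `extend_values`.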
Posts: 12
Threads: 2
Joined: Jul 2013
Great post!
I was doing it myself by rsyncing the /var/www/mrtg folder from the other nodes and merging the stats into one "index.html" with indexmaker, but SNMP is the "right way to do it" ;-)