mrtg script for monitoring temperature
#1
Hi guys,

Been experimenting with different cooling setups and thought this script might be handy for anyone who wants to keep an eye on their GPU temperatures. Very quickly put together, but it does the job.

Sample:

[Image: Screen_Shot_2013_07_14_at_22_25_42.jpg]

gputemp.sh

Code:
#!/bin/sh
export DISPLAY=:0
amdconfig --adapter=$1 --odgt | grep 'Temperature' | cut -d'-' -f2 | cut -c 2-3
echo 0
uptime | awk '{ gsub(/,/, ""); print $3, $4, $5; }'
uname -n
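
For anyone adapting this to another sensor: MRTG reads four lines from an external target script, namely two values, an uptime string and a target name. A rough sketch of that output format follows. `mrtg_report` is a made-up helper name and 66 is a placeholder reading, not real amdconfig output.

```shell
#!/bin/sh
# Print the four lines MRTG expects from an external target script.
# mrtg_report is a hypothetical helper name, not part of MRTG.
mrtg_report() {
    echo "$1"     # line 1: first value (the reading to graph)
    echo 0        # line 2: second value, unused here
    uptime | awk '{ gsub(/,/, ""); print $3, $4, $5 }'   # line 3: uptime text
    uname -n      # line 4: target name
}

# 66 is a placeholder; in gputemp.sh this would be the parsed temperature.
mrtg_report 66
```

Since only the first value is used, the "gauge" option in the config makes MRTG plot it as-is instead of treating it as a counter.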

Then mrtg.cfg:

Code:
WorkDir: /var/www/mrtg/

Target[gpu0.temp]: `/path/to/gputemp.sh 0`
MaxBytes[gpu0.temp]: 99
Title[gpu0.temp]: gpu0 Temperature
PageTop[gpu0.temp]: <H1> gpu0 temperature</H1>
ShortLegend[gpu0.temp]: C
YLegend[gpu0.temp]: Celsius
Options[gpu0.temp]: growright,nopercent, nobanner, noinfo, gauge
Unscaled[gpu0.temp]: ymd

Repeat this block for each GPU you want to monitor, changing the target name and the adapter number passed to the script in Target (in my case I have 0 through 7).
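
If you'd rather not copy and paste the block eight times, something like the following could generate it. This is only a sketch: `gen_gpu_targets` is a made-up name, and the path is the same placeholder as in the config above.

```shell
#!/bin/sh
# Print one MRTG stanza per GPU index, from 0 up to (count - 1).
# gen_gpu_targets is a hypothetical helper, not part of MRTG.
gen_gpu_targets() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Backticks are escaped so they end up literally in the config.
        cat <<EOF
Target[gpu$i.temp]: \`/path/to/gputemp.sh $i\`
MaxBytes[gpu$i.temp]: 99
Title[gpu$i.temp]: gpu$i Temperature
PageTop[gpu$i.temp]: <H1> gpu$i temperature</H1>
ShortLegend[gpu$i.temp]: C
YLegend[gpu$i.temp]: Celsius
Options[gpu$i.temp]: growright,nopercent, nobanner, noinfo, gauge
Unscaled[gpu$i.temp]: ymd

EOF
        i=$((i + 1))
    done
}

# Eight GPUs, numbered 0 through 7.
gen_gpu_targets 8
```

Append the output below the WorkDir line of your config file.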
#2
That's awesome, well done.
#3
Have not had time to test it but it looks genius. <3
#4
nicely done!

quick suggestion, though: instead of piping grep into cut, and then into cut again, i would suggest using something like:

Code:
awk '/Temperature/ { printf("%d\n", $5); }'
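
A quick side-by-side of the two approaches. The sample lines are an assumption about what the `--odgt` output looks like, inferred from the field positions both commands rely on; `temp_cut` and `temp_awk` are just illustrative names.

```shell
#!/bin/sh
# Original approach: grep, then cut on '-', then characters 2-3.
temp_cut() {
    grep 'Temperature' | cut -d'-' -f2 | cut -c 2-3
}

# awk approach: take field 5 and let %d truncate it to an integer.
temp_awk() {
    awk '/Temperature/ { printf("%d\n", $5); }'
}

# Assumed sample lines in the shape both parsers expect.
normal='Sensor 0: Temperature - 66.00 C'
hot='Sensor 0: Temperature - 101.00 C'

printf '%s\n' "$normal" | temp_cut   # 66
printf '%s\n' "$normal" | temp_awk   # 66
printf '%s\n' "$hot" | temp_cut      # 10  (fixed columns drop the last digit)
printf '%s\n' "$hot" | temp_awk      # 101
```

The fixed character range only holds up for two-digit readings, while the awk version follows the field wherever it lands.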
#5
@epixoip: thanks, that's cleaner :-)

Don't forget to set up a cron job to run MRTG every 5 minutes so your graphs get updated.

Code:
*/5 * * * * env LANG=C /usr/bin/mrtg /path/to/mrtg.cfg > /dev/null 2>&1

I use /var/www/mrtg as WorkDir, so a simple apt-get install apache2 works out of the box once the mrtg folder has the right permissions.

Then run "indexmaker" (from MRTG package) to generate the right index.html

Code:
indexmaker -output=/var/www/mrtg/index.html /path/to/mrtg.cfg
#6
so this post got me thinking... this would actually be a great solution for remotely monitoring compute nodes in a vcl cluster, especially since ADL of course does not work with VCL.

so i whipped up this little shell script:

gpu_query

Code:
#!/bin/sh

export DISPLAY=:0
attr="$1"
dev="$2"

case "$attr" in
        temp)   amdconfig --adapter=$dev --odgt | awk '/Temperature/ { printf("%d\n", $5); }' ;;
        core)   amdconfig --adapter=$dev --odgc | awk '/Current\ Clocks/ { printf("%d\n", $4); }' ;;
        mem )   amdconfig --adapter=$dev --odgc | awk '/Current\ Clocks/ { printf("%d\n", $5); }' ;;
        load)   amdconfig --adapter=$dev --odgc | awk '/GPU load/ { printf("%d\n", $4); }' ;;
        fan )   DISPLAY=:0.$(($2 - 1)) amdconfig --pplib-cmd "get fanspeed 0" | awk '/Result/ { printf("%d\n", $4); }' ;;
        *   )   echo "$1: unknown option"; exit 2 ;;
esac


then added a few extends to my snmpd.conf:

Code:
extend gpu_temp_1       /usr/local/bin/gpu_query temp 1
extend gpu_temp_2       /usr/local/bin/gpu_query temp 2
extend gpu_temp_3       /usr/local/bin/gpu_query temp 3
extend gpu_fan_1        /usr/local/bin/gpu_query fan 1
extend gpu_fan_2        /usr/local/bin/gpu_query fan 2
extend gpu_fan_3        /usr/local/bin/gpu_query fan 3
extend gpu_load_1       /usr/local/bin/gpu_query load 1
extend gpu_load_2       /usr/local/bin/gpu_query load 2
extend gpu_load_3       /usr/local/bin/gpu_query load 3
extend gpu_core_1       /usr/local/bin/gpu_query core 1
extend gpu_core_2       /usr/local/bin/gpu_query core 2
extend gpu_core_3       /usr/local/bin/gpu_query core 3
extend gpu_mem_1        /usr/local/bin/gpu_query mem 1
extend gpu_mem_2        /usr/local/bin/gpu_query mem 2
extend gpu_mem_3        /usr/local/bin/gpu_query mem 3
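
The extend lines follow a regular pattern, so on nodes with a different GPU count they could be generated instead of typed. A rough sketch, assuming the same gpu_query path and 1-based device numbers as above; `gen_extends` is a made-up name.

```shell
#!/bin/sh
# Print snmpd.conf extend lines for every attribute and device number.
# gen_extends is a hypothetical helper; devices are numbered from 1.
gen_extends() {
    count="$1"
    for attr in temp fan load core mem; do
        dev=1
        while [ "$dev" -le "$count" ]; do
            printf 'extend gpu_%s_%s /usr/local/bin/gpu_query %s %s\n' \
                "$attr" "$dev" "$attr" "$dev"
            dev=$((dev + 1))
        done
    done
}

# Three GPUs per node, as in the list above.
gen_extends 3
```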

repeat on each compute node. install mrtg/cacti/whatever on the cluster controller, and voila.

Code:
epixoip@token:~$ snmpwalk -v 3 -u cacti -l authpriv -x AES -X cactisnmp -a SHA -A cactisnmp butters NET-SNMP-EXTEND-MIB::nsExtendOutput1Line

NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_1" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_2" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_3" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_1" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_2" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_3" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_1" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_2" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_3" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_1" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_2" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_3" = STRING: 67
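
To feed one of these values into a script on the controller, the walk output can be filtered by extend name. A minimal sketch that works on the text format shown above; `extend_value` is a made-up name, and the sample lines are copied from the walk rather than fetched live.

```shell
#!/bin/sh
# Print the value for one extend name from snmpwalk-style text on stdin.
# extend_value is a hypothetical helper, not part of Net-SNMP.
extend_value() {
    awk -v name="$1" -F' = STRING: ' '
        index($1, "\"" name "\"") { print $2; exit }
    '
}

# Sample lines taken from the snmpwalk output above.
extend_value gpu_temp_2 <<'EOF'
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74
EOF
```

In real use the heredoc would be replaced by piping in the output of the snmpwalk command.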
#7
Great post!

I was doing it myself with an rsync on the /var/www/mrtg folder, syncing stats from the other nodes into a merged "index.html" from indexmaker, but SNMP is the "right way to do it" ;-)