mrtg script for monitoring temperature
#6
so this post got me thinking... this would actually be a great solution for remotely monitoring compute nodes in a VCL cluster, especially since ADL of course does not work with VCL.

so i whipped up this little shell script:

gpu_query

Code:
#!/bin/sh
# usage: gpu_query {temp|core|mem|load|fan} <adapter>

# export so amdconfig inherits DISPLAY even if snmpd's environment lacks it
export DISPLAY=:0
attr="$1"
dev="$2"

case "$attr" in
        temp)   amdconfig --adapter="$dev" --odgt | awk '/Temperature/ { printf("%d\n", $5); }' ;;
        core)   amdconfig --adapter="$dev" --odgc | awk '/Current Clocks/ { printf("%d\n", $4); }' ;;
        mem )   amdconfig --adapter="$dev" --odgc | awk '/Current Clocks/ { printf("%d\n", $5); }' ;;
        load)   amdconfig --adapter="$dev" --odgc | awk '/GPU load/ { printf("%d\n", $4); }' ;;
        # fan speed is read per X screen (:0.N, counted from zero) instead of via --adapter
        fan )   DISPLAY=:0.$((dev - 1)) amdconfig --pplib-cmd "get fanspeed 0" | awk '/Result/ { printf("%d\n", $4); }' ;;
        *   )   echo "$attr: unknown option" >&2; exit 2 ;;
esac


then added a few extends to my snmpd.conf:

Code:
extend gpu_temp_1       /usr/local/bin/gpu_query temp 1
extend gpu_temp_2       /usr/local/bin/gpu_query temp 2
extend gpu_temp_3       /usr/local/bin/gpu_query temp 3
extend gpu_fan_1        /usr/local/bin/gpu_query fan 1
extend gpu_fan_2        /usr/local/bin/gpu_query fan 2
extend gpu_fan_3        /usr/local/bin/gpu_query fan 3
extend gpu_load_1       /usr/local/bin/gpu_query load 1
extend gpu_load_2       /usr/local/bin/gpu_query load 2
extend gpu_load_3       /usr/local/bin/gpu_query load 3
extend gpu_core_1       /usr/local/bin/gpu_query core 1
extend gpu_core_2       /usr/local/bin/gpu_query core 2
extend gpu_core_3       /usr/local/bin/gpu_query core 3
extend gpu_mem_1        /usr/local/bin/gpu_query mem 1
extend gpu_mem_2        /usr/local/bin/gpu_query mem 2
extend gpu_mem_3        /usr/local/bin/gpu_query mem 3
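
if you don't feel like typing those out for every node, the list can be generated. a minimal sketch, assuming the same attribute names and script path as above; gen_extends and NGPUS are names made up for this example, not part of net-snmp:

```shell
#!/bin/sh
# Print snmpd.conf "extend" lines for every attribute/GPU pair.
# gen_extends and NGPUS are hypothetical names used only in this sketch.
gen_extends() {
    ngpus="$1"
    for attr in temp fan load core mem; do
        dev=1
        while [ "$dev" -le "$ngpus" ]; do
            # same layout as the hand-written lines: extend <name> <command> <args>
            printf 'extend gpu_%s_%d\t/usr/local/bin/gpu_query %s %d\n' \
                "$attr" "$dev" "$attr" "$dev"
            dev=$((dev + 1))
        done
    done
}

# default to 3 GPUs, matching the config above
gen_extends "${NGPUS:-3}"
```

redirect the output into your snmpd.conf (or a file it includes) and adjust the GPU count per node.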

repeat on each compute node and restart snmpd so it picks up the new extends. then install mrtg/cacti/whatever on the cluster controller, and voila.

Code:
epixoip@token:~$ snmpwalk -v 3 -u cacti -l authpriv -x AES -X cactisnmp -a SHA -A cactisnmp butters NET-SNMP-EXTEND-MIB::nsExtendOutput1Line

NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_1" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_2" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_3" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_1" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_2" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_3" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_1" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_2" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_3" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_1" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_2" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_3" = STRING: 67
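
if you go the mrtg route with an external target script, mrtg expects four lines back from it: two values, an uptime string, and a target name. here is a rough sketch of turning one of the lines above into that shape. extend_value and mrtg_lines are hypothetical helper names, the uptime is just a placeholder string, and in a real setup the input line would come from an snmpget against the node rather than a hardcoded sample:

```shell
#!/bin/sh
# Pull the number out of one '... = STRING: N' line of snmpget/snmpwalk output.
extend_value() {
    awk -F'STRING: ' 'NF > 1 { gsub(/"/, "", $2); print $2 + 0; exit }'
}

# Emit the four lines an MRTG external target script returns:
# first value, second value, uptime string, target name.
# The same reading is printed twice because there is only one value per extend.
mrtg_lines() {
    value="$1"
    name="$2"
    uptime="${3:-unknown}"
    printf '%s\n%s\n%s\n%s\n' "$value" "$value" "$uptime" "$name"
}

# sample line copied from the snmpwalk output above
line='NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74'
mrtg_lines "$(printf '%s\n' "$line" | extend_value)" gpu_temp_2
```

since temperatures and loads are point-in-time readings rather than counters, the matching mrtg target would most likely want the gauge option set.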


Messages In This Thread
RE: mrtg script for monitoring temperature - by epixoip - 07-15-2013, 07:12 PM