07-15-2013, 07:12 PM
so this post got me thinking... this would actually be a great solution for remotely monitoring compute nodes in a vcl cluster, especially since ADL of course does not work with VCL.
so i whipped up this little shell script:
gpu_query
Code:
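#!/bin/sh
# gpu_query <attr> <adapter>: print a single GPU reading as an integer
# exported so amdconfig sees it even when DISPLAY is not already in the environment (e.g. under snmpd)
export DISPLAY=:0
attr="$1"
dev="$2"
case "$attr" in
temp) amdconfig --adapter="$dev" --odgt | awk '/Temperature/ { printf("%d\n", $5); }' ;;
core) amdconfig --adapter="$dev" --odgc | awk '/Current Clocks/ { printf("%d\n", $4); }' ;;
mem ) amdconfig --adapter="$dev" --odgc | awk '/Current Clocks/ { printf("%d\n", $5); }' ;;
load) amdconfig --adapter="$dev" --odgc | awk '/GPU load/ { printf("%d\n", $4); }' ;;
# the fan query goes through the X screen rather than --adapter: adapter N maps to screen :0.(N-1)
fan ) DISPLAY=:0.$(($dev - 1)) amdconfig --pplib-cmd "get fanspeed 0" | awk '/Result/ { printf("%d\n", $4); }' ;;
* ) echo "$attr: unknown option"; exit 2 ;;
esac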
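for anyone adapting the awk bits: each one just picks a whitespace-separated field off the matching line, and printf("%d") drops the decimals or a trailing % sign. a small sketch of that, run against canned text instead of amdconfig. the sample text is an assumption, reconstructed from the patterns and field numbers in the script, so compare it with what your driver version actually prints before trusting the field numbers. the parse_* function names are made up for the example.

```shell
#!/bin/sh
# ASSUMED sample output, reconstructed from the awk patterns in gpu_query;
# real amdconfig output may differ between driver versions.
sample_odgt='Default Adapter - Example GPU
                  Sensor 0: Temperature - 66.00 C'
sample_odgc='Default Adapter - Example GPU
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1630
                 GPU load :    99%'

# Same awk programs as in gpu_query, reading stdin instead of amdconfig.
# (hypothetical helper names, for illustration only)
parse_temp() { awk '/Temperature/ { printf("%d\n", $5); }'; }
parse_core() { awk '/Current Clocks/ { printf("%d\n", $4); }'; }
parse_mem()  { awk '/Current Clocks/ { printf("%d\n", $5); }'; }
parse_load() { awk '/GPU load/ { printf("%d\n", $4); }'; }

printf '%s\n' "$sample_odgt" | parse_temp   # field 5 is "66.00", %d truncates it
printf '%s\n' "$sample_odgc" | parse_core   # field 4 on the "Current Clocks" line
printf '%s\n' "$sample_odgc" | parse_mem    # field 5 on the same line
printf '%s\n' "$sample_odgc" | parse_load   # field 4 is "99%", %d keeps the leading digits
```

if a driver update shifts a column, only the field number in the matching awk program needs to change.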
then added a few extends to my snmpd.conf:
Code:
extend gpu_temp_1 /usr/local/bin/gpu_query temp 1
extend gpu_temp_2 /usr/local/bin/gpu_query temp 2
extend gpu_temp_3 /usr/local/bin/gpu_query temp 3
extend gpu_fan_1 /usr/local/bin/gpu_query fan 1
extend gpu_fan_2 /usr/local/bin/gpu_query fan 2
extend gpu_fan_3 /usr/local/bin/gpu_query fan 3
extend gpu_load_1 /usr/local/bin/gpu_query load 1
extend gpu_load_2 /usr/local/bin/gpu_query load 2
extend gpu_load_3 /usr/local/bin/gpu_query load 3
extend gpu_core_1 /usr/local/bin/gpu_query core 1
extend gpu_core_2 /usr/local/bin/gpu_query core 2
extend gpu_core_3 /usr/local/bin/gpu_query core 3
extend gpu_mem_1 /usr/local/bin/gpu_query mem 1
extend gpu_mem_2 /usr/local/bin/gpu_query mem 2
extend gpu_mem_3 /usr/local/bin/gpu_query mem 3
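those extend lines follow a fixed pattern (extend gpu_ATTR_N, then the script path, then ATTR and N), so on nodes with a different number of gpus they can be generated instead of typed. a minimal sketch, where gen_extends is a made-up helper name and the default script path is just the one used above:

```shell
#!/bin/sh
# Print "extend" lines for snmpd.conf: one per attribute per GPU.
# usage: gen_extends <gpu_count> [script_path]
# (gen_extends is a hypothetical helper, not part of net-snmp)
gen_extends() {
    count="$1"
    script="${2:-/usr/local/bin/gpu_query}"
    for attr in temp fan load core mem; do
        i=1
        while [ "$i" -le "$count" ]; do
            echo "extend gpu_${attr}_${i} ${script} ${attr} ${i}"
            i=$((i + 1))
        done
    done
}

# Three GPUs reproduces the block of extend lines shown above.
gen_extends 3
```

append the output to snmpd.conf on the node (wherever your distro keeps it) and restart snmpd so it picks up the new extends.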
repeat on each compute node. install mrtg/cacti/whatever on the cluster controller, and voila.
Code:
epixoip@token:~$ snmpwalk -v 3 -u cacti -l authpriv -x AES -X cactisnmp -a SHA -A cactisnmp butters NET-SNMP-EXTEND-MIB::nsExtendOutput1Line
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_1" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_2" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_fan_3" = STRING: 100
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_1" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_2" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_mem_3" = STRING: 1630
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_1" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_2" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_core_3" = STRING: 925
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_1" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_2" = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_3" = STRING: 67
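if you'd rather script against this than graph it, the walk output is easy to boil down to name/value pairs. a rough sketch, assuming the output format shown above; parse_extend and hot_gpus are made-up names and the 70 degree limit is an arbitrary example, not a recommendation. the sample lines are copied from the walk above.

```shell
#!/bin/sh
# Turn nsExtendOutput1Line lines from snmpwalk into "name value" pairs.
# Splitting on the double quote leaves the extend name in $2 and
# ' = STRING: <value>' in $3; the value is the last word of $3.
parse_extend() {
    awk -F'"' '/nsExtendOutput1Line/ { n = split($3, parts, " "); print $2, parts[n] }'
}

# Print the names of gpu_temp_* entries reading above a limit (degrees C).
hot_gpus() {
    limit="$1"
    parse_extend | awk -v limit="$limit" '$1 ~ /^gpu_temp_/ && $2 + 0 > limit + 0 { print $1 }'
}

# A few lines copied from the snmpwalk output above.
sample='NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_load_3" = STRING: 73
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_1" = STRING: 66
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_2" = STRING: 74
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."gpu_temp_3" = STRING: 67'

printf '%s\n' "$sample" | hot_gpus 70
```

in real use, pipe the snmpwalk command itself into hot_gpus instead of the sample text, e.g. from a cron job on the cluster controller.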