Virtualization for the win
#1
Hi! I haven't seen much about using VMs for GPU workloads, and I haven't seen anything about it on this forum. So I thought I would write a bit about it here, because I think it's great.

Summary
It works. No performance penalty found, yet.
TL;DR version at the bottom.

About me
I work pretty much every day designing, installing and maintaining small to almost-enterprise-size VMware virtualization systems, using almost exclusively HPE hardware. It's not ALL I do, but it's a major part of my work.
My posts here are my own and are not endorsed, or even known, by my employer, HPE or VMware.

Why virtualize?
(Warning: some VMware marketspeak included)
Virtualization, generally, provides the following benefits:
  • Flexibility
Virtual guests can be moved around to different hardware and physical locations without downtime
Resources can be assigned from a pool, rather than being a fixed size decided at installation time
  • Agility
Resources can be quickly changed; CPU/memory can be hot-added
Network redesign can be done without affecting the guests
VMs can be deployed "instantly", without having to order new hardware
  • Availability
In case of a host (hardware) failure, guests can be automatically restarted on other hosts
The much-reduced virtual hardware presented to guests reduces the number of drivers that can cause problems

In this case, using ESXi to host guests running oclHashcat, the following benefits are also realized:
  • Hardware utilization
Using a physical box just to power GPUs is a waste of space, power and CPU resources. Why not use the box for other purposes at the same time? (Like running a Hashtopus installation.)
  • Maintenance made easy
It is a known issue on all (?) OSes that AMD and NVIDIA drivers cause each other problems. Using virtualization, you can install separate guests for AMD and NVIDIA GPUs, with the right drivers in each, without conflicts. Or why not have one guest per GPU?
OS, driver and hashcat updates may cause problems or performance issues. Resolution: power down the guest, snapshot it, power up, perform the update(s) and test. If issues occur, revert to the snapshot. On a multi-GPU box, do this for one VM with one GPU; when the update procedure is finalized, repeat it on the other guests, so downtime is greatly reduced. (See the vim-cmd sketch after this list.)
  • Different versions
Different versions of hashcat may perform better, or not at all, with some GPUs. Install as many guests as you need, with the different versions that fit your needs. Move the GPU around to whichever guest best fits the situation. Run different versions for different GPUs, all in the same box.
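
The snapshot workflow above is easy to drive from the ESXi shell with vim-cmd. Here is a rough sketch; the VM ID and snapshot ID are hypothetical, so look yours up with the list commands first, and verify the argument order against your ESXi version.
Code:
# Find the ID of the VM you want to update
vim-cmd vmsvc/getallvms
# Snapshot before updating: vmid, name, description, includeMemory, quiesced
vim-cmd vmsvc/snapshot.create 42 "pre-update" "before driver/hashcat update" 0 0
# ...perform the update inside the guest, then test...
# If it broke: list the snapshots, then revert (vmid, snapshotId, suppressPowerOff)
vim-cmd vmsvc/snapshot.get 42
vim-cmd vmsvc/snapshot.revert 42 1 0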

Why NOT virtualize?
The old-school way of thinking is that "if you need performance, you need dedicated hardware". This is not true anymore: first, hypervisors have matured and do not cause much overhead. Second, most of the overhead from virtualization is on the CPU side, and CPUs today are hardly ever bottlenecks - and if they are, you can add more cores. The only case where you may actually require dedicated hardware is if your application needs very high single-thread performance. Well, that and if your idiot vendor says virtualization is unsupported...

A real reason not to virtualize is increased complexity. Yes, virtualization makes the system more complex; it adds another layer of "stuff" that you need to plan, maintain and troubleshoot. If you are unwilling to learn something new, buy hardware.

Design
The goal here is to install a vSphere host with one or more GPUs. These GPUs are then assigned to VMs using PCI passthrough. The operating system is ESXi; it is available for free from vmware.com. You need to register and check out a license, or it will not power on guests after 60 days. The free license is limited to 8 vCPUs per guest and cannot be connected to VMware vCenter (=> no cluster).
ESXi only supports/contains drivers for a subset of hardware. It can be installed on consumer hardware if that hardware is built from supported chips; unsupported hardware may also work with community drivers. Pretty much all ready-made servers today support ESXi, but a lot of servers do not support high-power GPUs, or have insufficient room for large GPU coolers, or do not have (enough) PCI-e power.
There are a lot of hardware options; you need to research a bit to find something that works. I use an HPE ML350p Gen8: it can fit two dual-slot GPUs per CPU, but you need a special power cable for PCI-e power. Fortunately, the power socket was compatible with Corsair modular PSU power cables.


Preparation
Download ESXi. If you have an HPE/Dell server, download the custom ISO or you may not have the drivers you need. You can add drivers to the installation CD, but you cannot add drivers on the fly during the installation.
Fit the GPU and make sure it has enough power.
When you (later) add the GPU to your VM, it will reserve 100% of the RAM you configured for the VM. If you need 4 GB RAM per VM and 8 VMs, you need at least 4 GB x 8 + (overhead + ESXi) RAM ≈ 34 GB. You are not constrained to whole GBs here; you may assign 3750 MB RAM instead, if you want to.
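As a back-of-the-envelope check with those numbers (the ~2 GB figure for per-VM overhead plus ESXi itself is a rough assumption, not a measurement):
Code:
# 8 guests x 4096 MB, plus ~2 GB for per-VM overhead and ESXi itself
echo $(( 8 * 4096 + 2048 )) MB   # 34816 MB, i.e. roughly 34 GB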
(Optional:
I recommend setting the host advanced option "Mem.ShareForceSalting" to 0; this lets me overallocate RAM better (it enables RAM dedup between guests). It won't make a difference on guests with passthrough, though.
Since these VMs will be in an isolated environment (I hope), I would also turn off ASLR in the guests to increase RAM dedup.
Overallocating CPU is OK, but you will notice increased latency for certain applications. Web servers generally do not perform well on overallocated CPU - you may apply this fix to the VM if you think you need it: http://kb.vmware.com/kb/1018276. But consider that you will be burning CPU that could be used for other VMs instead. Also, those idle loops count towards the guest's resource shares - you may end up using half your CPU for nothing and not being eligible for CPU when you need it. SQL servers are not bothered by CPU latency; performance decreases with less CPU available, but they use the available cycles effectively.
If you REALLY want to reserve an entire core (or several) for your guest, change the CPU reservation to 100% and set the advanced option "monitor_control.halt_desched" to false. Your VM will then never be required to share its core. Also, if you have hyperthreading active (you should), you can ensure that the other thread is not used by setting Hyperthreaded core sharing: none. This ensures that your VM has 100% of the core's cache and CPU time. (See the sketch after this section for these settings.)
)
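For reference, here is a minimal sketch of the optional tweaks above. The exact option names are my best recollection; verify them against your ESXi and guest OS versions before relying on them.
Code:
# On the ESXi host: allow transparent page sharing (RAM dedup) between guests
esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0

# Inside each Linux guest (isolated lab only!): turn off ASLR so identical
# pages dedup better between guests
echo "kernel.randomize_va_space = 0" >> /etc/sysctl.conf
sysctl -p

# In the VM's .vmx file, with the VM powered off: never deschedule the vCPU
# monitor_control.halt_desched = "FALSE"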

Installation
Install ESXi.
Connect to ESXi using a web browser and download the vSphere Client (the client is version specific, and the link points to an Internet URL).
Configure time and date. This is important: when a guest is restarted or a snapshot is reverted, the guest time is set to whatever ESXi has. It is a recurring problem that this step is forgotten and VMs start up with the wrong time/date.
In the vSphere Client, select your host, go to the Configuration tab, option Advanced Settings. Enable PCI passthrough on your GPU. Reboot.
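If you want to double-check the device from the ESXi shell, you can list the host's PCI devices; the grep pattern below is just an example and assumes an AMD/NVIDIA card:
Code:
# List PCI devices seen by the host and look for the GPU
esxcli hardware pci list | grep -i -A 2 "AMD\|ATI\|NVIDIA"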
Install a VM to run oclHashcat. I use Ubuntu according to this guide: https://hashcat.net/wiki/doku.php?id=linux_server_howto Thus, I have a 64-bit-only installation, version 14.04, with 'fglrx' drivers.
Install open-vm-tools; trust me, you want it. You will then be able to right-click "Shut Down" instead of "Power Off", and you will see the guest's IP addresses in the vSphere Client.
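On Ubuntu 14.04 that amounts to something like the following (assuming the 'fglrx' package from the restricted repository is the driver you want; adjust to your setup):
Code:
# Guest tools (clean shutdown, IP reporting in vSphere Client) + AMD driver
sudo apt-get update
sudo apt-get install -y open-vm-tools fglrx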
When the VM is installed and updated and works like you want it to, shut it down and edit the virtual hardware.
Add the PCI device (your GPU).
Start your VM, get oclHashcat.
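Before running a benchmark, it is worth confirming that the guest actually sees the passed-through GPU; one quick way, inside the guest:
Code:
# The passed-through GPU should show up as an ordinary PCI device in the guest
lspci | grep -i -E "vga|display"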
Done!

Benchmarking
This is from my server, running ESXi 5.5 on an HPE ML350p Gen8, with a second-hand Radeon 6970. The host is running a total of 12 VMs with a total of 17 vCPUs on a single 4-core Intel E5-2609 (no hyperthreading), and has 48 GB of physical RAM with 60 GB assigned to guests.

CPU usage during the benchmark averaged ~250 MHz, or 10% of core speed. That's a lot of CPU that can be used for other things... ;) It's hash dependent, though; some hashes actually used 97% CPU.
Code:
oclHashcat v2.01 starting in benchmark-mode...

Device #1: Cayman, 2010MB, 880Mhz, 24MCU

Hashtype: MD4
Workload: 1024 loops, 256 accel

Speed.GPU.#1.: 10566.3 MH/s

Hashtype: MD5
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  5698.3 MH/s

Hashtype: Half MD5
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3465.1 MH/s

Hashtype: SHA1
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1895.7 MH/s

Hashtype: SHA256
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   772.0 MH/s

Hashtype: SHA384
Workload: 256 loops, 256 accel

Speed.GPU.#1.:   214.7 MH/s

Hashtype: SHA512
Workload: 256 loops, 256 accel

Speed.GPU.#1.:   217.4 MH/s

Hashtype: SHA-3(Keccak)
Workload: 512 loops, 256 accel

Hashtype: SipHash
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  5490.3 MH/s

Hashtype: RipeMD160
Workload: 512 loops, 256 accel

Speed.GPU.#1.:  1209.3 MH/s

Hashtype: Whirlpool
Workload: 512 loops, 32 accel

Speed.GPU.#1.: 76044.8 kH/s

Hashtype: GOST R 34.11-94
Workload: 512 loops, 64 accel

Speed.GPU.#1.: 59761.0 kH/s

Hashtype: GOST R 34.11-2012 (Streebog) 256-bit
Workload: 512 loops, 16 accel

Speed.GPU.#1.: 11139.3 kH/s

Hashtype: GOST R 34.11-2012 (Streebog) 512-bit
Workload: 512 loops, 16 accel

Speed.GPU.#1.: 11049.8 kH/s

Hashtype: phpass, MD5(Wordpress), MD5(phpBB3), MD5(Joomla)
Workload: 1024 loops, 32 accel

Speed.GPU.#1.:  1583.9 kH/s

Hashtype: scrypt
Workload: 1 loops, 64 accel

Speed.GPU.#1.:   167.7 kH/s

Hashtype: PBKDF2-HMAC-MD5
Workload: 1000 loops, 8 accel

Speed.GPU.#1.:   467.4 kH/s

Hashtype: PBKDF2-HMAC-SHA1
Workload: 1000 loops, 8 accel

Speed.GPU.#1.:   638.0 kH/s

Hashtype: PBKDF2-HMAC-SHA256
Workload: 1000 loops, 8 accel

Speed.GPU.#1.:   324.2 kH/s

Hashtype: PBKDF2-HMAC-SHA512
Workload: 1000 loops, 8 accel

Speed.GPU.#1.:    67549 H/s

Hashtype: Skype
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3149.5 MH/s

Hashtype: WPA/WPA2
Workload: 1024 loops, 32 accel

Speed.GPU.#1.:    95159 H/s

Hashtype: IKE-PSK MD5
Workload: 256 loops, 128 accel

Speed.GPU.#1.:   225.9 MH/s

Hashtype: IKE-PSK SHA1
Workload: 256 loops, 128 accel

Speed.GPU.#1.:   158.7 MH/s

Hashtype: NetNTLMv1-VANILLA / NetNTLMv1+ESS
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  5379.0 MH/s

Hashtype: NetNTLMv2
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   262.5 MH/s

Hashtype: IPMI2 RAKP HMAC-SHA1
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   350.3 MH/s

Hashtype: Kerberos 5 AS-REQ Pre-Auth etype 23
Workload: 128 loops, 32 accel

Speed.GPU.#1.: 13883.0 kH/s

Hashtype: DNSSEC (NSEC3)
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   586.1 MH/s

Hashtype: PostgreSQL Challenge-Response Authentication (MD5)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1075.2 MH/s

Hashtype: MySQL Challenge-Response Authentication (SHA1)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:   544.2 MH/s

Hashtype: SIP digest authentication (MD5)
Workload: 1024 loops, 32 accel

Speed.GPU.#1.:   291.4 MH/s

Hashtype: SMF > v1.1
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1607.9 MH/s

Hashtype: vBulletin < v3.8.5
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1513.6 MH/s

Hashtype: vBulletin > v3.8.5
Workload: 512 loops, 256 accel

Speed.GPU.#1.:  1089.3 MH/s

Hashtype: IPB2+, MyBB1.2+
Workload: 512 loops, 256 accel

Speed.GPU.#1.:  1136.0 MH/s

Hashtype: WBB3, Woltlab Burning Board 3
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   223.8 MH/s

Hashtype: Joomla < 2.5.18
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  5697.9 MH/s

Hashtype: PHPS
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1513.7 MH/s

Hashtype: Drupal7
Workload: 1024 loops, 8 accel

Speed.GPU.#1.:     8962 H/s

Hashtype: osCommerce, xt:Commerce
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3149.4 MH/s

Hashtype: PrestaShop
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1903.2 MH/s

Hashtype: Django (SHA-1)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1607.9 MH/s

Hashtype: Django (PBKDF2-SHA256)
Workload: 1024 loops, 8 accel

Speed.GPU.#1.:    16607 H/s

Hashtype: Mediawiki B type
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1467.5 MH/s

Hashtype: Redmine Project Management Web App
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:   359.3 MH/s

Hashtype: PostgreSQL
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  5696.5 MH/s

Hashtype: MSSQL(2000)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  2011.1 MH/s

Hashtype: MSSQL(2005)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  2011.9 MH/s

Hashtype: MSSQL(2012)
Workload: 256 loops, 256 accel

Speed.GPU.#1.:   215.8 MH/s

Hashtype: MySQL323
Workload: 1024 loops, 256 accel

Speed.GPU.#1.: 11829.0 MH/s

Hashtype: MySQL4.1/MySQL5
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   890.7 MH/s

Hashtype: Oracle H: Type (Oracle 7+)
Workload: 128 loops, 64 accel

Speed.GPU.#1.:   188.7 MH/s

Hashtype: Oracle S: Type (Oracle 11+)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1904.2 MH/s

Hashtype: Oracle T: Type (Oracle 12+)
Workload: 1024 loops, 8 accel

Speed.GPU.#1.:    16568 H/s

Hashtype: Sybase ASE
Workload: 512 loops, 32 accel

Speed.GPU.#1.: 88246.0 kH/s

Hashtype: EPiServer 6.x < v4
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1607.8 MH/s

Hashtype: EPiServer 6.x > v4
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   675.3 MH/s

Hashtype: md5apr1, MD5(APR), Apache MD5
Workload: 1000 loops, 32 accel

Speed.GPU.#1.:  2459.1 kH/s

Hashtype: ColdFusion 10+
Workload: 256 loops, 128 accel

Speed.GPU.#1.:   372.8 MH/s

Hashtype: hMailServer
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   675.3 MH/s

Hashtype: SHA-1(Base64), nsldap, Netscape LDAP SHA
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1902.1 MH/s

Hashtype: SSHA-1(Base64), nsldaps, Netscape LDAP SSHA
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1870.7 MH/s

Hashtype: SSHA-512(Base64), LDAP {SSHA512}
Workload: 256 loops, 256 accel

Speed.GPU.#1.:   217.4 MH/s

Hashtype: LM
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3851.4 MH/s

Hashtype: NTLM
Workload: 1024 loops, 256 accel

Speed.GPU.#1.: 10536.4 MH/s

Hashtype: Domain Cached Credentials (DCC), MS Cache
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  2919.9 MH/s

Hashtype: Domain Cached Credentials 2 (DCC2), MS Cache 2
Workload: 1024 loops, 16 accel

Speed.GPU.#1.:    76150 H/s

Hashtype: MS-AzureSync PBKDF2-HMAC-SHA256
Workload: 100 loops, 256 accel

Speed.GPU.#1.:  3050.6 kH/s

Hashtype: descrypt, DES(Unix), Traditional DES
Workload: 1024 loops, 64 accel

Speed.GPU.#1.: 77154.6 kH/s

Hashtype: BSDiCrypt, Extended DES
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1104.9 kH/s

Hashtype: md5crypt, MD5(Unix), FreeBSD MD5, Cisco-IOS MD5
Workload: 1000 loops, 32 accel

Speed.GPU.#1.:  2457.2 kH/s

Hashtype: bcrypt, Blowfish(OpenBSD)
Workload: 32 loops, 2 accel

Speed.GPU.#1.:     2320 H/s

Hashtype: sha256crypt, SHA256(Unix)
Workload: 1024 loops, 4 accel

Speed.GPU.#1.:    92315 H/s

Hashtype: sha512crypt, SHA512(Unix)
Workload: 1024 loops, 8 accel

Speed.GPU.#1.:     8286 H/s

Hashtype: OSX v10.4, v10.5, v10.6
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1607.9 MH/s

Hashtype: OSX v10.7
Workload: 256 loops, 256 accel

Speed.GPU.#1.:   180.9 MH/s

Hashtype: OSX v10.8+
Workload: 1024 loops, 2 accel

Speed.GPU.#1.:     1724 H/s

Hashtype: AIX {smd5}
Workload: 1000 loops, 32 accel

Speed.GPU.#1.:  2460.1 kH/s

Hashtype: AIX {ssha1}
Workload: 64 loops, 128 accel

Speed.GPU.#1.: 10458.6 kH/s

Hashtype: AIX {ssha256}
Workload: 64 loops, 128 accel

Speed.GPU.#1.:  4584.9 kH/s

Hashtype: AIX {ssha512}
Workload: 64 loops, 32 accel

Speed.GPU.#1.:  1074.2 kH/s

Hashtype: Cisco-PIX MD5
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3994.7 MH/s

Hashtype: Cisco-ASA MD5
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3981.6 MH/s

Hashtype: Cisco-IOS SHA256
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   772.0 MH/s

Hashtype: Cisco $8$
Workload: 1024 loops, 8 accel

Speed.GPU.#1.:    16616 H/s

Hashtype: Cisco $9$
Workload: 1 loops, 4 accel

Speed.GPU.#1.:      764 H/s

Hashtype: Juniper Netscreen/SSG (ScreenOS)
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  3149.3 MH/s

Hashtype: Juniper IVE
Workload: 1000 loops, 32 accel

Speed.GPU.#1.:  2463.4 kH/s

Hashtype: Android PIN
Workload: 1024 loops, 16 accel

Speed.GPU.#1.:  1364.8 kH/s

Hashtype: Citrix NetScaler
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1703.6 MH/s

Hashtype: RACF
Workload: 128 loops, 256 accel

Speed.GPU.#1.:   547.4 MH/s

Hashtype: GRUB 2
Workload: 1024 loops, 2 accel

Speed.GPU.#1.:     6025 H/s

Hashtype: Radmin2
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  1922.3 MH/s

Hashtype: SAP CODVN B (BCODE)
Workload: 1024 loops, 64 accel

Speed.GPU.#1.:   161.0 MH/s

Hashtype: SAP CODVN F/G (PASSCODE)
Workload: 512 loops, 32 accel

Speed.GPU.#1.: 11874.8 kH/s

Hashtype: SAP CODVN H (PWDSALTEDHASH) iSSHA-1
Workload: 1024 loops, 16 accel

Speed.GPU.#1.:  1401.9 kH/s

Hashtype: Lotus Notes/Domino 5
Workload: 128 loops, 32 accel

Speed.GPU.#1.: 59499.5 kH/s

Hashtype: Lotus Notes/Domino 6
Workload: 128 loops, 32 accel

Speed.GPU.#1.: 11415.7 kH/s

Hashtype: Lotus Notes/Domino 8
Workload: 1024 loops, 64 accel

Speed.GPU.#1.:   134.4 kH/s

Hashtype: PeopleSoft
Workload: 1024 loops, 256 accel

Speed.GPU.#1.:  2011.2 MH/s

Hashtype: 7-Zip
Workload: 1024 loops, 4 accel

Speed.GPU.#1.:     2030 H/s

Hashtype: RAR3-hp
Workload: 16384 loops, 32 accel

Speed.GPU.#1.:     5191 H/s

Hashtype: TrueCrypt 5.0+ PBKDF2-HMAC-RipeMD160 + XTS 512 bit
Workload: 1024 loops, 64 accel

Speed.GPU.#1.:    26717 H/s

Hashtype: TrueCrypt 5.0+ PBKDF2-HMAC-SHA512 + XTS 512 bit
Workload: 1000 loops, 8 accel

Speed.GPU.#1.:    65811 H/s

Hashtype: TrueCrypt 5.0+ PBKDF2-HMAC-Whirlpool + XTS 512 bit
Workload: 1000 loops, 8 accel

Speed.GPU.#1.:    12991 H/s

Hashtype: TrueCrypt 5.0+ PBKDF2-HMAC-RipeMD160 + XTS 512 bit + boot-mode
Workload: 1000 loops, 128 accel

Speed.GPU.#1.:    52890 H/s

Hashtype: Android FDE <= 4.3
Workload: 1024 loops, 32 accel

Speed.GPU.#1.:   166.9 kH/s

Hashtype: eCryptfs
Workload: 1024 loops, 8 accel

Speed.GPU.#1.:     2616 H/s

Hashtype: MS Office <= 2003 MD5 + RC4, oldoffice$0, oldoffice$1
Workload: 1024 loops, 32 accel

Speed.GPU.#1.: 14539.3 kH/s

Hashtype: MS Office <= 2003 MD5 + RC4, collision-mode #1
Workload: 1024 loops, 32 accel

Speed.GPU.#1.: 27812.3 kH/s

Hashtype: MS Office <= 2003 SHA1 + RC4, oldoffice$3, oldoffice$4
Workload: 1024 loops, 32 accel

Speed.GPU.#1.: 18776.4 kH/s

Hashtype: MS Office <= 2003 SHA1 + RC4, collision-mode #1
Workload: 1024 loops, 32 accel

Speed.GPU.#1.: 29260.3 kH/s

Hashtype: Office 2007
Workload: 1024 loops, 32 accel

Speed.GPU.#1.:    27056 H/s

Hashtype: Office 2010
Workload: 1024 loops, 32 accel

Speed.GPU.#1.:    13553 H/s

Hashtype: Office 2013
Workload: 1024 loops, 4 accel

Speed.GPU.#1.:     1512 H/s

Hashtype: PDF 1.1 - 1.3 (Acrobat 2 - 4)
Workload: 1024 loops, 32 accel
    
Speed.GPU.#1.: 28542.8 kH/s

Hashtype: PDF 1.1 - 1.3 (Acrobat 2 - 4) + collider-mode #1
Workload: 1024 loops, 32 accel

Speed.GPU.#1.: 32523.0 kH/s

Hashtype: PDF 1.4 - 1.6 (Acrobat 5 - 8)
Workload: 70 loops, 256 accel

Speed.GPU.#1.:  1330.0 kH/s

Hashtype: PDF 1.7 Level 3 (Acrobat 9)
Workload: 512 loops, 256 accel

Speed.GPU.#1.:   771.8 MH/s

Hashtype: PDF 1.7 Level 8 (Acrobat 10 - 11)
Workload: 64 loops, 8 accel

Speed.GPU.#1.:     6060 H/s

Hashtype: Password Safe v2
Workload: 1000 loops, 16 accel

Speed.GPU.#1.:    43785 H/s

Hashtype: Password Safe v3
Workload: 1024 loops, 16 accel

Speed.GPU.#1.:   351.1 kH/s

Hashtype: Lastpass
Workload: 500 loops, 64 accel

Speed.GPU.#1.:   641.4 kH/s

Hashtype: 1Password, agilekeychain
Workload: 1000 loops, 64 accel

Speed.GPU.#1.:   669.8 kH/s

Hashtype: 1Password, cloudkeychain
Workload: 1024 loops, 2 accel

Speed.GPU.#1.:     1508 H/s

Hashtype: Bitcoin/Litecoin wallet.dat
Workload: 1024 loops, 2 accel

Speed.GPU.#1.:      376 H/s

Hashtype: Blockchain, My Wallet
Workload: 10 loops, 256 accel

Speed.GPU.#1.: 11149.6 kH/s
The performance figures for the 6970 that I find around the Internet are MD5 = 5878 MH/s and WPA2 ≈ 82 kH/s; mine are MD5 = 5698 MH/s and WPA2 = 95 kH/s.
I was expecting to get at least 90% of the published speed - but the benchmark got _more_ than others have posted for WPA2, and (almost) the same for MD5. It seems the performance of oclHashcat is virtually unaffected by being virtualized.


Extras
I use Hashtopus to control this VM. To start the agent, I use this little script:
Code:
#!/bin/bash
# Set fan speed of GPU 0 to 60% (requires --gpu-temp-disable in the agent config)
aticonfig --pplib-cmd "set fanspeed 0 60"
# Start the Hashtopus agent in a detached screen session named HASHAGENT01
screen -d -m -S HASHAGENT01 mono hashtopus.exe
The agent execution command line (set in the Hashtopus agent config) must contain "--gpu-temp-disable" for the fan speed setting to work.
Here's a quick reference to the screen command: http://aperiodic.net/screen/quick_reference
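To get back into the detached agent session later, reattach it by name:
Code:
# Reattach the screen session the agent runs in
screen -r HASHAGENT01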

You may check GPU temp and fan speed this way:
Code:
aticonfig --odgt --pplib-cmd "get fanspeed 0"

Conclusion
Running oclHashcat in a VM instead of directly on hardware can be a viable way to optimize hardware usage and ease management, IF you have the knowledge and time to set it up. So far, it seems there is no impact on performance, though further testing on multiple and more recent GPUs is needed to establish this as a fact.


TL;DR:
1. Install ESXi
2. Activate PCI Passthrough, reboot
3. Install a virtual guest as usual - add open-vm-tools
4. Shut down the VM and add the GPU
5. Start VM, install oclHashcat
6. ?
7. Profit!

Try it, you'll like it! :)
#2
On Reddit and other tech forums there has actually been an explosion of people asking about virtualization and GPU passthrough, mostly due to these YouTube videos:

https://www.youtube.com/watch?v=LuJYMCbIbPk
https://www.youtube.com/watch?v=LXOaCkbt4lI
https://www.youtube.com/watch?v=opX-AsJ5Uy8
https://www.youtube.com/watch?v=uKJw8IKVYQ8
#3
Now one question remains: why sacrifice 10% performance for the offered flexibility and a 100% uptime guarantee? You can have 99.9% uptime without the effort of virtualizing. (Unless you need 1h of maintenance per day to update the hardware…)

Agility for hot-adding/removing RAM/CPU? You never need to change that for a GPU cluster.

Availability: yeah. Why the hell do I need 100% availability for a hash-cracking cluster? What is the gain? If it's 1% better uptime (doubtful), that still does not make up for the 10% performance loss.
#4
I agree, 10% less performance on a cluster dedicated to GPU hashing wouldn't make sense. I think you misinterpreted part of my post. The point was to make public my results from running oclHashcat in a VM. My finding was that using the GPU in passthrough mode probably does not affect performance, and that it is therefore a viable way to optimize hardware usage and ease management. I mentioned availability as a benefit in the explanation of why IT infrastructure in general is being virtualized.

First, my result was 2.5% less MD5 speed, compared to an unsubstantiated claim from someone using different hardware and probably a different oclHashcat version.
Second, my result was also 15% faster than the WPA2 Crackstation benchmark; as you can see, I lack proper data to compare to. However, since the result was either faster or only very slightly slower, I concluded that no, or very slight, performance impact is introduced.

To make my stance on this clearer then:
  1. If you have several nodes doing nothing but hashing, all (but perhaps one) might as well remain dedicated, because you're spending so much money on hardware that you can afford to solve your other computing needs separately.
  2. When you are evaluating new drivers/versions/OSes, you should perhaps consider using a VM, because the shortcuts available will not only save you time, they will make your work less tedious. And you could test on one GPU while the other 10 keep working.
  3. If you have a small number of nodes and would like to not have a separate set of compute nodes, installing a hypervisor could be a good idea if performance doesn't suffer; as I wrote, more testing is needed.
  4. If you're like me and have one good lab server, a limited budget, limited space and time, and just want to be able to do some hashing while using the box for other things too: this is definitely the way to go, even if performance suffers a bit.
I hope that makes it clear. But do try to find other deficiencies in my reasoning, I like arguing :)
#5
(06-21-2016, 02:59 AM)RulerOfHeck Wrote: I agree, 10% less performance on a cluster dedicated to GPU hashing wouldn't make sense. [...] I hope that makes it clear. But do try to find other deficiencies in my reasoning, I like arguing :)

Can you perhaps record a video of all your work, from A to Z, so people could see and replicate it?