Adding PCFGs to Hashcat's Brain
#1
I'm very excited about Hashcat's Brain server and have started to look into if it would be possible to incorporate a PCFG guesser into it. I don't think I'll actually be doing any coding until January, but I wanted to write down some notes and get people's input into it.

---Random Notes---

Current PCFG code: https://github.com/lakiw/pcfg_cracker

Quick description: PCFG stands for probabilistic context free grammar. In a nutshell what it does is create a model of how people create passwords, assigns different passwords a probability of occurring according to that model, and then generates password guesses in probability order.

How Brain could help: The current PCFG implementation has a lot of overhead in generating guesses because it attempts to generate the guesses in probability order. That's a nicer way of saying it is really slow. That being said, a PCFG based attack could be sped up dramatically, (perhaps on the order of Markov attacks), if you didn't care about generating guesses in probability order and instead took another approach like generating all guesses above a certain probability threshold. There's also exists the ability to distribute the work fairly easily. Having a Brain server help keep track of sessions could really help with this.

Current PCFG password cracking grammar:

Base Structures: This looks a lot like a hashcat mask if you squint, and it contains the basic structure of a password guess. The current fields are:

A=Alpha characters/Letters
D=Digits
O=Other/Special Characters
K=Keyboard Combos
X=conteXt sensitive replacements (such as <3, #1...)
M=Markov generated string, (currently adding OMEN support right now)

A typical base structure would look like:
Password123! = A8D3O1

Notice that the base structure does not cover capitalization or l33t mangling. That comes later...

There are also pre-terminal replacements. These are replacements for each of the variables in the base structure. This is also where transforms like case mangling happen. So you might have transforms such as:

A8 -> ULLLLLL -> Password

In the above example, the case mangling is applied as a mask that is then applied to the dictionary word.

What adds the "P" to PCFGs is that every transform has a probability associated with it. So the Base Structure has a probability, the Alpha capitailation mask has a probability, the word "password" has a probability, and the digits and special characters at the end have a probability. The final probability of a generated guess is the combined probability of all the transforms it took to generate it.

What this means from a Brain perspective is that you can describe a cracking session by a starting and ending probability for a given grammar in the current PCFG implementation. If you don't care about generating guesses in probability order, you could further split things up by describing attacks as the base structure, the starting probability, and the ending probability. Therefore you could distribute the work much like you can currently do it with traditional Mask attacks.

---Getting PCFGs into Brain---
I started looking at the brain code, and it appears the above saving and checking previously run PCFG attacks could be implemented in the brain_compute_attack function. That's likely the easy part. The hard part would be getting a PCFG style attack to run in Hashcat. Version 2 of the PCFG cracker was actually written in C++, but the current version 3 is a Python3 script. I'm starting a full re-write of the PCFG cracker though, so this is the time to bite the bullet and maybe go back to C++. My first goal is to re-write the training program which will likely be done in the next couple of months, which is why I expect to pick this up with more haste once the new year rolls around.
#2
Quote:I'm very excited about Hashcat's Brain server and have started to look into if it would be possible to incorporate a PCFG guesser into it.

There's a general misunderstanding here. There's no need to add code to the brain if you're adding a new generator to hashcat. What we need to do is to add PCFG to the slow_candidates. In that way it becomes useable directly from hashcat and also for overlays like hashtopolis etc. Beside of that, you can use hashcat in --stdout mode, which then simply runs PCFG through the brain and then outputs the candidates to console.

Quote:That's a nicer way of saying it is really slow.

Decision to call this is more like a technical reason. First Idea was to call it on-host genator but there was no shortcut left for -o and -O because both were in use. This story continued with other Idea and the only one left was -S.


Quote:A=Alpha characters/Letters
D=Digits
O=Other/Special Characters
K=Keyboard Combos
X=conteXt sensitive replacements (such as <3, #1...)
M=Markov generated string, (currently adding OMEN support right now)

A typical base structure would look like:
Password123! = A8D3O1

Thanks for the explanation. That sounds like a good plan actually. First thing that pops into my mind is it would be ideal to allow multiple of such "masks" to be executed sequentially. And then of course some sort of reader/trainer to create such masks and eventually order them by probability.



Good thing is, to get this flying, we can distribute the effort. 
  • First thing is that PCFG needs to provide five functions for hashcats slow candidate interface to attach it to hashcat. This would be your part.
  • If you can write PCFG so that it provides these functions I'll embedd this into hashcat. There's no need for you to learn about all the internal structures. I'll extend the slow_candidates.c, add the PCFG commandline options and checks and setup the needed structures. I mean of course if you want to we can do this, but I think there's nothing wrong with teaming up and/or work separation.

The five functions in general are the following (there's no template yet so I'm more like thinking loud here):
  • sc_pcfg_init - A function which resets all internal structures of the generator as it would be started freshly from the commandline. It will also provide the mandatory and optional parameters a user can specify in a struct. It will return a context to work with. The context enables multi threading functionality.
  • sc_pcfg_keyspace - A function which simply returns the total number of candidates which the generator will create based on the parameter configuration. If the total number is unknown this has some disadvantages. For instance, the ETA can not be computed or it may not be possible to distribute it via hashtopolis. In this case return (u64) -1 and hashcat will assume the generator will give a negative returncode in the seek/next function (explained next).
  • sc_pcfg_seek - Seek to a specific candidate position. This is mandatory, the parameter will be just a number. Will also have a returncode if there's no such position
  • sc_pcfg_next - Output the next password candidate (based on the context)
  • sc_pcfg_shutdown - A cleanup function

If you can agree to this, I'll formalize the structure in a C header and you can extend it with PCFG parameters you need.

Ideally you can provide some sort of C code which I can simply include to slow_candidates.c then we can get this flying really fast. An alternative would be a .so and/or .dll library which I can hook up and which makes you free in the language you want to use.

- atom
#3
From what I know, PCFG-based algorithm can hardly provide functions like sc_pcfg_keyspace and sc_pcfg_seek as it has only a partial view of the ongoing candidates. Correct me if I'm wrong.

The fact that sc_pcfg_seek is mandatory sounds a bit too much requirement from my point of view. It means that only algorithms with an index function* (like Markov-based one from Arvind Narayanan) could be implemented. I think algorithms without index function are still worth them, even if you can't "resume" them or distribute their work. Of course you can bypass that lack of index function, but it would require to actually generate candidates without using them so there is no much interest to do that because the main bottleneck of such algorithms is the candidate generation, not candidate handling (memory copying, output, ...).

* By index function, I mean a function which takes a position integer N in input, and returns (or seek to) the candidate number N.
#4
(11-09-2018, 10:38 AM)atom Wrote: The five functions in general are the following (there's no template yet so I'm more like thinking loud here):
  • sc_pcfg_init - A function which resets all internal structures of the generator as it would be started freshly from the commandline. It will also provide the mandatory and optional parameters a user can specify in a struct. It will return a context to work with. The context enables multi threading functionality.
  • sc_pcfg_keyspace - A function which simply returns the total number of candidates which the generator will create based on the parameter configuration. If the total number is unknown this has some disadvantages. For instance, the ETA can not be computed or it may not be possible to distribute it via hashtopolis. In this case return (u64) -1 and hashcat will assume the generator will give a negative returncode in the seek/next function (explained next).
  • sc_pcfg_seek - Seek to a specific candidate position. This is mandatory, the parameter will be just a number. Will also have a returncode if there's no such position
  • sc_pcfg_next - Output the next password candidate (based on the context)
  • sc_pcfg_shutdown - A cleanup function

If you can agree to this, I'll formalize the structure in a C header and you can extend it with PCFG parameters you need.

- atom

Those five functions sound very reasonable and I understand your need for them. "seek" and "keyspace" will require a dramatically different way of generating guesses from a grammar that will mean that guesses are no longer generated in probability order. That's not a showstopper but I just want to highlight there will be trade-offs. 

Long story short, I'll need some time to think about how to create an efficient index function. Providing a generic template using the above functions would probably help with integrating other guess generators though so designing a repeatable manner to provide them will likely help for adding other non-pcfg attack modes in the future. 

As dumb as it sounds, one hesitation I have is for researchers thinking the "index compatible" version of PCFGs is the same as a probability order guess generator. By definition, if we don't use probability order for generating guesses, the resulting guess generator will be less precise/accurate. Maybe we could come up with a new name for the algorithm we use in Hashcat. I'm open to suggestions!

Quote:There's a general misunderstanding here. There's no need to add code to the brain if you're adding a new generator to hashcat.


As crazy as it sounds, I'm mulling adding support for using a hashcat brain server into the python pcfg cracker. One thing I've struggled with in the past is once you start a cracking session with the current PCFG, you are kind of locked in to running it to completion. Having the ability to not re-hash previously made guesses or attacks would be really useful.

One step at a time though. I'll start looking into different ways to provide those five functions to integrate PCFGs into Hashcat's slow guess generator. As Matlink mentioned there may also be use in adding support for guess generators to Hashcat that don't have an index function. As he pointed out, if you want to generate guesses in probability order with PCFGs, there's no way I can think of to provide a practical index function.
#5
OK, I can see where this is going. It's the reason why hashcat generators are exactly how they are. There's simply no way to seek in such probabilistic systems. I expected that actually, even I hoped that you solved that problem somehow.

Therefore I'll mark the seek function as optional. In case of PCFG, I'll simply call the sc_pcfg_next() that many times I need to seek, simply because there's no way around this "feature" in a distributed environment (not even in just a multi-gpu environment).

This will create a bottleneck but I don't see a different solution to this.
#6
(11-09-2018, 07:57 PM)atom Wrote: Therefore I'll mark the seek function as optional. In case of PCFG, I'll simply call the sc_pcfg_next() that many times I need to seek, simply because there's no way around this "feature" in a distributed environment (not even in just a multi-gpu environment).

This will create a bottleneck but I don't see a different solution to this.

I apologize since I don't think I was clear

1) Just like there are many types of Markov attacks, (Hashcat Markov, JtR/Narayanan Markov, JtR Incremental, OMEN, etc), there are many ways to generate guesses with PCFGs. These different ways have pluses and minuses.

2) There is no single "best" way to generate guesses with PCFGs

3) In my previous guess generators, I prioritized probability order since I was approaching it mostly from a research/academic standpoint. I wanted to have the most "accurate" password guess generator, at the expense of many other things such as performance. With this approach there is no good index function.

4) If you don't care about probability order, there are other methods that have indexing functions of various speeds. Heck, I wrote a proof of concept PCFG rainbow table back in the day when rainbow tables were still relevant. What this means is that a PCFG based approach could support sc_pcfg_seek

5) I need time to think about the best way to go about designing a version of PCFG that supports indexing and sc_pcfg_seek. It's not that it can't be done. It's more like there's a lot of ways it *can* be done and I'd like to weight the trade-offs between them. I'm sure I'll be asking for feedback/suggestions/help/advice on this as well since as I said, I mostly come from an academic background, so I value input from practitioners and the wider pw community.

6) I'd also like a new name for this approach so that it doesn't get confused with the probability order version.

7) If there is yet another version of an interface of hashcat's slow guess framework that doesn't require a "seek" function, I think that would be valuable for future work. My gut feeling though is that the Hashcat user-base would be more interested in an index version that has better support for restarting sessions, is faster with less memory usage, etc, even though it is significantly less accurate/precise. What that means is I'll probably focus on providing you a version of a pcfg guess generator that supports the five functions you requested.

8) Once again to set expectations, I probably will not start work on coding this until January. Talking about it is a lot easier than doing it ;p

9) To further set expectationsI'm a slow coder who writes slow code as can be seen by the current PCFG development history. 

On a side note:

Quote:First thing that pops into my mind is it would be ideal to allow multiple of such "masks" to be executed sequentially. And then of course some sort of reader/trainer to create such masks and eventually order them by probability.


That feature actually existed in the version 2 of the PCFG guess generator, and is also roughly how the CMU version of the PCFG cracker works too. I never ported it to version 3, but that approach is certainly viable. There's other optimizations you can take to that approach as well such as increasing "smoothing" of the training data so that each base structure/mask would generate more guesses, thus reducing the on-disk space required. As I said earlier, there's lots of ways to go about this and each way has it's own set of trade-offs that need to be balanced Smile
#7
Quote:1) Just like there are many types of Markov attacks, (Hashcat Markov, JtR/Narayanan Markov, JtR Incremental, OMEN, etc), there are many ways to generate guesses with PCFGs. These different ways have pluses and minuses.

2) There is no single "best" way to generate guesses with PCFGs

3) In my previous guess generators, I prioritized probability order since I was approaching it mostly from a research/academic standpoint. I wanted to have the most "accurate" password guess generator, at the expense of many other things such as performance. With this approach there is no good index function.

I don't see a reason why not to stick to prioritized probability. That's what we want actually. Again I understood that you can not have a index function, but by calling the next function N times, that is a way to seek. It's slow, but it will work.

Quote:4) If you don't care about probability order, there are other methods that have indexing functions of various speeds. Heck, I wrote a proof of concept PCFG rainbow table back in the day when rainbow tables were still relevant. What this means is that a PCFG based approach could support sc_pcfg_seek

I care, that's the most important part.

Quote:6) I'd also like a new name for this approach so that it doesn't get confused with the probability order version.

Sure, but keep in mind that this is not what we're looking for.

Quote:7) If there is yet another version of an interface of hashcat's slow guess framework that doesn't require a "seek" function, I think that would be valuable for future work. My gut feeling though is that the Hashcat user-base would be more interested in an index version that has better support for restarting sessions, is faster with less memory usage, etc, even though it is significantly less accurate/precise. What that means is I'll probably focus on providing you a version of a pcfg guess generator that supports the five functions you requested.

Maybe I'll add princeprocessor as an example.
#8
(11-11-2018, 03:01 PM)atom Wrote: I don't see a reason why not to stick to prioritized probability. That's what we want actually. Again I understood that you can not have a index function, but by calling the next function N times, that is a way to seek. It's slow, but it will work.

Understood. Just trying to give you options Smile

My current plan is to finish up the non-hashcat version 4.0 of the PCFG code. There's a couple of features that I really want to see how they work in practice, most notably OMEN integration, multi-words, and L33tspeek letter replacement. There's also support for other cracking techniques like optimized wordlists for PRINCE and advanced Masks that I think might be interesting. I'd like to see how they work/don't work before I start writing the Hashcat code. My gut feeling is we might want to leave out OMEN integration at least for the first version of the hashcat pcfg cracker since that adds a lot of complexity. Perhaps OMEN could be added as a unique slow attack mode, and then afterwards I could add it back in using that code to the PCFG cracking sessions too?

For Hashcat, I'm probably going to revisit the version 2.0 code which was in C++. That code is currently in the pcfg github repo under: https://github.com/lakiw/pcfg_cracker/tr...fg_cracker

There's been a lot of improvements made in the grammar since the 2.0 code, and the 2.0 code got a bit brittle/broken when I tried to add passphrase cracking into it which is part of what prompted me to rewrite everything in Python. That's another way of saying I will take pieces of that but plan on rewriting most of it, but it at least can provide some general info if anyone else is interested in this.

Referring back to the five functions to support, sc_pcfg_keyspace.  I can calculate a keyspace for a PCFG. The question is for non-trivial grammars trained on standard sized password sets, the keyspace that is covered by the PCFG is not something that will ever be finished even if it was a fast guess generator. Would that still be useful to have?
#9
Happy New Year!

I don't expect to make fast progress, but as I said earlier I have some time to start looking at adding a PCFG attack as a slow candidate guess mode.

One thing that would really help me is a "hello world" equivalent example that I could start building around. Even just a pointer about where to start hooking into, what I should do to help the build process, etc, would be helpful. As I said, I expect this to be a long process on my end so there is no hurry if this doesn't exist already. 

Thanks!