Passwords from E-books
#1
Despite the numerous wordlists that can be found on the Internet, it seems that no one has gone to the effort of converting E-book archives to plain text and parsing out the data in various ways.  Such data could produce a new genre of dictionary files for use with hashcat (or any similar program).

There would be many challenges for such a project.  Here are a few.

1. The required disk storage.  Some sites that I have seen hold close to 80 GB (or more) of E-books for just one genre.  Several terabytes could be necessary for someone who wants to work with a large data set.

2. Where to get the data.  There are numerous websites that have E-books available.  Obtaining that content is a task in itself.

3. Converting the available formats.  EPUB and PDF are common, but working with the data effectively requires plain text.  Calibre seems to be the most obvious choice for converting to plain text, but there may be other options that I don't know about.
https://calibre-ebook.com/download

4. How to parse the data.  Just using all the words that are present in the text is not adequate and completely misses the point of getting all of this content in plain-text format.  Pulling out phrases and sentences, delimited by commas, periods and double quotes would yield candidates that could be useful.  Having sentences or phrases with the spaces removed or the spaces replaced with other characters also seems worthwhile.  Also, one of the methods that I once saw on YouTube for choosing a "secure" password was to take a sentence from your favorite book and use the first letters of every word in that sentence as a password.

5. Sentence length.  Entire sentences from books would be mostly useless in a lot of cases due to the password length limits of several hash types in hashcat.  I haven't investigated whether those limits also apply in hashcat's alternative, John the Ripper, but for several of the common fast hashes (MD5 and SHA1) full sentences would be rejected due to length.
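
To make points 4 and 5 concrete, here is a rough Python sketch of the parsing step. It is only an illustration, not a tested pipeline: the function names, the separator choices and the MAX_LEN value are all my own assumptions, and the real length limit depends on the hash mode and kernel you use.

```python
import re

# Assumed placeholder limit; check the real limit for your hash mode and kernel.
MAX_LEN = 31


def split_phrases(text):
    """Split text into phrases on commas, periods and double quotes."""
    parts = re.split(r'[,."\u201c\u201d]+', text)
    # Collapse runs of whitespace and drop empty fragments.
    phrases = (" ".join(p.split()) for p in parts)
    return [p for p in phrases if p]


def variants(phrase, separators=("", "_", "-")):
    """Yield the phrase, versions with spaces replaced, and its first letters."""
    words = phrase.split()
    yield phrase
    for sep in separators:
        yield sep.join(words)
    if len(words) > 1:
        # "First letter of every word" scheme from point 4.
        yield "".join(w[0] for w in words)


def candidates(text, max_len=MAX_LEN):
    """Return unique candidates no longer than max_len, in first-seen order."""
    seen = set()
    out = []
    for phrase in split_phrases(text):
        for cand in variants(phrase):
            if len(cand) <= max_len and cand not in seen:
                seen.add(cand)
                out.append(cand)
    return out
```

Writing the result of candidates() to a file, one entry per line, would give a wordlist in the usual format.  Filtering by length up front keeps entries that would be rejected anyway out of the file.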

Has anyone here ever bothered with this?  It seems like a lot of work with not a lot of reward.
#2
Thanks mate for the info. :)
#3
I've been exploring this for a while, and the bigger picture is anything written or spoken that's "cute" enough to be a pass-phrase. There are a lot of articles about this out there, e.g. talking about gathering books from Project Gutenberg, movie quotes from IMDb, everything on Wikipedia, etc.

Even if one has the mechanics to do all this, what is missing (at least for me) is a way of ranking the lines or quotes in terms of how likely they are to actually be used.

We'd need a sort of "PACK" tool, but for phrases. Imagine a tool that analyzes a cracked list and says this one is from Moby Dick, while that one is from the latest movie at the box office, and another is from Star Trek...
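
A very rough Python sketch of what the attribution half of such a tool might look like. Everything here is hypothetical: the function names are made up, `sources` is assumed to be a dict of {title: [phrase, ...]} that you have already built, and it only catches exact matches after stripping case, spaces and punctuation. It counts hits per source; it does not solve the ranking problem.

```python
import re
from collections import Counter


def normalize(s):
    """Lowercase and strip everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


def build_index(sources):
    """Map each normalized phrase to the first title it was seen in.

    `sources` is a hypothetical dict of {title: [phrase, ...]}.
    """
    index = {}
    for title, phrases in sources.items():
        for phrase in phrases:
            key = normalize(phrase)
            if key:  # skip phrases that normalize to nothing
                index.setdefault(key, title)
    return index


def attribute(cracked, index):
    """Count how many cracked passwords match a phrase from each title."""
    hits = Counter()
    for pw in cracked:
        title = index.get(normalize(pw))
        if title is not None:
            hits[title] += 1
    return hits
```

Run against a cracked list, the per-title counts could at least hint at which sources are worth expanding first.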

Meanwhile, some people just string words together that may be nonsense (to anyone but them).