11-23-2011, 03:08 AM
(This post was last modified: 11-23-2011, 03:20 AM by Kgx Pnqvhm.)
This item was in my mailbox today:
It is now possible to download large amounts of "n-grams" data from the COCA corpus for offline use from http://www.ngrams.info. This is in addition to the data on the top 500,000 word forms and the top 5,000 lemmas in COCA, which has been available for free from http://www.wordfrequency.info for the past few months.
Starting today, registered users can freely download large n-gram datasets, which contain the frequency of the one million most frequent 2, 3, 4, and 5-word sequences in the corpus, and then use this data offline for research and teaching. Other versions of the n-grams datasets allow users to download tens and even hundreds of millions of rows of data.
In addition to the COCA data, starting today you can also download n-grams data from the 400 million word Corpus of Historical American English (COHA). This data allows you to search offline to see the frequency of every word, and every 2, 3, 4, and 5-gram that occurs at least three times in the corpus, along with its frequency in each of the 20 decades (1810s-2000s).
For more information on this new n-grams data, please see http://www.ngrams.info.
Best,
Mark Davies
Brigham Young University
(11-22-2011, 05:33 AM)atom Wrote: i dont think using "base-words" is a good idea in password cracking.
My idea about corpora as a starting point for word lists was a response to the point made that most of the so-called word lists on the Internet are basically garbage. Using them as input lists for mangling is a big waste of time.
In the big picture, many types of lists and analyses are needed: leaked passwords, already cracked passwords, tools that build a target-specific word list, etc.
But a good list of real words is a good starting point for applying good mangling rules.
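To make the idea concrete, here is a rough Python sketch of what I mean by "real words first, mangling second". Everything in it is an assumption for illustration: the input is taken to be tab-separated "count, word" rows (the actual download format may differ), and the function names and the handful of mangling rules are made up by me, not the rule syntax of any cracking tool.

```python
from typing import Iterable, Iterator, List

# Hypothetical leetspeak substitutions, chosen only for illustration.
LEET = str.maketrans("aeos", "4305")


def top_words(lines: Iterable[str], limit: int) -> List[str]:
    """Parse assumed 'count<TAB>word' rows and return the most frequent words."""
    entries = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip header rows and anything malformed
        entries.append((int(parts[0]), parts[1].lower()))
    entries.sort(key=lambda entry: -entry[0])  # highest count first

    seen = set()
    result: List[str] = []
    for _, word in entries:
        if len(result) >= limit:
            break
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


def mangle(word: str) -> Iterator[str]:
    """Yield a few illustrative variants of one base word."""
    yield word
    yield word.capitalize()
    yield word + "1"
    yield word.capitalize() + "123"
    yield word.translate(LEET)


def candidates(lines: Iterable[str], limit: int) -> Iterator[str]:
    """Mangle the top `limit` words, skipping variants already produced."""
    emitted = set()
    for word in top_words(lines, limit):
        for variant in mangle(word):
            if variant not in emitted:
                emitted.add(variant)
                yield variant


if __name__ == "__main__":
    sample = ["freq\tword", "50\tthe", "10\tpassword", "30\tlove"]
    for candidate in candidates(sample, limit=2):
        print(candidate)
```

The point of sorting by corpus frequency is that the most common real words get mangled first, so the candidate list starts with the likeliest guesses instead of the junk found in most downloaded lists.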