Is anyone mining IMDb?
#1
At the Passwords 12 conference, S. Raveau explained how he mined the Wikimedia sites to create his famous wikipedia-wordlist-sraveau.

And at Passwords 13, IT3700 & joshdustin presented Password Cracking, From "abc123" to "thereisnofatebutwhatwemake".

The academics have:
The Corpus of Global Web-Based English (GloWbE) is composed of 1.9 billion words from 1.8 million web pages in 20 different English-speaking countries. The corpus was created by Mark Davies of Brigham Young University, and it was released in April 2013.

It seems that IMDb would be a good source to mine, but they seem to frown on that:
You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website.

But is anybody doing any password/passphrase oriented mining, anyway?

Update: I just noticed the article How the Bible and YouTube are fueling the next frontier of password cracking at
http://arstechnica.com/security/2013/10/...-cracking/
#2
I never did any mining on this site, but maybe this data is useful to you.

Googled: imdb database downloads

top link

Quote: A subset of the IMDb plain text data files is available from our FTP sites
...

Please refer to the copyright/license information listed in each file for instructions on allowed usage. The data is NOT FREE although it may be used for free in specific circumstances.
#3
I guess their restriction is why nobody seems to be saying they are doing anything with it:

The data can only be used for personal and non-commercial use and must not be altered/republished/resold/repurposed to create any kind of online/offline database of movie information.

Maybe the data is already being used, but not mentioned as such.
#4
To do something similar, but on a smaller scale (hard disk space, bandwidth, time, etc.), I'm going to start with three of the Wikimedia projects: (1) Wikipedia, (2) Wiktionary, and (3) Wikiquote, since they share a common method of downloading and parsing files. Specifically, I'd take just the titles from Wikipedia and Wiktionary, and just the quotes from Wikiquote.

That should have a high signal-to-noise ratio and not require downloading many big files.

The process would need to be done periodically, to get the benefit of "trending" topics being added. Anything relevant on the web will tend to wind up in the Wikimedia universe, eventually.
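
As a rough illustration of the title-parsing step, here is a minimal Python sketch. It assumes the titles arrive as a gzip-compressed text file with one title per line, underscores in place of spaces, and a "page_title" header row; that layout is my assumption about the dump format, so check it against the actual files. The function names, the min_len cutoff, and the choice to emit a lowercase space-free variant are all hypothetical choices for illustration, not anyone's established method.

```python
import gzip
from typing import Iterable, Iterator


def titles_to_candidates(lines: Iterable[str], min_len: int = 4) -> Iterator[str]:
    """Yield de-duplicated wordlist candidates from dump-style title lines.

    Assumes one title per line with underscores for spaces (an assumption
    about the dump layout, not a documented guarantee).
    """
    seen = set()
    for raw in lines:
        title = raw.strip()
        # Skip blank lines and the assumed "page_title" header row.
        if not title or title == "page_title":
            continue
        phrase = title.replace("_", " ")
        # Emit the spaced phrase plus a lowercase, space-free passphrase variant.
        for cand in (phrase, phrase.replace(" ", "").lower()):
            if len(cand) >= min_len and cand not in seen:
                seen.add(cand)
                yield cand


def read_titles(path: str) -> Iterator[str]:
    """Stream lines from a gzip-compressed titles file (the path is hypothetical)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from fh
```

Piping read_titles(...) into titles_to_candidates(...) and writing each result to a file gives a first-pass wordlist. The in-memory set is fine for a sketch, but for a full Wikipedia titles file it could get large, so an external sort and uniq pass may be the better choice there.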
#5
Some nice pre-processed lists built from sites like these can be found at:

http://human0id.net/
http://human0id.net/dicts/
#6
This is cool stuff, thanks.
#7
Well, it was your re-tweets that brought that site to my attention.
#8
(06-26-2014, 10:02 PM)Kgx Pnqvhm Wrote: Well, it was your re-tweets that brought that site to my attention.

I've scraped IMDb with PowerShell before (although not for password-cracking reasons). I was trying to correlate IMDb ratings with what's on my OnDemand list. Anyway, my spider wasn't blocked at all.