Dump Scraper
#1
As you already know, the Internet is full of passwords (plain and hashed ones): when a leak occurs, it's usually posted to PasteBin.
The pace of these dumps is so high that it's not humanly possible to collect them all, so we have to rely on a bot that scrapes the PasteBin site for interesting files.

Dump Monitor does exactly this: every time leaked information is posted on PasteBin, it tweets the link.

Sadly, Dump Monitor is not very precise: inside its tweets you will find a lot of false positives (debug data, log files, antivirus scan results) or stuff we're not interested in (RSA private keys, API keys, lists of email addresses).

Moreover, once you have the raw data, you still need to extract the useful information and remove all the garbage.

That's why Dump Scraper was born: inside this repository you will find several scripts to fetch the latest tweets from Dump Monitor, analyze the linked files (discarding the useless ones), and extract the hashes or passwords.
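
Roughly, the flow looks something like this (the wiki has the exact steps):

Code:
# fetch the latest Dump Monitor tweets and download the linked pastes
php scrape.php
# classify the downloaded files, discarding the useless ones
python classify.py
# finally, run the extraction scripts on what's left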

https://github.com/tampe125/dump-scraper/releases

Please remember to read the wiki before continuing:
https://github.com/tampe125/dump-scraper/wiki

Finally, this is a super-alpha release, so things may be broken or not work as expected. Moreover, I know it's kind of hackish: a single program with a GUI would be 100 times better. Sadly, I'm running out of time and I don't know anything about Python GUI development: if anyone wants to contribute, it would be more than welcome!

Please leave here your thoughts and opinions.
#2
Many thanks!
#3
Can't get it to work on Ubuntu. I filled in the Twitter auth keys, renamed settings-dist.json, and installed the dependencies.
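
For reference, the renamed settings file looks something like this (keys redacted, and I'm not sure these are the exact field names, check settings-dist.json for the real ones):

Code:
{
    "twitter": {
        "consumer_key": "xxxxx",
        "consumer_secret": "xxxxx",
        "access_token": "xxxxx",
        "access_token_secret": "xxxxx"
    }
}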

PHP 5.5.22-1+deb.sury.org~precise+1 | Python 2.7.3

php scrape.php

PHP Warning: require_once(vendor/autoload.php): failed to open stream: No such file or directory in /home/xxxxx/dump-scraper/scrape.php on line 8
PHP Fatal error: require_once(): Failed opening required 'vendor/autoload.php' (include_path='.:/usr/share/php:/usr/share/pear') in /home/xxxxx/dump-scraper/scrape.php on line 8
#4
ah crap, I forgot to put that in the wiki!
You have to get composer (https://getcomposer.org/download/) and run:

php composer.phar install

That installs the PHP dependencies into vendor/, which creates the vendor/autoload.php file the script is complaining about.

Sigh, that's the risk of always working in a dev environment... Don't worry though: if everything goes smoothly, I think I'll release a new Python-only version with a single entry point.
#5
That did the trick! Thanks!
#6
One more problem.

It doesn't seem to create the data folder after processing the tweets with "php scrape.php", and it's also displaying PHP notices in the terminal.

Code:
    processed 2000 tweets
    Found 0 removed tweets in this batch
PHP Notice:  Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 97

Notice: Trying to get property of non-object in /home/xxxxxx/dump-scraper/scrape.php on line 97
PHP Notice:  Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 100

Notice: Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 100
PHP Notice:  Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 103

Notice: Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 103
PHP Notice:  Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 103

Notice: Trying to get property of non-object in /home/xxxxx/dump-scraper/scrape.php on line 103

    processed 2001 tweets
    Found 0 removed tweets in this batch

Total processed tweets: 2001
#7
Ignore the notices: it seems the tweet doesn't have any data (I'll add a check for it).
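Something like this is what I have in mind (the variable and property names here are just illustrative, the real ones are whatever scrape.php uses around lines 97-103):

Code:
foreach ($tweets as $rawTweet) {
    $tweet = json_decode($rawTweet);

    // Skip tweets that didn't decode to a usable object, instead of
    // blindly reading their properties (which is what triggers the
    // "Trying to get property of non-object" notices)
    if (!is_object($tweet) || !isset($tweet->text)) {
        continue;
    }

    // ... existing processing goes here
}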
Please manually create the folder data/raw

Tomorrow I'll release a new version addressing these issues...
#8
Got everything working up until "python classify.py".

Running Ubuntu 14.04, Python 2.7, scipy 0.13.3, sklearn 0.15.2.

Error: http://pastebin.com/e2QMSmKs
#9
Can you please post the training/features.csv file? I think there are some invalid values inside it.
You can upload it to PasteBin and put the link here.

Thank you very much!
#10
After you mentioned that the training CSV had invalid information, I looked at the wiki again and noticed that the training folder structure used "train" instead of the more logical "trash". I had a hunch this was a typo, so I made the adjustment, and everything works fine now. Thanks!
http://prntscr.com/6j5lck