10-23-2013, 12:22 AM
To do something similar, but on a smaller scale (hard disk space, bandwidth, time, etc.), I'm going to start with three of the Wikimedia projects: (1) Wikipedia, (2) Wiktionary, (3) Wikiquote, since they share a common method for downloading and parsing files. Specifically, I'd take just the titles from Wikipedia and Wiktionary, and just the quotes from Wikiquote.
That should give a high signal-to-noise ratio without needing to download many big files.
The process would need to be repeated periodically to pick up newly added "trending" topics. Anything relevant on the web tends to wind up in the Wikimedia universe eventually.
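A rough sketch of the title-only part of this, in Python. Everything named here is my own assumption rather than a settled design: the helper names (titles_url, read_titles, new_titles) are made up for illustration, and the URL pattern assumes the per-wiki "all-titles-in-ns0" file that the Wikimedia dump server publishes, with one title per line, underscores in place of spaces, and a "page_title" header row. Check the dump listing before relying on any of that. Extracting quotes from Wikiquote would need the full page dumps and a wikitext parser, which this sketch does not attempt.

```python
import gzip
import io

# Assumed base URL and file naming of the Wikimedia dump server.
DUMP_BASE = "https://dumps.wikimedia.org"


def titles_url(wiki):
    """Build the assumed URL of the main-namespace title list.

    `wiki` is a database name such as "enwiki" or "enwiktionary".
    """
    return f"{DUMP_BASE}/{wiki}/latest/{wiki}-latest-all-titles-in-ns0.gz"


def read_titles(gz_stream):
    """Yield titles from a gzip-compressed, one-title-per-line binary stream."""
    with gzip.open(gz_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            title = line.rstrip("\n")
            # Skip blank lines and the assumed "page_title" header row.
            if not title or title == "page_title":
                continue
            # Dump titles use underscores where the page title has spaces.
            yield title.replace("_", " ")


def new_titles(previous, current):
    """Return titles present in the current run but not the previous one."""
    return sorted(set(current) - set(previous))


if __name__ == "__main__":
    # In-memory stand-in for a downloaded file, so the sketch runs offline.
    sample = gzip.compress("page_title\nAlan_Turing\nTuring_machine\n".encode("utf-8"))
    for title in read_titles(io.BytesIO(sample)):
        print(title)
```

For a real run, the stream passed to read_titles could be the response object from urllib.request.urlopen(titles_url("enwiki")), so the file is decompressed as it downloads instead of being stored first. Keeping the previous run's titles around and diffing with new_titles is one simple way to see what has been added between periodic runs.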