Data Mining

Data mining, in this context, is the automated gathering of data from the internet. It is also commonly known as screen scraping, web scraping, or web harvesting.

Table of contents

  • Crawling vs Scraping
  • Ways it is done
  • Legalities
  • Ethics
  • Tasks
  • Finding
  • Parsing
  • Crawling

Classifying the project

  • crawling or scraping
  • broad or targeted
  • one-time or long-running
      • One-time projects produce output files in formats like CSV or JSON
      • Long-running projects involve monitoring, re-scanning for new data, updating existing data, etc.
  • data needed for simple collection or for more in-depth analysis
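For a one-time project, the end result is typically just a flat file. A minimal sketch of writing the same scraped records to both CSV and JSON (the records and field names here are hypothetical; an in-memory buffer stands in for a real output file):

```python
import csv
import io
import json

# Hypothetical records produced by a one-time scrape.
rows = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "19.99"},
]

# CSV output: DictWriter maps each dict onto the declared columns.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON output of the same records.
json_text = json.dumps(rows, indent=2)
```

In a real project you would pass an open file to `csv.DictWriter` and `json.dump` instead of building strings in memory.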

Using Python

Python has a default recursion limit (the maximum depth of nested function calls) of 1,000. Because Wikipedia’s network of links is extremely large, a recursive crawler will eventually hit that limit and crash with a RecursionError, unless you add a depth counter or rewrite the crawl iteratively.
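One way around the limit is to keep an explicit stack instead of recursing. A minimal sketch, using a hypothetical in-memory link graph in place of real pages:

```python
import sys

# The default limit the text refers to; typically 1000.
print(sys.getrecursionlimit())

# Hypothetical "site": page -> pages it links to (note the A -> C -> A cycle).
LINKS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}

def crawl(start, max_pages=100):
    """Iterative depth-first crawl. No recursion, so the recursion
    limit never applies; the visited set breaks link cycles, and
    max_pages caps the crawl on large graphs."""
    seen = set()
    stack = [start]
    order = []
    while stack and len(order) < max_pages:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        stack.extend(LINKS.get(page, []))
    return order
```

The same `seen` set that breaks cycles also prevents re-fetching pages, which matters far more than the recursion limit once you crawl real sites.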

With server-side redirects, you usually don’t have to worry. If you’re using the urllib library with Python 3.x, it handles redirects automatically. The Requests library also follows redirects by default for GET requests; the allow_redirects flag controls this behavior explicitly:

r = requests.get('http://github.com', allow_redirects=True)

Affiliates

3rd party tools & APIs

Example sites to scrape