Data Mining

Data mining, in this context, is the automated gathering of data from the internet. It is also commonly known as screen scraping, web scraping, or web harvesting.

Table of contents

  • Crawling vs Scraping
  • Ways it is done
  • Legalities
  • Ethics
  • Tasks
  • Finding
  • Parsing
  • Crawling

Classifying the project

  • crawling or scraping
  • broad or targeted
  • one-time or long-running
      • One-time projects produce output files in formats like CSV or JSON
      • Long-running projects involve monitoring, re-scanning for new data, updating existing data, etc.
  • data needed for simple collection or for more in-depth analysis
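For a one-time project, the end result is typically just a flat file. A minimal sketch of writing the same scraped records to both CSV and JSON (the records and field names here are hypothetical; an in-memory buffer stands in for a real output file):

```python
import csv
import io
import json

# Hypothetical records produced by a one-time scrape.
rows = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "19.99"},
]

# CSV output: DictWriter maps each dict onto the declared columns.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON output of the same records.
json_text = json.dumps(rows, indent=2)
```

In a real project you would pass an open file to `csv.DictWriter` and `json.dump` instead of building strings in memory.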

Using Python

Python has a default recursion limit (the maximum depth of nested function calls) of 1,000. Because Wikipedia’s network of links is extremely large, a recursive crawler will eventually hit that limit and crash with a RecursionError, unless you add a depth counter or rewrite the crawl iteratively.
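One way around the limit is to keep an explicit stack instead of recursing. A minimal sketch, using a hypothetical in-memory link graph in place of real pages:

```python
import sys

# The default limit the text refers to; typically 1000.
print(sys.getrecursionlimit())

# Hypothetical "site": page -> pages it links to (note the A -> C -> A cycle).
LINKS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}

def crawl(start, max_pages=100):
    """Iterative depth-first crawl. No recursion, so the recursion
    limit never applies; the visited set breaks link cycles, and
    max_pages caps the crawl on large graphs."""
    seen = set()
    stack = [start]
    order = []
    while stack and len(order) < max_pages:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        stack.extend(LINKS.get(page, []))
    return order
```

The same `seen` set that breaks cycles also prevents re-fetching pages, which matters far more than the recursion limit once you crawl real sites.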

With server-side redirects, you usually don’t have to worry. If you’re using the urllib library with Python 3.x, it handles redirects automatically. The Requests library also follows redirects by default for GET requests; the allow_redirects flag controls this behavior explicitly:

r = requests.get('http://github.com', allow_redirects=True)

Affiliates

3rd party tools & APIs

Example sites to scrape