Setting up a web scraping system

Discussion in 'Web Design and Development' started by Utchin, Dec 28, 2013.

  1. Utchin macrumors newbie

    Joined:
    Jan 10, 2011
    #1
    I want to set up a website that will scrape a number of websites, starting with one but with the option to expand to 5+. The first website has around 20,000 pages, and I have a sitemap of the pages I want to scrape.

    I have my PHP code to scrape and do all the bits it needs to do. The question is, what is the best implementation for this?

    I was looking at running the script on an Amazon EC2 instance as a cron job, but a cron job seems rather old and outdated?

    Is anyone able to advise on a better method?
     
  2. gnasher729 macrumors P6

    gnasher729

    Joined:
    Nov 25, 2005
    #2
    Ask a lawyer first. And seriously, I don't think you will find anyone here willing to help you with that kind of thing.
     
  3. Game64 macrumors member

    Joined:
    Jan 21, 2011
    Location:
    Las Vegas, NV
    #3

    Eh, I'll bite. First off, you're doing it wrong if you're considering PHP. You need to scrape via an event/async-based system, something like Python or Node.js. In fact, Node.js will be your best bet because of the tooling available for it.
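
    To illustrate the event/async idea in Python (one of the two options named above), here is a minimal sketch using only the standard library. The names `crawl`, `fetch_blocking`, `max_concurrent`, and `delay` are made up for this example and are not from any particular scraping tool. It runs blocking downloads in worker threads under an asyncio semaphore, so it is a stand-in for a fully non-blocking HTTP client rather than the real thing.

```python
import asyncio
import urllib.request


def fetch_blocking(url: str, timeout: float = 10.0) -> bytes:
    """Download one page with the standard library (a blocking call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


async def crawl(urls, fetch=fetch_blocking, max_concurrent=5, delay=0.0):
    """Fetch every URL, with at most max_concurrent downloads in flight."""
    sem = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(url):
        async with sem:
            try:
                # Run the blocking fetch in a thread so the event loop stays free.
                results[url] = await asyncio.to_thread(fetch, url)
            except Exception as exc:
                # Record the failure and keep going with the other pages.
                results[url] = exc
            # Optional polite pause before this slot is released.
            await asyncio.sleep(delay)

    await asyncio.gather(*(worker(u) for u in urls))
    return results
```

    Something like `asyncio.run(crawl(list_of_urls, max_concurrent=5, delay=1.0))` would return a dict mapping each URL to its body, or to the exception it raised. Keeping `max_concurrent` low and `delay` above zero is a choice to go easy on the target server; the right values depend on the site.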

    Secondly, if you're planning to scrape via a sitemap, you're doing it wrong. Sitemaps don't contain all the pages or data; they contain what the site wants you to see. Scrape the content; then you'll want to look at the site's robots.txt and sitemap.xml. Then you'll want to look at the Google results for "site:pagename.com" and compare them to the pages you've already found.
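
    As a rough sketch of the robots.txt and sitemap.xml step, the Python standard library can parse both. The helper names `sitemap_urls` and `allowed_urls` and the user agent string `example-scraper` are hypothetical, and the code assumes the sitemap uses the standard sitemaps.org namespace. It works on text you have already downloaded, so no network access is involved here.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def allowed_urls(urls, robots_txt: str, user_agent: str = "example-scraper"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

    The set of URLs found by crawling can then be compared against `sitemap_urls(...)` to see what the sitemap leaves out.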

    Thirdly, you'll want to keep track of all of the sites, pages, and data you've found. You'll need a robust database that can handle many concurrent reads and writes.
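
    One possible shape for that bookkeeping, sketched in Python with SQLite purely because it ships with the language. The table layout and the function names (`open_db`, `add_url`, `mark_done`, `pending`) are assumptions for illustration; a scraper with many concurrent writers would more likely want a server database, with the same idea of one row per page and a status column.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the tracking database and create the pages table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url        TEXT PRIMARY KEY,
               site       TEXT NOT NULL,
               status     TEXT NOT NULL DEFAULT 'pending',
               body       BLOB,
               fetched_at TEXT
           )"""
    )
    return conn


def add_url(conn, site: str, url: str) -> bool:
    """Queue a URL; returns False if it was already known."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO pages (url, site) VALUES (?, ?)", (url, site)
    )
    conn.commit()
    return cur.rowcount == 1


def mark_done(conn, url: str, body: bytes) -> None:
    """Store the downloaded body and mark the page as fetched."""
    conn.execute(
        "UPDATE pages SET status = 'done', body = ?, "
        "fetched_at = datetime('now') WHERE url = ?",
        (body, url),
    )
    conn.commit()


def pending(conn, site: str) -> list[str]:
    """List the URLs for a site that still need to be fetched."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE site = ? AND status = 'pending' "
        "ORDER BY url",
        (site,),
    )
    return [row[0] for row in rows]
```

    Using the URL as the primary key means a page discovered twice is only queued once, and an interrupted run can resume from `pending(...)`.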

    In the end you'll need some powerful machines or numerous virtual machines to properly scrape for data.
     
  4. elyrly macrumors newbie

    Joined:
    Jul 21, 2011
  5. 960design, Jan 8, 2014
    Last edited: Jan 8, 2014

    960design macrumors 68020

    Joined:
    Apr 17, 2012
    Location:
    Destin, FL
    #5
    I wrote a simple scraper way back when. 20k pages! Sheesh, I was just looking to pull the headlines and pop in a link on a few dozen pages.

    Definitely contact a lawyer (excellent advice above). The server you are scraping will be logging your activity. I second Node.js as a better option.
     
