Discussion in 'Web Design and Development' started by Utchin, Dec 28, 2013.

    I want to set up a website that will scrape a number of websites, starting with one but with the option to expand to 5+. The first website has around 20,000 pages, and I have a sitemap of the pages I want to scrape.

    I have my PHP code to scrape and do all the bits it needs to do. The question is, what is the best way to implement this?

    I was looking at an Amazon EC2 instance and running the script on there as a cron job, but a cron job seems rather old and outdated?

    Is anyone able to advise on a better method?
    Ask a lawyer first. And seriously, I don't think you will find anyone here willing to help you with that kind of thing.
    Eh, I'll bite. First off, you're doing it wrong if you're considering PHP. You need to scrape via an event/async based system, something like Python or NodeJS. In fact, NodeJS will be your best bet because of the tooling available for it.
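
    To make the event/async idea concrete, here is a minimal sketch in Python (one of the two options named above). It is an illustration, not anyone's actual scraper: the names `crawl`, `fetch_with_urllib`, and `max_concurrency` are made up for this example, and the default fetcher simply pushes a blocking standard-library call onto a worker thread.

```python
import asyncio
import urllib.request
from typing import Awaitable, Callable, Dict, Iterable


async def fetch_with_urllib(url: str) -> str:
    """Hypothetical default fetcher: runs a blocking urllib call in a
    worker thread so the event loop stays free for other requests."""
    def _get() -> str:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    return await asyncio.to_thread(_get)


async def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]] = fetch_with_urllib,
    max_concurrency: int = 5,
) -> Dict[str, str]:
    """Fetch every URL with at most max_concurrency requests in flight."""
    limit = asyncio.Semaphore(max_concurrency)
    results: Dict[str, str] = {}

    async def worker(url: str) -> None:
        async with limit:
            try:
                results[url] = await fetch(url)
            except Exception as exc:
                # Record the failure and keep going with the other pages.
                results[url] = f"ERROR: {exc}"

    await asyncio.gather(*(worker(u) for u in urls))
    return results
```

    Calling `asyncio.run(crawl(list_of_urls))` would run the whole batch. Keeping `max_concurrency` small is a deliberate choice here, since the server being scraped sees every request.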

    Secondly, if you're planning to scrape via a sitemap, you're doing it wrong. Sitemaps don't contain all the pages or data. They contain what the site wants you to see. Scrape the content; then you'll want to look at the site's robots.txt and sitemap.xml. Then you'll want to look at the Google results of "" and compare that to the pages you've already found.
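
    As a rough sketch of the robots.txt and sitemap.xml step, the Python standard library can parse both. The function names and the `example-bot` user agent below are placeholders invented for this example; the sitemap namespace is the one from the sitemaps.org protocol.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def allowed_urls(
    robots_txt: str,
    urls: Iterable[str],
    user_agent: str = "example-bot",  # placeholder user agent
) -> List[str]:
    """Keep only the URLs that robots.txt permits for this user agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

    Both functions take the file contents as strings, so downloading robots.txt and sitemap.xml is left to whatever fetcher you already have.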

    Thirdly, you'll want to keep track of all of the sites, pages, and data you've found. You'll need a strong database that can handle many concurrent reads and writes.
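
    For the bookkeeping side, a minimal sketch of a page-tracking table might look like the following. SQLite is used only because it ships with Python; it is an assumption for illustration, and a setup with many workers writing at once would likely call for a server database instead. The table layout and function names are hypothetical.

```python
import sqlite3
from typing import Iterable, List


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the tracking database and create the pages table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url    TEXT PRIMARY KEY,
               site   TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               body   TEXT
           )"""
    )
    return conn


def add_urls(conn: sqlite3.Connection, site: str, urls: Iterable[str]) -> None:
    """Queue URLs for a site; URLs already in the table are skipped."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO pages (url, site) VALUES (?, ?)",
            [(u, site) for u in urls],
        )


def pending(conn: sqlite3.Connection, limit: int = 100) -> List[str]:
    """Return up to `limit` URLs that have not been scraped yet."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE status = 'pending' ORDER BY url LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


def mark_done(conn: sqlite3.Connection, url: str, body: str) -> None:
    """Store the scraped body and flag the page as done."""
    with conn:
        conn.execute(
            "UPDATE pages SET status = 'done', body = ? WHERE url = ?",
            (body, url),
        )
```

    The primary key on `url` is what stops the same page being queued twice when several sources (sitemap, crawl, search results) report it.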

    In the end you'll need some powerful machines or numerous virtual machines to properly scrape for data.
    I wrote a simple scraper way back when. 20k pages! Sheesh, I was just looking to pull the headlines and pop in a link on a few dozen pages.

    Definitely contact a lawyer (excellent advice above). The server you are scraping will be logging your activity. I second NodeJS as a better option.

