Setting up a web scraping system

Utchin · Dec 28, 2013

I want to set up a website that will scrape a number of websites, starting with one but with the option to expand to 5+. The first website has around 20,000 pages, and I have a sitemap of the pages I want to scrape.

I have my PHP code to scrape and do everything it needs to do. The question is: what is the best way to implement this?

I was looking at running the script on Amazon EC2 as a cron job, but cron seems rather old and outdated?

Is anyone able to advise on a better method?

gnasher729 · Dec 28, 2013

Ask a lawyer first. And seriously, I don't think you will find anyone here willing to help you with that kind of thing.

Game64 · Jan 1, 2014

Eh, I'll bite. First off, you're doing it wrong if you're considering PHP. You need to scrape via an event/async-based system, something like Python or NodeJS. In fact, NodeJS will be your best bet because it is event-driven out of the box and has tools built for this kind of work.
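For anyone wondering what the event/async approach looks like in practice, here is a minimal sketch in Python using only the standard library. The names `crawl`, `fetch_blocking`, `max_concurrent` and `delay` are made up for illustration, and it assumes Python 3.9+ for `asyncio.to_thread`. It caps how many requests are in flight at once and pauses after each one so the target server is not hammered.

```python
import asyncio
import urllib.request


def fetch_blocking(url, timeout=10.0):
    """Plain blocking download; swapped out for a fake in tests."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


async def crawl(urls, fetch=fetch_blocking, max_concurrent=5, delay=0.0):
    """Fetch every URL with at most `max_concurrent` requests in flight.

    Returns a dict mapping each URL to its body (bytes) or, on failure,
    to the exception that was raised, so one bad page does not stop the run.
    """
    sem = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(url):
        async with sem:
            try:
                # Run the blocking fetch in a thread so the event loop stays free.
                results[url] = await asyncio.to_thread(fetch, url)
            except Exception as exc:
                results[url] = exc
            # Hold the slot a little longer to stay polite to the server.
            await asyncio.sleep(delay)

    await asyncio.gather(*(worker(u) for u in urls))
    return results


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    pages = asyncio.run(crawl(["https://example.com/"], max_concurrent=2, delay=1.0))
    for url, body in pages.items():
        print(url, type(body).__name__)
```

Something like this could still be kicked off from cron; the scheduling and the fetching are separate concerns.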

Secondly, if you're planning to scrape via a sitemap, you're doing it wrong. Sitemaps don't contain all the pages or data. They contain what the site wants you to see. Scrape the content; then you'll want to look at the site's robots.txt and sitemap.xml. Then you'll want to look at the Google results for "site:pagename.com" and compare that to the pages you've already found.
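The robots.txt and sitemap comparison can be sketched roughly like this, again in Python with the standard library only. The helper names (`sitemap_urls`, `allowed_by_robots`, `compare`) are hypothetical, and the sketch assumes you have already downloaded the robots.txt and sitemap.xml text yourself. It does not cover the search-engine comparison step.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def allowed_by_robots(robots_text, url, agent="*"):
    """Check a URL against already-downloaded robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


def compare(crawled, sitemap):
    """Show which URLs appear in only one of the two sets."""
    return {
        "only_crawled": crawled - sitemap,
        "only_sitemap": sitemap - crawled,
    }
```

Pages that turn up in `only_crawled` are the ones the sitemap left out, which is the gap described above.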

Thirdly, you'll want to keep track of all of the sites, pages, and data you've found. You'll need a strong database that can handle many reads and writes at the same time.
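As a rough idea of the bookkeeping, here is a small sketch using SQLite from Python's standard library. SQLite is only a stand-in to show the shape of the data; with many workers writing at once you would likely want a server database instead. The table layout and the function names (`open_db`, `enqueue`, `mark_fetched`, `pending`) are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url        TEXT PRIMARY KEY,
    site       TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    fetched_at TEXT,
    body       BLOB
)
"""


def open_db(path=":memory:"):
    """Open (or create) the tracking database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, site, urls):
    """Record newly discovered URLs; ones already known are skipped."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO pages (url, site) VALUES (?, ?)",
            [(u, site) for u in urls],
        )


def mark_fetched(conn, url, body):
    """Store the downloaded body and flag the page as done."""
    with conn:
        conn.execute(
            "UPDATE pages SET status = 'done', fetched_at = datetime('now'), "
            "body = ? WHERE url = ?",
            (body, url),
        )


def pending(conn, limit=100):
    """Return up to `limit` URLs that still need fetching."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE status = 'pending' ORDER BY url LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]
```

Keeping a status column means a crashed run can pick up where it left off instead of starting the 20,000 pages again.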

In the end you'll need some powerful machines or numerous virtual machines to properly scrape for data.

elyrly · Jan 7, 2014

If you were to code in Ruby, check out Nokogiri.

960design · Jan 8, 2014

I wrote a simple scraper way back when. 20k pages! Sheesh, I was just looking to pull the headlines and pop in a link on a few dozen pages.

Definitely contact a lawyer (excellent advice above). The server you are scraping will be logging your activity. I second NodeJS as a better option.
