
Utchin

macrumors newbie
Original poster
Jan 10, 2011
18
0
I want to set up a website that will scrape a number of websites, starting with one but with the option to expand to 5+. The first website has around 20,000 pages, and I have a sitemap of the pages I want to scrape.

I have my PHP code to scrape and do all the bits it needs to do. The question is: what is the best implementation?

I was looking at an Amazon EC2 instance and running the script there as a cron job, but cron seems rather old and outdated?

Anyone able to advise on a better method?
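
For what it's worth, cron isn't outdated for this kind of job. One common wrinkle with a 20,000-page run is a new invocation starting before the previous one finishes; a hedged sketch of a crontab entry (paths and schedule are hypothetical) that uses flock to prevent overlap:

```shell
# Hypothetical crontab entry: run the scraper nightly at 02:00,
# skipping the run if the previous one still holds the lock file.
0 2 * * * /usr/bin/flock -n /tmp/scraper.lock /usr/bin/php /var/www/scraper/scrape.php >> /var/log/scraper.log 2>&1
```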
 

gnasher729

Suspended
Nov 25, 2005
17,980
5,565

Ask a lawyer first. And seriously, I don't think you will find anyone here willing to help you with that kind of thing.
 

Game64

macrumors member
Jan 21, 2011
56
17
Las Vegas, NV

Eh, I'll bite. First off, you're doing it wrong if you're considering PHP. You want to scrape with an event-driven/async system, something like Python or NodeJS. In fact, NodeJS will be your best bet because its ecosystem already has the scraping tools built in.
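
To make the event/async point concrete, here's a minimal sketch in Python (one of the two options mentioned) using asyncio with a semaphore to cap concurrency. The `fetch` function is a stub standing in for a real HTTP client such as aiohttp; the URLs and the concurrency limit are illustrative, not anyone's actual setup.

```python
import asyncio

MAX_CONCURRENT = 10  # illustrative cap so 20k pages don't hit the site all at once

async def fetch(url: str) -> str:
    """Stub fetch: a real scraper would use an HTTP client such as aiohttp here."""
    await asyncio.sleep(0)          # yield to the event loop, as real I/O would
    return f"<html>{url}</html>"    # placeholder body

async def scrape_all(urls: list[str]) -> dict[str, str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded_fetch(url: str) -> tuple[str, str]:
        async with sem:             # at most MAX_CONCURRENT fetches in flight
            return url, await fetch(url)

    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)

pages = asyncio.run(scrape_all([f"https://example.com/page/{i}" for i in range(100)]))
```

The same shape carries over to NodeJS almost line for line (Promise.all plus a concurrency limiter); the point is that the event loop keeps hundreds of requests in flight without one thread per page.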

Secondly, if you're planning to scrape only via the sitemap, you're doing it wrong. Sitemaps don't contain all the pages or data; they contain what the site wants you to see. Scrape the content itself; then look at the site's robots.txt and sitemap.xml. After that, look at the Google results for "site:pagename.com" and compare them against the pages you've already found.
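
On the robots.txt point: Python's standard library can parse it directly. This sketch feeds RobotFileParser a hypothetical robots.txt body rather than fetching a real one, just to show the check a polite scraper should do before requesting any page:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- a real scraper would fetch the target
# site's /robots.txt and honour it before requesting anything else.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("MyScraper", "https://example.com/articles/1")   # True
blocked = rp.can_fetch("MyScraper", "https://example.com/private/x")    # False
```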

Thirdly, you'll want to keep track of all of the sites, pages, and data you've found. You'll need a solid database that can handle heavy concurrent reads and writes.
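
For the tracking part, even SQLite illustrates the core idea: a uniqueness constraint on the URL makes re-inserts idempotent, so overlapping crawls and sitemaps can't record the same page twice. A sketch with an in-memory database (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url     TEXT PRIMARY KEY,   -- uniqueness makes re-discovery a no-op
        site    TEXT NOT NULL,
        scraped INTEGER DEFAULT 0   -- 0 = found but not yet scraped
    )
""")

def record(url: str, site: str) -> None:
    # INSERT OR IGNORE: duplicates from overlapping crawls are silently dropped
    conn.execute("INSERT OR IGNORE INTO pages (url, site) VALUES (?, ?)", (url, site))

record("https://example.com/a", "example.com")
record("https://example.com/a", "example.com")   # duplicate, ignored
record("https://example.com/b", "example.com")

count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]   # 2
```

At the scale discussed in the thread you'd reach for a server database (PostgreSQL, MySQL) for the concurrent read/write load, but the dedup pattern is the same.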

In the end you'll need some powerful machines or numerous virtual machines to properly scrape for data.
 

960design

macrumors 68040
Apr 17, 2012
3,691
1,548
Destin, FL
I wrote a simple scraper way back when. 20k pages! Sheesh, I was just looking to pull the headlines and drop in links on a few dozen pages.

Definitely contact a lawyer (excellent advice above). The server you are scraping will be logging your activity. I second NodeJS as the better option.
 