How to crawl a website to obtain a database of information?

Discussion in 'Web Design and Development' started by gmanist1000, Aug 5, 2013.

  1. gmanist1000 macrumors 68030

    gmanist1000

    Joined:
    Sep 22, 2009
    #1
I have very little knowledge of this, so please be as specific as possible.

    I am looking for a way to crawl a website and extract information from their database.

Example 1: I wish to write a script to look at IMDB and gather information about upcoming movie releases in December 2013. Then, I want that information to be written somewhere so I can post it on my own website.

    Example 2: I wish to write a script to look at IGN.com and gather information about upcoming game releases. Then, I want that information to be written somewhere so I can post it on my own website.

    Any ideas out there? I don't know where to begin, nor have I tried doing this before.
     
  2. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #2
Crawlers are bots that systematically browse some or all pages of a site in order to copy and analyze its content. Search engines use crawlers to gather data for indexing, for example.

Based on your description, you should explore API usage and cURL/PHP methods, in my opinion.

    API:

Allows you (the client) to query and extract data from the web site (the server) and then format it for import into your own site where desired. These days most APIs are XML- or JSON-oriented and intended to work across platforms. Some are better than others, of course, and documentation can be tricky.

    IMDB API info: http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api
    IGN V3 API video/info: http://www.slideshare.net/lobster1234/igns-v3-api
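To make the API route concrete, here is a minimal PHP sketch of consuming a JSON response. The endpoint shown in the comment and the field names (`title`, `release_date`) are hypothetical — check the actual API's documentation for its real URL and response format; a hard-coded sample response is used here so the parsing step is clear.

```php
<?php
// In real use you would fetch the JSON over HTTP, e.g.:
//   $json = file_get_contents('https://api.example.com/movies?month=2013-12');
// (hypothetical URL). Here we decode a hard-coded sample response instead.
$json = '[
  {"title": "Example Movie One", "release_date": "2013-12-06"},
  {"title": "Example Movie Two", "release_date": "2013-12-20"}
]';

$movies = json_decode($json, true);   // true => associative arrays, not objects
if ($movies === null) {
    die('Failed to decode JSON: ' . json_last_error_msg());
}

// Write the extracted data wherever you need it -- here we just print it.
foreach ($movies as $movie) {
    echo $movie['title'] . ' opens on ' . $movie['release_date'] . "\n";
}
```

Once decoded, `$movies` is an ordinary PHP array, so inserting it into your own database or templating it onto a page is straightforward.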

    CURL:

cURL is a command-line tool and library (libcurl) that uses HTTP (among other protocols) to send and retrieve data between you (the client) and a remote web site (the server), much like a browser, and can be used for crawling. Many languages offer wrapper support that makes using cURL extremely easy. cURL supports single or parallel connections and lets you retrieve page content programmatically so you can parse (extract via code) what you need. On a side note, PHP/cURL is amazing and powerful, and the resulting page (HTML) can be returned as a string for easy parsing.

    http://www.php.net/manual/en/curl.examples-basic.php
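Here is a hedged sketch of that PHP/cURL approach: one function fetches a page body as a string, another parses headings out of the HTML with `DOMDocument`. The target URL is a placeholder, and the choice of `<h2>` elements is an assumption — you would inspect the real page's markup to decide what to extract. Always check a site's terms of service and robots.txt before crawling it.

```php
<?php
// Fetch a page body as a string using the cURL extension.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0'); // identify your bot
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Extract the text of every <h2> element from an HTML string.
function extract_headings($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);  // suppress warnings from imperfect real-world HTML
    $headings = array();
    foreach ($doc->getElementsByTagName('h2') as $node) {
        $headings[] = trim($node->textContent);
    }
    return $headings;
}

// Demo on a small embedded sample so the parsing step is visible without
// hitting a live site. In real use: $html = fetch_page('http://example.com/');
$sample = '<html><body><h2>Upcoming Game A</h2><h2>Upcoming Game B</h2></body></html>';
print_r(extract_headings($sample));
```

Keeping the fetch and the parse in separate functions matters in practice: the parsing is the part that breaks when the site changes its markup, so you want to be able to re-test it against saved HTML without re-crawling.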

    Note:

APIs are generally more secure and use less bandwidth, although some throttle connections. APIs send and receive structured data: you send and get only what the API allows, and they are platform independent. If the server's page changes format, the API code most likely will not need to change. cURL is easy to use and no different from browsing to a site, but some hosts will block you if you abuse connections, and you will constantly need to adjust your code to account for major changes on the server side, which are inevitable over time.

    Hope this helps you.

    Cheers.

    :cool:
     