Data mining - how do I parse specific info from HTML files?

Discussion in 'Mac Programming' started by glossywhite, Sep 7, 2009.

  1. glossywhite macrumors 65816

    Joined:
    Feb 28, 2008
    #1
    Hi there. I just started downloading the WHOLE of an online catalogue, by using:

    Code:
    wget -rH -Dserver.com http://www.server.com/
    
    (just an example URL)

    I now have a folder FULL of HTML pages, one for each product. I need to extract the same pieces of data from EVERY product page and write them to a CSV file.

    Can anyone tell me how I would do this, giving examples for someone who has SOME (mediocre) Unix command line experience but is not a "Guru"?

    Thanks
     
  2. angelwatt Moderator emeritus

    Joined:
    Aug 16, 2005
    Location:
    USA
    #2
    You could find a module for a language such as Perl that interacts with the DOM structure of the page and use that to parse out the data you need. Otherwise, you could use regular expressions to get at the information.
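
    As a rough illustration of the parser-module route, here is a minimal sketch in Python (swapped in for Perl purely as an example), using only the standard library. It uses the event-based html.parser module rather than a full DOM. The field choices are assumptions, since the real catalogue markup is not shown: it takes the product name from the first <h1> and the price from the first element with class="price". The names ProductParser, extract_product and folder_to_csv are hypothetical.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class ProductParser(HTMLParser):
    """Collect text from the first <h1> and the first class="price" element.

    Both selectors are assumptions; adjust them to the real page markup.
    """

    def __init__(self):
        super().__init__()
        self.fields = {"name": "", "price": ""}
        self._current = None  # field being captured right now, if any
        self._tag = None      # tag that opened the current capture

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return  # already inside a field; ignore nested tags
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and not self.fields["name"]:
            self._current, self._tag = "name", tag
        elif "price" in classes and not self.fields["price"]:
            self._current, self._tag = "price", tag

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self.fields[self._current] = self.fields[self._current].strip()
            self._current = None


def extract_product(html):
    """Return {"name": ..., "price": ...} for one page of HTML text."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def folder_to_csv(folder, out_path):
    """Write one CSV row per *.html file found under `folder`."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["file", "name", "price"])
        writer.writeheader()
        for page in sorted(Path(folder).rglob("*.html")):
            text = page.read_text(encoding="utf-8", errors="replace")
            writer.writerow({"file": page.name, **extract_product(text)})
```

    Calling folder_to_csv("www.server.com", "products.csv") would then walk the folder that wget created. The folder name here just follows the example URL in the first post.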
     
  3. glossywhite thread starter macrumors 65816

    Joined:
    Feb 28, 2008
    #3
    Document Object Model, right? As for Perl, forget that - I know nothing about it, and am unwilling to learn a new language just for one job. :D Thanks... could you explain how I would do what you suggest? Thank you
     
  4. angelwatt Moderator emeritus

    Joined:
    Aug 16, 2005
    Location:
    USA
    #4
    Yup, Document Object Model. Perl was just one example. I don't know what languages you know. I haven't needed to parse HTML like this, so I don't know of any good tools offhand. I have messed around some with XSLT, but it would require the HTML to be valid XML to work correctly, and most web sites are not.

    Regular expressions take some time to learn. Here's an example though.
    HTML:
    <h1>Heading</h1>
    <h2>Other</h2>
    <p>paragraph</p>
    Regex:
    Code:
    /<(h[1-6]).*?>(.*?)<\/\1>/gi
    That regex would capture all the headings on the page. The heading text would be stored in capture group 2 (the contents of the second set of parentheses); group 1 holds the tag name. There are many online resources for learning regular expressions. I even created an online regular expression testing tool. It has some resources at the bottom that you would want to look at as well if you want to try learning them.
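
    To show how a pattern like that could feed a CSV file, here is a minimal sketch in Python using only the standard library. Python is just an illustrative choice, and the function names headings and headings_to_csv are hypothetical. The pattern is the heading regex from above; Python has no /.../ delimiters, so the closing-tag slash needs no escaping, and the "g" and "i" flags become finditer and re.IGNORECASE.

```python
import csv
import io
import re

# Same idea as the heading regex above: group 1 is the tag name,
# group 2 is the text between the opening and closing tags.
HEADING_RE = re.compile(r"<(h[1-6]).*?>(.*?)</\1>", re.IGNORECASE)


def headings(html):
    """Return a list of (tag, text) pairs for every heading matched."""
    return [(m.group(1).lower(), m.group(2)) for m in HEADING_RE.finditer(html)]


def headings_to_csv(html):
    """Return CSV text with a header row and one row per heading."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["tag", "text"])
    writer.writerows(headings(html))
    return buf.getvalue()


sample = "<h1>Heading</h1>\n<h2>Other</h2>\n<p>paragraph</p>"
print(headings_to_csv(sample))
```

    As written, the pattern only matches a heading that sits on a single line, because "." does not match newlines unless re.DOTALL is added.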
     
